Dissertations / Theses on the topic 'Data mining'


1

Mrázek, Michal. "Data mining." Master's thesis, Vysoké učení technické v Brně. Fakulta strojního inženýrství, 2019. http://www.nusl.cz/ntk/nusl-400441.

Abstract:
The aim of this master’s thesis is the analysis of multidimensional data. Three dimensionality reduction algorithms are introduced, and it is shown how to process text documents using basic methods of natural language processing. The goal of the practical part of the thesis is to process real-world data from an internet forum. Posted messages are transformed into a numerical representation, then projected into two-dimensional space and visualized. Later on, the topics of the messages are discovered. In the last part, a few selected algorithms are compared.
2

Payyappillil, Hemambika. "Data mining framework." Morgantown, W. Va. : [West Virginia University Libraries], 2005. https://etd.wvu.edu/etd/controller.jsp?moduleName=documentdata&jsp%5FetdId=3807.

Abstract:
Thesis (M.S.)--West Virginia University, 2005
Title from document title page. Document formatted into pages; contains vi, 65 p. : ill. (some col.). Includes abstract. Includes bibliographical references (p. 64-65).
3

Abedjan, Ziawasch. "Improving RDF data with data mining." Phd thesis, Universität Potsdam, 2014. http://opus.kobv.de/ubp/volltexte/2014/7133/.

Abstract:
Linked Open Data (LOD) comprises a great many, often large, public data sets and knowledge bases. Those datasets are mostly presented in the RDF triple structure of subject, predicate, and object, where each triple represents a statement or fact. Unfortunately, the heterogeneity of available open data requires significant integration steps before it can be used in applications. Meta information, such as ontological definitions and exact range definitions of predicates, is desirable and ideally provided by an ontology. However, in the context of LOD, ontologies are often incomplete or simply not available. Thus, it is useful to automatically generate meta information, such as ontological dependencies, range definitions, and topical classifications. Association rule mining, which was originally applied for sales analysis on transactional databases, is a promising and novel technique to explore such data. We designed an adaptation of this technique for mining RDF data and introduce the concept of “mining configurations”, which allows us to mine RDF data sets in various ways. Different configurations enable us to identify schema and value dependencies that in combination result in interesting use cases. To this end, we present rule-based approaches for auto-completion, data enrichment, ontology improvement, and query relaxation. Auto-completion remedies the problem of inconsistent ontology usage, providing an editing user with a sorted list of commonly used predicates. A combination of different configurations extends this approach to create completely new facts for a knowledge base. We present two approaches for fact generation: a user-based approach, where a user selects the entity to be amended with new facts, and a data-driven approach, where an algorithm discovers entities that have to be amended with missing facts. As knowledge bases constantly grow and evolve, another approach to improve the usage of RDF data is to improve existing ontologies. 
Here, we present an association-rule-based approach to reconcile ontology and data. Interlacing different mining configurations, we derive an algorithm to discover synonymously used predicates. Those predicates can be used to expand query results and to support users during query formulation. We provide a wide range of experiments on real-world datasets for each use case. The experiments and evaluations show the added value of association rule mining for the integration and usability of RDF data and confirm the appropriateness of our mining configuration methodology.
Linked Open Data (LOD) comprises many, often very large, public datasets and knowledge bases, which are mainly represented in the RDF triple structure of subject, predicate, and object. Each triple represents a fact. Unfortunately, the heterogeneity of the available public data requires significant integration steps before the data can be used in applications. Metadata such as ontological structures and range definitions of predicates are desirable and would ideally be provided by a knowledge base. In the context of LOD, however, knowledge bases are often incomplete or simply not available. It is therefore useful to be able to generate meta information automatically, such as ontological dependencies, range and domain definitions, and topical associations of resources. A new and promising technique for exploring such data is based on the discovery of association rules, which was originally applied to sales analysis in transactional databases. We designed an adaptation of this technique for RDF data and introduce the concept of mining configurations, which enables us to detect patterns in RDF data in different ways. Different configurations allow us to identify schema and value relationships that can be used for interesting applications. To this end, we present association-based approaches for predicate suggestion, data completion, ontology improvement, and query relaxation. Predicate suggestion addresses the problem of inconsistent ontology usage by proposing a sorted list of suitable predicates to a user who wants to add a new fact to an RDF dataset. A combination of different configurations extends this approach so that completely new facts for a knowledge base are generated automatically. 
We present two approaches for this: a user-driven approach, in which a user selects the entity to be extended, and a data-driven approach, in which an algorithm itself selects the entities to be extended with missing facts. Because knowledge bases constantly grow and change, another way to ease the use of RDF data is to improve ontologies. Here, we present an association-rule-based approach that reconciles the data with the underlying ontologies. By interlacing different configurations, we derive a new algorithm that discovers synonymous predicates. These predicates can be used to expand the results of a query or to support a user during query formulation. For each of the presented applications, we provide a wide range of experiments on real-world datasets. The experiments and evaluations show the added value of association rule generation for the integration and usability of RDF data and confirm the appropriateness of our configuration-based methodology for deriving such rules.
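The predicate auto-completion idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes one particular configuration (each subject is a transaction, its predicates are the items), considers only rules with a single predicate on each side, and uses a hypothetical function name, `suggest_predicates`, and a hypothetical `min_confidence` parameter.

```python
from collections import Counter
from itertools import combinations


def suggest_predicates(triples, entity_predicates, min_confidence=0.5):
    """Rank predicates an entity does not use yet by association-rule confidence.

    Assumed configuration (illustrative only): every subject is one
    transaction whose items are the predicates it appears with.
    """
    # Build one transaction (set of predicates) per subject.
    transactions = {}
    for subject, predicate, _obj in triples:
        transactions.setdefault(subject, set()).add(predicate)

    single = Counter()  # support count of each predicate
    pair = Counter()    # support count of each ordered predicate pair
    for preds in transactions.values():
        single.update(preds)
        for a, b in combinations(sorted(preds), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1

    # Score each candidate b by the best rule "known -> b".
    scores = {}
    for known in entity_predicates:
        for (a, b), count in pair.items():
            if a != known or b in entity_predicates:
                continue
            confidence = count / single[a]
            if confidence >= min_confidence:
                scores[b] = max(scores.get(b, 0.0), confidence)

    # Highest confidence first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy triples with placeholder object values.
triples = [
    ("Berlin", "population", "v1"), ("Berlin", "country", "v2"), ("Berlin", "mayor", "v3"),
    ("Paris", "population", "v4"), ("Paris", "country", "v5"),
    ("Rome", "population", "v6"), ("Rome", "country", "v7"), ("Rome", "mayor", "v8"),
    ("Einstein", "birthPlace", "v9"),
]
print(suggest_predicates(triples, {"population"}))
```

For an entity that so far only has `population`, the sketch ranks `country` first (confidence 1.0) and `mayor` second (confidence 2/3), because all three subjects with `population` also have `country` and two of them have `mayor`.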
4

Liu, Tantan. "Data Mining over Hidden Data Sources." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1343313341.

5

Taylor, Phillip. "Data mining of vehicle telemetry data." Thesis, University of Warwick, 2015. http://wrap.warwick.ac.uk/77645/.

Abstract:
Driving is a safety-critical task that requires a high level of attention and workload from the driver. Despite this, people often perform secondary tasks such as eating or using a mobile phone, which increase workload levels and divert cognitive and physical attention from the primary task of driving. As well as these distractions, the driver may also be overloaded for other reasons, such as dealing with an incident on the road or holding conversations in the car. One solution to this distraction problem is to limit the functionality of in-car devices while the driver is overloaded. This can take the form of withholding an incoming phone call or delaying the display of a non-urgent piece of information about the vehicle. In order to design and build these adaptations in the car, we must first have an understanding of the driver's current level of workload. Traditionally, driver workload has been monitored using physiological sensors or camera systems in the vehicle. However, physiological systems are often intrusive, and camera systems can be expensive and are unreliable in poor light conditions. It is important, therefore, to use methods that are non-intrusive, inexpensive and robust, such as sensors already installed on the car and accessible via the Controller Area Network (CAN) bus. This thesis presents a data mining methodology for this problem, as well as for others in domains with similar types of data, such as human activity monitoring. It focuses on the variable selection stage of the data mining process, where inputs are chosen for models to learn from and make inferences. Selecting inputs from vehicle telemetry data is challenging because there are many irrelevant variables with a high level of redundancy. Furthermore, data in this domain often contain biases because only relatively small amounts can be collected and processed, leading to some variables appearing more relevant to the classification task than they really are. 
Over the course of this thesis, a detailed variable selection framework that addresses these issues for telemetry data is developed. A novel blocked permutation method is developed and applied to mitigate biases when selecting variables from potentially biased temporal data. This approach is computationally infeasible when variable redundancies are also considered, and so a novel permutation redundancy measure with similar properties is proposed. Finally, a known redundancy structure between features in telemetry data is used to enhance the feature selection process in two ways. First, the benefits of performing raw signal selection, feature extraction, and feature selection in different orders are investigated. Second, a two-stage variable selection framework is proposed and the two permutation-based methods are combined. Throughout the thesis, it is shown through classification evaluations and inspection of the features that these permutation-based selection methods are appropriate for use in selecting features from CAN-bus data.
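The abstract above does not spell out the blocked permutation method, so the following is only a generic sketch of the underlying idea: when samples are temporally dependent, a null distribution for a variable's relevance can be built by shuffling contiguous blocks of the label sequence rather than individual samples. The relevance score (absolute Pearson correlation), the function names, and the parameters `block_size`, `n_permutations` and `seed` are all assumptions made for illustration.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def blocked_permutation_pvalue(feature, labels, block_size,
                               n_permutations=200, seed=0):
    """Estimate how often block-shuffled labels score at least as high."""
    rng = random.Random(seed)
    observed = abs_correlation(feature, labels)
    # Cut the label sequence into contiguous blocks so that the
    # within-block temporal structure survives each shuffle.
    blocks = [labels[i:i + block_size]
              for i in range(0, len(labels), block_size)]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(blocks)
        permuted = [value for block in blocks for value in block]
        if abs_correlation(feature, permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

A small returned value suggests the variable's score is unlikely to arise from block-level structure alone; a constant (uninformative) variable always yields 1.0 in this sketch.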
6

Sherikar, Vishnu Vardhan Reddy. "I2MAPREDUCE: DATA MINING FOR BIG DATA." CSUSB ScholarWorks, 2017. https://scholarworks.lib.csusb.edu/etd/437.

Abstract:
This project is an extension of i2MapReduce: Incremental MapReduce for Mining Evolving Big Data. i2MapReduce performs incremental big data processing using a fine-grained incremental engine and a general-purpose iterative model that covers iterative algorithms such as PageRank, Fuzzy C-Means (FCM), Generalized Iterated Matrix-Vector Multiplication (GIM-V), and Single-Source Shortest Path (SSSP). The main purpose of this project is to reduce input/output overhead, to avoid the cost of re-computation, and to avoid stale data mining results. Finally, the performance of i2MapReduce is analyzed by comparing the resultant graphs.
7

Zhang, Nan. "Privacy-preserving data mining." College Station, Tex. : Texas A&M University, 2006. http://hdl.handle.net/1969.1/ETD-TAMU-1080.

8

Hulten, Geoffrey. "Mining massive data streams /." Thesis, Connect to this title online; UW restricted, 2005. http://hdl.handle.net/1773/6937.

9

Büchel, Nina. "Faktorenvorselektion im Data Mining /." Berlin : Logos, 2009. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=019006997&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.

10

Shao, Junming. "Synchronization Inspired Data Mining." Diss., lmu, 2011. http://nbn-resolving.de/urn:nbn:de:bvb:19-137356.

11

Wang, Xiaohong. "Data mining with bilattices." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ59344.pdf.

12

Knobbe, Arno J. "Multi-relational data mining /." Amsterdam [u.a.] : IOS Press, 2007. http://www.loc.gov/catdir/toc/fy0709/2006931539.html.

13

丁嘉慧 and Ka-wai Ting. "Time sequences: data mining." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B31226760.

14

Wan, Chang, and 萬暢. "Mining multi-faceted data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hdl.handle.net/10722/197527.

Abstract:
Multi-faceted data contain different types of objects and relationships between them. With the rapid growth of web-based services, multi-faceted data are increasing (e.g. Flickr, Yago, IMDB), which offers us richer information to infer users’ preferences and provide them better services. In this study, we look at two types of multi-faceted data, social tagging systems and heterogeneous information networks, and at how to improve services such as resource retrieval and classification on them. In social tagging systems, resources such as images and videos are annotated with descriptive words called tags. It has been shown that tag-based resource searching and retrieval is much more effective than content-based retrieval. With the advances in mobile technology, many resources are also geo-tagged with location information. We observe that a traditional tag (word) can carry different semantics at different locations. We study how location information can be used to help distinguish the different semantics of a resource’s tags and thus to improve retrieval accuracy. Given a search query, we propose a location-partitioning method that partitions all locations into regions such that the user query carries distinguishing semantics in each region. Based on the identified regions, we utilize location information in estimating the ranking scores of resources for the given query. These ranking scores are learned using the Bayesian Personalized Ranking (BPR) framework. Two algorithms, namely LTD and LPITF, which apply Tucker Decomposition and Pairwise Interaction Tensor Factorization, respectively, for modeling the ranking score tensor, are proposed. Through experiments on real datasets, we show that LTD and LPITF outperform other tag-based resource retrieval methods. A heterogeneous information network (HIN) is used to model objects of different types and their relationships. Meta-paths are sequences of object types. 
They are used to represent complex relationships between objects beyond what links in a homogeneous network capture. We study the problem of classifying objects in an HIN. We propose class-level meta-paths and study how they can be used to (1) build more accurate classifiers and (2) improve active learning in identifying objects for which training labels should be obtained. We show that class-level meta-paths and object classification exhibit interesting synergy. Our experimental results show that the use of class-level meta-paths results in very effective active learning and good classification performance in HINs.
Computer Science
Master
Master of Philosophy
15

García-Osorio, César. "Data mining and visualization." Thesis, University of Exeter, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.414266.

16

Wang, Grant J. (Grant Jenhorn) 1979. "Algorithms for data mining." Thesis, Massachusetts Institute of Technology, 2006. http://hdl.handle.net/1721.1/38315.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.
Includes bibliographical references (p. 81-89).
Data of massive size are now available in a wide variety of fields and come with great promise. In theory, these massive data sets allow data mining and exploration on a scale previously unimaginable. However, in practice, it can be difficult to apply classic data mining techniques to such massive data sets due to their sheer size. In this thesis, we study three algorithmic problems in data mining with consideration to the analysis of massive data sets. Our work is both theoretical and experimental - we design algorithms and prove guarantees for their performance and also give experimental results on real data sets. The three problems we study are: 1) finding a matrix of low rank that approximates a given matrix, 2) clustering high-dimensional points into subsets whose points lie in the same subspace, and 3) clustering objects by pairwise similarities/distances.
by Grant J. Wang.
Ph.D.
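The first of the three problems named in the abstract above, low-rank matrix approximation, has a standard textbook formulation, given here only as background. It assumes the Frobenius norm as the error measure; the thesis may study a different norm or faster approximate algorithms. For a matrix $A \in \mathbb{R}^{m \times n}$ with singular value decomposition $A = \sum_i \sigma_i u_i v_i^{\top}$ (singular values in decreasing order) and a target rank $k$:

```latex
\[
  A_k \;\in\; \operatorname*{arg\,min}_{B \,:\, \operatorname{rank}(B) \le k}
  \lVert A - B \rVert_F ,
  \qquad
  A_k \;=\; \sum_{i=1}^{k} \sigma_i \, u_i v_i^{\top} ,
  \qquad
  \lVert A - A_k \rVert_F^{2} \;=\; \sum_{i > k} \sigma_i^{2} .
\]
```

This is the Eckart–Young theorem: truncating the SVD after the $k$ largest singular values gives a best rank-$k$ approximation, and the squared error equals the sum of the discarded squared singular values.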
17

Anwar, Muhammad Naveed. "Data mining of audiology." Thesis, University of Sunderland, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.573120.

Abstract:
This thesis describes the data mining of a large set of patient records from the hearing aid clinic at James Cook University Hospital in Middlesbrough, UK. As is typical of medical data in general, these audiology records are heterogeneous, containing three different types of data: (1) audiograms (graphs of hearing ability at different frequencies); (2) structured tabular data (such as gender, date of birth and diagnosis); and (3) unstructured text (specific observations made about each patient in a free-text or comment field). This audiology data set is unique, as it contains records of patients prescribed with both ITE and BTE hearing aids. ITE hearing aids are not generally available on the British National Health Service in England, as they are more expensive than BTE hearing aids. However, both types of aids are prescribed at James Cook University Hospital, which is also an important feature of this data. There are two research questions: (1) Which factors influence the choice of ITE (in the ear) as opposed to BTE (behind the ear) hearing aids? (2) For patients diagnosed with tinnitus (ringing in the ear), which factors influence the decision whether to fit a tinnitus masker (a gentle sound source, worn like a hearing aid, designed to drown out tinnitus)? A number of data mining techniques, such as clustering of audiograms, association analysis of variables (such as age, gender, diagnosis, masker, mould and free-text keywords) using contingency tables, and principal component analysis on audiograms, were used to find candidate variables to be combined into a decision support system (DSS). Unseen patient records are presented to the system, which returns the relative likelihood that a patient should be fitted with an ITE as opposed to a BTE aid, or with a tinnitus masker as opposed to no masker. 
The DSS was created using logistic regression, naive Bayesian analysis and Bayesian networks, and these systems were tested using 5-fold cross-validation to see which of the techniques produced the better results. The advantage of these techniques for the combination of evidence is that it is easy to see which variables contributed to the final decision. The constructed models and the data behind them were validated by presenting them to the principal audiologist, Dr. Robertshaw, at James Cook University Hospital in Middlesbrough for comments and suggestions for improvements. The techniques developed in this thesis for the construction of prediction models were also used successfully on a different audiology data set from Malaysia. These decisions are typically made by audiology technicians working in the out-patient clinics, on the basis of audiogram results and in consultation with the patients. In many cases, the choice is clear cut, but at other times the technicians might benefit from a second opinion given by an automatic system with an explanation of how that second opinion was arrived at.
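The 5-fold cross-validation mentioned in the abstract above can be sketched as follows. This shows only the generic splitting step, with a hypothetical helper name `k_fold_indices`; it assumes the records are already in random order (no shuffling or stratification is done) and says nothing about how the thesis actually partitioned its data.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k-fold cross-validation.

    Each sample lands in exactly one test fold; fold sizes differ by
    at most one when n_samples is not divisible by k.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Each candidate model would be fitted on `train` and scored on `test`,
# and the five scores averaged before comparing the techniques.
for train, test in k_fold_indices(12):
    print(len(train), len(test))
```

With 12 records, the test folds have sizes 3, 3, 2, 2 and 2, and every record is held out exactly once.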
18

Santos, José Carlos Almeida. "Mining protein structure data." Master's thesis, FCT - UNL, 2006. http://hdl.handle.net/10362/1130.

Abstract:
The principal topic of this work is the application of data mining techniques, in particular machine learning, to the discovery of knowledge in a protein database. The first chapter presents a general background. Namely, section 1.1 gives an overview of the methodology of a data mining project and its main algorithms. Section 1.2 gives an introduction to proteins and their supporting file formats. The chapter concludes with section 1.3, which defines the main problem we intend to address with this work: determining whether an amino acid is exposed or buried in a protein, in a discrete (i.e., not continuous) way, for five exposure levels: 2%, 10%, 20%, 25% and 30%. The second chapter, following the CRISP-DM methodology closely, presents the whole process of constructing the database that supported this work. Namely, the process of loading data from the Protein Data Bank, DSSP and SCOP is described. Then an initial data exploration is performed and a simple prediction model (baseline) of the relative solvent accessibility of an amino acid is introduced. The Data Mining Table Creator, a program developed to produce the data mining tables required for this problem, is also introduced. In the third chapter, the results obtained are analyzed with statistical significance tests. Initially, the classifiers used (Neural Networks, C5.0, CART and CHAID) are compared, and it is concluded that C5.0 is the most suitable for the problem at stake. The influence of parameters such as the amino acid information level, the amino acid window size and the SCOP class type on the accuracy of the predictive models is also compared. The fourth chapter starts with a brief review of the literature on amino acid relative solvent accessibility. Then, we give an overview of the main results achieved and finally discuss possible future work. The fifth and last chapter consists of appendices. Appendix A has the schema of the database that supported this thesis. 
Appendix B has a set of tables with additional information. Appendix C describes the software provided in the DVD accompanying this thesis that allows the reconstruction of the present work.
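The discrete exposed/buried labelling described in the abstract above can be sketched in a few lines. The five levels come from the abstract; everything else is an assumption made for illustration: the function name `exposure_labels`, the input being relative solvent accessibility expressed in percent, and the convention that a residue at or above a level counts as exposed.

```python
# Exposure levels in percent, as listed in the abstract.
EXPOSURE_LEVELS = (2, 10, 20, 25, 30)


def exposure_labels(rsa_percent, levels=EXPOSURE_LEVELS):
    """Turn one relative solvent accessibility value into binary labels.

    Assumed convention: a residue is "exposed" at a level if its
    relative solvent accessibility is at least that level.
    """
    return {level: ("exposed" if rsa_percent >= level else "buried")
            for level in levels}


print(exposure_labels(22.0))
```

A residue with 22% relative solvent accessibility is labelled exposed at the 2%, 10% and 20% levels and buried at the 25% and 30% levels, which gives one binary classification task per level.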
19

Garda-Osorio, Cesar. "Data mining and visualisation." Thesis, University of the West of Scotland, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.742763.

20

Rawles, Simon Alan. "Object-oriented data mining." Thesis, University of Bristol, 2007. http://hdl.handle.net/1983/c13bda2c-75c9-4bfa-b86b-04ac06ba0278.

Abstract:
Attempts to overcome limitations in the attribute-value representation for machine learning have led to much interest in learning from structured data, concentrated in the research areas of inductive logic programming (ILP) and multi-relational data mining (MRDM). The expressiveness and encapsulation of the object-oriented data model have led to its widespread adoption in software and database design. The considerable congruence between this model and individual-centred models in inductive logic programming presents new opportunities for mining object data specific to its domain. This thesis investigates the use of object-orientation in knowledge representation for multi-relational data mining. We propose a language for expressing object model metaknowledge and use it to extend the reasoning mechanisms of an object-oriented logic. A refinement operator is then defined and used for feature search in an object-oriented propositionalisation-based ILP classifier. An algorithm is proposed for reducing the large number of redundant features typical in propositionalisation. A data mining system based on the refinement operator is implemented and demonstrated on a real-world computational linguistics task and compared with a conventional ILP system. Keywords: Object orientation; data mining; inductive logic programming; propositionalisation; refinement operators; feature reduction
21

Mao, Shihong. "Comparative Microarray Data Mining." Wright State University / OhioLINK, 2007. http://rave.ohiolink.edu/etdc/view?acc_num=wright1198695415.

22

Novák, Petr. "Data mining časových řad." Master's thesis, Vysoká škola ekonomická v Praze, 2009. http://www.nusl.cz/ntk/nusl-72068.

23

Blunt, Gordon. "Mining credit card data." Thesis, n.p, 2002. http://ethos.bl.uk/.

24

Niggemann, Oliver. "Visual data mining of graph based data." [S.l. : s.n.], 2001. http://deposit.ddb.de/cgi-bin/dokserv?idn=962400505.

25

Li, Liangchun. "Web-based data visualization for data mining." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp03/MQ35845.pdf.

26

Al-Hashemi, Idrees Yousef. "Applying data mining techniques over big data." Thesis, Boston University, 2013. https://hdl.handle.net/2144/21119.

Abstract:
Thesis (M.S.C.S.)
The rapid development of information technology in recent decades means that data appear in a wide variety of formats: sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 petabytes stored in the world in 2000. Today’s internet has about 0.1 zettabytes of data (1 ZB is about 10^21 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data, known in today’s parlance as Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and the A-priori algorithm with Hadoop/MapReduce on a 5-node Hadoop cluster. We explore NoSQL databases for semi-structured, massively large-scale data by using MongoDB as an example. Finally, we compare the performance of HDFS (Hadoop Distributed File System) and MongoDB data storage for these two algorithms.
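How K-means fits the MapReduce model mentioned in the abstract above can be sketched in plain Python. This is not the thesis's Hadoop code: it runs in a single process, and the function names (`map_phase`, `reduce_phase`, `kmeans`) are hypothetical. It only shows the usual division of labour, where the map step emits a (nearest centroid, point) pair for each point and the reduce step averages the points collected under each centroid.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))


def map_phase(points, centroids):
    """Map step: emit one (centroid index, point) pair per input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce step: replace each centroid by the mean of its assigned points."""
    groups = defaultdict(list)
    for centroid_id, point in pairs:
        groups[centroid_id].append(point)
    updated = list(centroids)  # centroids with no points stay unchanged
    for centroid_id, members in groups.items():
        updated[centroid_id] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return updated


def kmeans(points, centroids, iterations=10):
    """Repeat map and reduce until the centroids stop changing."""
    for _ in range(iterations):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, [(0, 0), (10, 10)]))
```

On this toy input the centroids converge to (0.0, 1.0) and (10.0, 11.0). In a real Hadoop job each iteration would be a separate MapReduce job, with the current centroids distributed to all mappers.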
27

Zhou, Wubai. "Data Mining Techniques to Understand Textual Data." FIU Digital Commons, 2017. https://digitalcommons.fiu.edu/etd/3493.

Abstract:
More than ever, online information delivery and storage rely heavily on text. Billions of texts are produced every day in the form of documents, news, logs, search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Text understanding is a fundamental and essential task involving broad research topics, and contributes to many applications in areas such as text summarization, search engines, recommendation systems, online advertising, and conversational bots. However, understanding text is never a trivial task for computers, especially for noisy and ambiguous text such as logs and search queries. This dissertation mainly focuses on textual understanding tasks derived from two domains, disaster management and IT service management, that mainly use textual data as an information carrier. Improving situation awareness in disaster management and alleviating the human effort involved in IT service management demand more intelligent and efficient solutions for understanding the textual data acting as the main information carrier in the two domains. From the perspective of data mining, four directions are identified: (1) intelligently generating a storyline summarizing the evolution of a hurricane from a relevant online corpus; (2) automatically recommending resolutions according to the textual symptom description in a ticket; (3) gradually adapting the resolution recommendation system for time-correlated features derived from text; (4) efficiently learning distributed representations for short and lousy ticket symptom descriptions and resolutions. Provided with different types of textual data, the data mining techniques proposed in those four research directions successfully address our tasks of understanding and extracting valuable knowledge from those textual data. My dissertation will address the research topics outlined above. 
Concretely, I will focus on designing and developing data mining methodologies to better understand textual information, including (1) a storyline generation method for efficient summarization of natural hurricanes based on a crawled online corpus; (2) a recommendation framework for automated ticket resolution in IT service management; (3) an adaptive recommendation system on time-varying, temporally correlated features derived from text; (4) a deep neural ranking model that not only successfully recommends resolutions but also efficiently outputs distributed representations for ticket descriptions and resolutions.
28

KAVOOSIFAR, MOHAMMAD REZA. "Data Mining and Indexing Big Multimedia Data." Doctoral thesis, Politecnico di Torino, 2019. http://hdl.handle.net/11583/2742526.

29

Adderly, Darryl M. "Data mining meets e-commerce using data mining to improve customer relationship management /." [Gainesville, Fla.]: University of Florida, 2002. http://purl.fcla.edu/fcla/etd/UFE0000500.

30

Vithal, Kadam Omkar. "Novel applications of Association Rule Mining- Data Stream Mining." AUT University, 2009. http://hdl.handle.net/10292/826.

Abstract:
Since its advent, association rule mining has become one of the most researched areas of data exploration. In recent years, applying association rule mining methods to extract rules from a continuous flow of voluminous data, known as a data stream, has generated immense interest due to emerging applications such as network-traffic analysis and sensor-network data analysis. For such application domains, the ability to process an enormous amount of stream data in a single pass is critical.
31

Patel, Akash. "Data Mining of Process Data in Multivariable Systems." Thesis, KTH, Skolan för elektro- och systemteknik (EES), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-201087.

Abstract:
Performing system identification experiments in order to model control plants in industrial processes can be costly and time consuming. Therefore, with increasingly more computational power available and abundant access to logged historical data from plants, data mining algorithms have become more appealing. This thesis focuses on evaluating a data mining algorithm for multivariate processes, where the mined data can potentially be used for system identification. The first part of the thesis explores the effect that many of the necessary user-chosen parameters have on the algorithm's performance. To this end, a GUI designed to assist in parameter selection is developed. The second part of the thesis evaluates the proposed algorithm's performance by modelling a simulated process based on intervals found by the algorithm. The results show that the algorithm is particularly sensitive to the choice of cut-off frequencies in the bandpass filter, the threshold of the reciprocal condition number, and the Laguerre filter order. It is also shown that, with the GUI, it is possible to select parameters such that the algorithm performs satisfactorily and mines data relevant for system identification. Finally, the results show that it is possible to use the mined data to model a simulated process with good accuracy using system identification techniques.
Modelling control systems in industrial processes by means of system identification experiments can be both costly and time consuming. Increased access to large volumes of stored historical data and to processing power has therefore sparked great interest in data mining algorithms. This thesis focuses on the evaluation of a data mining algorithm for multivariable processes, where the mined data segments can potentially be used for system identification. The first part of the thesis explores the effect that the algorithm's many parameters have on its performance. To simplify the choice of parameters, a user interface was developed. The second part of the thesis evaluates the algorithm's performance by modelling a simulated process based on the mined data segments. The results show that the algorithm is particularly sensitive to the choice of cut-off frequencies in the bandpass filter, the threshold value for the reciprocal condition number, and the order of the Laguerre filter. The results also show that, with the developed user interface, it is possible to choose parameter values that yield satisfactory mined data segments. Finally, it can be concluded that a simulated process can be modelled with high accuracy using the data segments mined by the algorithm.
32

Cordeiro, Robson Leonardo Ferreira. "Data mining in large sets of complex data." Universidade de São Paulo, 2011. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-22112011-083653/.

Abstract:
Due to the increasing amount and complexity of the data stored in enterprises' databases, the task of knowledge discovery is nowadays vital to support strategic decisions. However, the mining techniques used in the process usually have high computational costs that come from the need to explore several alternative solutions, in different combinations, to obtain the desired knowledge. The most common mining tasks include data classification, labeling and clustering, outlier detection and missing data prediction. Traditionally, the data are represented by numerical or categorical attributes in a table that describes one element in each tuple. Although the same tasks applied to traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity and the computational costs associated with handling large amounts of these complex data increase considerably, making most of the existing techniques impractical. Therefore, special data mining techniques for this kind of data need to be developed. This Ph.D. work focuses on the development of new data mining techniques for large sets of complex data, especially for the task of clustering, which is tightly associated with other data mining tasks that are performed together. Specifically, this doctoral dissertation presents three novel, fast and scalable data mining algorithms well suited to analyzing large sets of complex data: the method Halite for correlation clustering; the method BoW for clustering Terabyte-scale datasets; and the method QMAS for labeling and summarization. Our algorithms were evaluated on real, very large datasets with up to billions of complex elements, and they always presented highly accurate results, being at least one order of magnitude faster than the fastest related works in almost all cases.
The real data used come from the following applications: automatic breast cancer diagnosis, satellite imagery analysis, and graph mining on a large web graph crawled by Yahoo! and also on the graph with all users and their connections from the Twitter social network. These results indicate that our algorithms allow the development of real-time applications that, potentially, could not be developed without this Ph.D. work, such as software to aid the diagnosis process on the fly in a worldwide Healthcare Information System, or a system to look for deforestation within the Amazon Rainforest in real time.
The growth in the amount and complexity of data stored in organizations makes knowledge extraction with mining techniques a task that is both fundamental for making good use of these data in strategic decision making and computationally expensive. The cost comes from the need to explore a large number of case studies, in different combinations, to obtain the desired knowledge. Traditionally, the data to be explored are represented as numerical or categorical attributes in a table that describes, in each tuple, one test case of the set under analysis. Although the same tasks developed for traditional data are also necessary for more complex data, such as images, graphs, audio and long texts, the complexity of the analyses and the computational cost involved increase significantly, making most current analysis techniques unfeasible when applied to large amounts of such complex data. Special mining techniques must therefore be developed. This doctoral work aims to create new mining techniques for large databases of complex data. Specifically, two new clustering techniques and one new labeling and summarization technique were developed, all of them fast, scalable and well suited to the analysis of large databases of complex data. The proposed techniques were evaluated on real databases at the terabyte scale, containing up to billions of complex objects, and they always produced high-quality results, being in almost all cases at least one order of magnitude faster than the most efficient related works. The real data used come from the following applications: automatic breast cancer diagnosis, satellite image analysis, and graph mining applied to a large web graph collected by Yahoo! and to a graph with all users of the Twitter social network and their connections.
These results indicate that our algorithms enable the creation of real-time applications that potentially could not be developed without this doctoral work, such as a global-scale system to aid medical diagnosis in real time, or a system to search for deforested areas in the Amazon Rainforest in real time.
33

Xiao, Xin. "Data Mining Techniques for Complex User-Generated Data." Doctoral thesis, Politecnico di Torino, 2016. http://hdl.handle.net/11583/2644046.

Abstract:
Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments for effectively analyzing these large data collections and extracting hidden and useful knowledge. Vast amounts of User-Generated Data (UGD) are being created every day, covering user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise in the UGD analysis process, such as large dataset cardinality and dimensionality, variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., the health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches in discovering useful knowledge from different domains.
34

Tong, Suk-man Ivy. "Techniques in data stream mining." Click to view the E-thesis via HKUTO, 2005. http://sunzi.lib.hku.hk/hkuto/record/B34737376.

35

Borgelt, Christian. "Data mining with graphical models." [S.l. : s.n.], 2000. http://deposit.ddb.de/cgi-bin/dokserv?idn=962912107.

36

Weber, Irene. "Suchraumbeschränkung für relationales Data Mining." [S.l. : s.n.], 2004. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB11380447.

37

Maden, Engin. "Data Mining On Architecture Simulation." Master's thesis, METU, 2010. http://etd.lib.metu.edu.tr/upload/2/12611635/index.pdf.

Abstract:
Data mining is the process of extracting patterns from large volumes of data. One of its branches is the mining of sequence data, where the data can be viewed as a sequence of events, each with an associated time of occurrence. Sequence data is modelled using episodes, which are made up of events. The aim of this thesis is to analyse architecture simulation output data by applying episode mining techniques, to show the previously known relationships between the events in an architecture, and to provide an environment for predicting the performance of a program on an architecture before the code is executed. One of the most important points here is the application area of episode mining techniques. Architecture simulation data is a new domain for these techniques, and using their results to predict the performance of programs on an architecture before execution can be considered a new approach. For this purpose, a data mining tool has been developed by implementing three episode mining techniques: the WINEPI approach, the non-overlapping occurrence based approach and the MINEPI approach. The tool has three main components: a data pre-processor, an episode miner and an output analyser.
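To make the window-based idea behind WINEPI concrete, the following is a minimal sketch, not the tool described in the abstract. It assumes integer timestamps, events sorted by time, and a serial episode (event types that must occur in the given order inside one window). The function names and the event labels `cache_miss` and `stall` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, int]  # (event type, integer timestamp)


def occurs_serially(window_types: Sequence[str], episode: Sequence[str]) -> bool:
    """Return True if `episode` is a subsequence of `window_types`."""
    remaining = iter(window_types)
    # Each target must be found after the position of the previous match.
    return all(any(t == target for t in remaining) for target in episode)


def window_frequency(events: List[Event], episode: Sequence[str], width: int) -> float:
    """Fraction of sliding windows of the given width that contain the episode.

    Windows are half-open intervals [start, start + width). They start early
    enough that the first event appears in `width` windows, and end with the
    window that begins at the last event.
    """
    if not events:
        return 0.0
    first_time, last_time = events[0][1], events[-1][1]
    starts = range(first_time - width + 1, last_time + 1)
    hits = 0
    for start in starts:
        in_window = [etype for etype, t in events if start <= t < start + width]
        if occurs_serially(in_window, episode):
            hits += 1
    return hits / len(starts)


# Hypothetical simulation trace: a cache miss followed by a pipeline stall.
trace = [("cache_miss", 1), ("stall", 2), ("cache_miss", 4), ("stall", 5)]
freq = window_frequency(trace, ("cache_miss", "stall"), width=2)
```

An episode would be reported as frequent when this fraction reaches a user-chosen threshold. The sketch rescans the event list for every window, which is simple but slow; a practical implementation would update the window incrementally.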
38

Drwal, Maciej. "Data mining in distributed computer systems." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5709.

39

Thun, Julia, and Rebin Kadouri. "Automating debugging through data mining." Thesis, KTH, Data- och elektroteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-203244.

Abstract:
Contemporary technological systems generate massive quantities of log messages. These messages can be stored, searched and visualized efficiently using log management and analysis tools. The analysis of log messages offers insights into system behavior such as performance, server status and execution faults in web applications. iStone AB wants to explore the possibility of automating its debugging process. Since iStone does most of its debugging manually, it takes time to find errors within the system. The aim was therefore to find different solutions to reduce the time it takes to debug. Log messages within access and console logs were analysed so that the most appropriate data mining techniques for iStone's system could be chosen. Data mining algorithms and log management and analysis tools were compared. The comparisons showed that the ELK Stack as well as a mixture of Eclat and a hybrid algorithm (Eclat and Apriori) were the most appropriate choices. To demonstrate their feasibility, the ELK Stack and Eclat were implemented. The results show that data mining and the use of a platform for log analysis can facilitate debugging and reduce the time it takes.
Today's systems generate large volumes of log messages. These messages can be stored, searched and visualized efficiently using log management tools. Analysis of log messages gives insight into system behavior such as performance, server status and execution faults that can arise in web applications. iStone AB wants to investigate the possibility of automating debugging. Since iStone mostly performs its debugging manually, it takes time to find faults in the system. The aim was therefore to find different solutions that reduce the time it takes to debug. An analysis of log messages in access and console logs was carried out in order to choose the most suitable data mining techniques for iStone's system. Data mining algorithms and log management tools were compared. The comparisons showed that the ELK Stack and a mixture of Eclat and a hybrid algorithm (Eclat and Apriori) were the most suitable choices. To demonstrate this, the ELK Stack and Eclat were implemented. The results show that data mining and the use of a log analysis platform can facilitate debugging and reduce the time it takes.
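For readers unfamiliar with Eclat, which the abstract above names as the implemented algorithm, here is a minimal sketch of its core idea: store, for each item, the set of transaction ids that contain it, and grow itemsets depth-first by intersecting those sets. This is an illustration only and not the authors' implementation. The log event labels and the use of an absolute count for `min_support` are assumptions.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def eclat(transactions: Iterable[Iterable[str]], min_support: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least `min_support` transactions."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets: Dict[str, Set[int]] = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)

    frequent: Dict[FrozenSet[str], int] = {}

    def extend(prefix: Tuple[str, ...], candidates: List[Tuple[str, Set[int]]]) -> None:
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[frozenset(itemset)] = len(tids)
            # Intersect with the remaining candidates to build longer itemsets.
            narrowed = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    narrowed.append((other, common))
            if narrowed:
                extend(itemset, narrowed)

    singles = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), singles)
    return frequent


# Hypothetical log sessions, each reduced to the set of event labels it contains.
sessions = [
    {"timeout", "db_error"},
    {"timeout", "db_error", "retry"},
    {"timeout", "retry"},
    {"login"},
]
patterns = eclat(sessions, min_support=2)
```

Because support is computed from the size of an intersection, the algorithm never rescans the raw transactions after the first pass, which is the main contrast with Apriori's level-wise counting.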
40

Rahman, Sardar Muhammad Monzurur. "Data Mining Using Neural Networks." RMIT University. Electrical & Computer Engineering, 2006. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080813.094814.

Abstract:
Data mining is the search for relationships and global patterns in large databases that are increasing in size. It is beneficial for anyone who has a huge amount of data, for example customer and business data, or transaction, marketing, financial, manufacturing and web data. The results of data mining are also referred to as knowledge in the form of rules, regularities and constraints. Rule mining is one of the popular data mining methods, since rules provide concise statements of potentially important information that end users can easily understand, as well as actionable patterns. Rule mining has received a good deal of attention and enthusiasm from data mining researchers because it is capable of solving many data mining problems such as classification, association, customer profiling, summarization, segmentation and many others. This thesis makes several contributions by proposing rule mining methods using genetic algorithms and neural networks. The thesis first proposes rule mining methods using a genetic algorithm. These methods are based on an integrated framework but are capable of mining three major classes of rules. Moreover, the rule mining processes in these methods are controlled by tuning two data mining measures, support and confidence. The thesis shows how to build predictive data mining models using the rules produced by the proposed methods. Another key contribution of the thesis is the proposal of rule mining methods using supervised neural networks. The thesis mathematically analyses the Widrow-Hoff learning algorithm of a single-layered neural network, which provides a foundation for rule mining algorithms using single-layered neural networks. Three rule mining algorithms using single-layered neural networks are proposed for the three major classes of rules on the basis of the proposed theorems. The thesis also looks at the problem of rule mining where user guidance is absent.
The thesis proposes a guided rule mining system to overcome this problem. It extends this work further by comparing the performance of the algorithm used in the proposed guided rule mining system with the Apriori data mining algorithm. Finally, the thesis studies the Kohonen self-organizing map as an unsupervised neural network for rule mining algorithms. Two approaches are adopted, depending on how self-organizing maps are applied in the rule mining models. In the first approach, the self-organizing map is used for clustering, which provides class information to the rule mining process. In the second approach, automated rule mining takes the place of trained neurons as the map grows in a hierarchical structure.
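The two measures named in the abstract, support and confidence, have standard definitions that a short example makes concrete. The sketch below uses those textbook definitions on a toy transaction list; it does not reproduce the thesis's genetic-algorithm or neural-network methods, and the data and function names are invented for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    left = frozenset(antecedent)
    both = left | frozenset(consequent)
    left_support = support(transactions, left)
    if left_support == 0:
        return 0.0
    return support(transactions, both) / left_support


# Toy market-basket data (hypothetical).
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Raising the minimum support threshold keeps only rules that apply to many records, while raising the minimum confidence keeps only rules that are rarely violated, which is why tuning the two together controls how many rules a miner returns.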
41

Guo, Shishan. "Data mining in crystallographic databases." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0012/NQ52854.pdf.

42

Sun, Wenyi. "Data mining extension for economics." Diss., Columbia, Mo. : University of Missouri-Columbia, 2006. http://hdl.handle.net/10355/5869.

Abstract:
Thesis (M.S.)--University of Missouri-Columbia, 2006.
The entire dissertation/thesis text is included in the research.pdf file; the official abstract appears in the short.pdf file (which also appears in the research.pdf); a non-technical general description, or public abstract, appears in the public.pdf file. Title from title screen of research.pdf file (viewed on September ) Vita. Includes bibliographical references.
43

Papadatos, George. "Data mining for lead optimisation." Thesis, University of Sheffield, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.556989.

Abstract:
The recurring theme of this thesis is the application of diverse data mining and chemoinformatics techniques to structural and experimental property data, and particularly to data produced during the stage of drug discovery called lead optimisation. The work reported here seeks to provide more than one rational answer to the real-life issues routinely facing medicinal chemists. The thesis is divided into three parts: In the first part, several methodologies are described which facilitate the automatic mining of temporal, hierarchical lead optimisation data from the archives. Then, these data are appropriately used to provide informative visualisations, with regard to the exploration of chemical space, both locally (i.e. on a chemical array level) and globally (i.e. in the whole project). Finally, several ways of assessing the progress of a particular lead optimisation project are investigated. The second part of the thesis compares and assesses the relative merits of two computational methods that quantify the neighbourhood behaviour of a descriptor. The main conclusions of this part are two-fold: firstly, the optimality criterion method is demonstrated to be a suitable way to select descriptors for the systematic exploration of chemical space during array-based lead optimisation; secondly, regarding the actual neighbourhood behaviour performance exhibited by twelve types of fingerprints, it is shown that circular-based ones perform consistently better than the others and, notably, at a much lower similarity threshold. The third part focuses on explicit structural transformations between molecular pairs and their impact on properties such as hERG channel blocking, solubility and lipophilicity. More importantly, the study investigates the context of a transformation and its role on the impact of a particular modification. 
Using substructural descriptors to represent the context of a transformation, and considering both the local and the global environment, several context-sensitive cases are identified and rationalised. Overall, it is demonstrated that the inclusion of contextual information can enhance the predictive power of matched molecular pair analysis. Several context-sensitive examples are also identified in publicly available data.
44

Rice, Simon B. "Text data mining in bioinformatics." Thesis, University of Manchester, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.488351.

45

Lin, Zhenmin. "Privacy Preserving Distributed Data Mining." UKnowledge, 2012. http://uknowledge.uky.edu/cs_etds/9.

Abstract:
Privacy preserving distributed data mining aims to design secure protocols that allow multiple parties to conduct collaborative data mining while protecting data privacy. My research focuses on the design and implementation of privacy preserving two-party protocols based on homomorphic encryption. I present new results in this area, including new secure protocols for basic operations and two fundamental privacy preserving data mining protocols. I propose a number of secure protocols for basic operations in the additive secret-sharing scheme based on homomorphic encryption. I derive a basic relationship between a secret number and its shares, with which I develop efficient protocols for secure comparison and for secure division with a public divisor. I also design a secure inverse square root protocol based on Newton's iterative method and hence propose a solution to the secure square root problem. In addition, I propose a secure exponential protocol based on Taylor series expansions. All these protocols are implemented using secure multiplication and can be used to develop privacy preserving distributed data mining protocols. In particular, I develop efficient privacy preserving protocols for two fundamental data mining tasks: multiple linear regression and EM clustering. Both protocols work for arbitrarily partitioned datasets. The two-party privacy preserving linear regression protocol is provably secure in the semi-honest model, and the EM clustering protocol discloses only the number of iterations. I provide a proof-of-concept implementation of these protocols in C++, based on the Paillier cryptosystem.
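The abstract's remark that an inverse square root can be built from Newton's iterative method, and a square root from that, can be illustrated in ordinary, non-secure arithmetic. The sketch below is only the numerical recurrence, not the secure protocol: it is meant to show that each step uses nothing beyond multiplication, subtraction and scaling by a public constant, which is presumably what makes the recurrence convenient to express with a secure multiplication primitive. The starting guess and iteration count are assumptions.

```python
def inverse_sqrt_newton(x: float, y0: float, iterations: int = 20) -> float:
    """Approximate 1/sqrt(x) with the Newton recurrence y <- y * (3 - x*y*y) / 2.

    The recurrence converges when the starting guess satisfies
    0 < y0 < sqrt(3 / x); no division by a secret value is needed.
    """
    y = y0
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) * 0.5
    return y


def sqrt_via_inverse(x: float, y0: float, iterations: int = 20) -> float:
    """Recover sqrt(x) as x * (1/sqrt(x)), using one extra multiplication."""
    return x * inverse_sqrt_newton(x, y0, iterations)


inv_root = inverse_sqrt_newton(4.0, y0=0.1)
root = sqrt_via_inverse(4.0, y0=0.1)
```

In a secret-shared setting each product in the loop would be replaced by a call to the secure multiplication protocol, and the fixed iteration count avoids leaking information through a data-dependent stopping rule.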
46

Tong, Suk-man Ivy, and 湯淑敏. "Techniques in data stream mining." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2005. http://hub.hku.hk/bib/B34737376.

47

Luo, Man. "Data mining and classical statistics." Virtual Press, 2004. http://liblink.bsu.edu/uhtbin/catkey/1304657.

Abstract:
This study presents an overview of data mining. It suggests that methods derived from classical statistics are an integral part of data mining. However, there are substantial differences between the two areas. Classical statistical models and non-statistical models used in data mining, such as regression trees and artificial neural networks, are presented to emphasize their distinct approaches to extracting information from data. In summary, this research provides some background on data mining and the role that classical statistics plays in it.
Department of Mathematical Sciences
48

Cai, Zhongming. "Technical aspects of data mining." Thesis, Cardiff University, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.395784.

49

Shioda, Romy 1977. "Integer optimization in data mining." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/17579.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Sloan School of Management, Operations Research Center, 2003.
Includes bibliographical references (p. 103-107).
While continuous optimization methods have been widely used in statistics and data mining over the last thirty years, integer optimization has had very limited impact in statistical computation. Thus, our objective is to develop a methodology that uses state-of-the-art integer optimization methods to exploit the discrete character of data mining problems. The thesis consists of two parts. The first part presents a mixed-integer optimization method for classification and regression that we call Classification and Regression via Integer Optimization (CRIO). CRIO separates data points into different polyhedral regions. In classification each region is assigned a class, while in regression each region has its own distinct regression coefficients. Computational experimentation with real data sets shows that CRIO is comparable to and often outperforms the current leading methods in classification and regression. The second part describes our cardinality-constrained quadratic mixed-integer optimization algorithm, used to solve subset selection in regression and portfolio selection in asset allocation. We take advantage of the special structures of these problems by implementing a combination of implicit branch-and-bound, Lemke's pivoting method, variable deletion and problem reformulation. Testing against popular heuristic methods and CPLEX 8.0's quadratic mixed-integer solver, we see that our tailored approach to these quadratic variable selection problems has significant advantages over simple heuristics and generalized solvers.
by Romy Shioda.
Ph.D.
50

Lo, Ya-Chin, and 羅雅琴. "Data mining in bioinformatics -- NCBI tools for data mining." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/38227591029165701821.

Abstract:
Master's
Providence University (靜宜大學)
Graduate Institute of Information Management
92
Bioinformatics is a new, growing area of science that uses computational approaches to answer biological questions. With the explosion of sequence and structural information available to researchers, the field of bioinformatics is playing an increasingly large role in the study of fundamental biomedical problems. From a functional point of view, bioinformatics concerns the representation, storage, and distribution of data. Data mining refers to the process of searching through a large volume of data stored in a database to discover interesting and useful information that was previously unknown. Bioinformatics provides opportunities for developing different data mining methods. Data mining will play an increasingly important role in the analysis and discovery of sequence, structure and functional patterns or models from large sequence databases. NCBI provides large-scale informatics systems that will support scientific inquiry well into the future. The mission of the NCBI is to develop new information technologies to aid in the understanding of fundamental molecular and genetic processes that control health and disease. In this thesis, we enumerate several kinds of data mining tools commonly used at NCBI and introduce their characteristics and basic methods of operation. This helps clarify how data mining is applied in bioinformatics and the considerable potential for its further development in the field.
