Dissertations / Theses on the topic 'Structured data'

To see the other types of publications on this topic, follow the link: Structured data.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 dissertations / theses for your research on the topic 'Structured data.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Amornsinlaphachai, Pensri. "Updating semi-structured data." Thesis, Northumbria University, 2007. http://nrl.northumbria.ac.uk/3422/.

Full text
Abstract:
The Web has had a tremendous success with its support for the rapid and inexpensive exchange of information. A considerable body of data exchange is in the form of semi-structured data such as the eXtensible Markup Language (XML). XML, an effective standard to represent and exchange semi-structured data on the Web, is used ubiquitously in almost all areas of information technology. Most researchers in the XML area have concentrated on storing, querying and publishing XML while not many have paid attention to updating XML; thus the XML update area is not fully developed. We propose a solution for updating XML as a representation of semi-structured data. XML is updated through an object-relational database (ORDB) to exploit the maturity of the relational engine and the newer object features of the OR technology. The engine is used to enforce constraints during the updating of the XML whereas the object features are used to handle the XML hierarchical structure. Updating XML via an ORDB makes it easier to join XML documents in an update, and in turn joins of XML documents make it possible to keep non-redundant data in multiple XML documents. This thesis contributes a solution for the update of XML documents via an ORDB to advance our understanding of the XML update area. Rules for mapping XML structure and constraints to an ORDB schema are presented and a mechanism to handle XML cardinality constraints is provided. An XML update language, an extension to XQuery, has been designed and this language is translated into standard SQL executed on an ORDB. To handle the recursive nature of XML, a recursive function updating XML data is translated into SQL commands equipped with a programming capability. A method is developed to reflect the changes from the ORDB to the XML documents. A prototype of the solution has been implemented to help validate our approach. An experimental study to evaluate the performance of XML update processing based on the prototype has been conducted. The experimental results show that updating multiple XML documents storing non-redundant data yields better performance than updating a single XML document storing redundant data; an ORDB can take advantage of this by caching data to a greater extent than a native XML database. The solution of updating XML documents via an ORDB can solve some problems in existing update methods as follows. Firstly, the preservation of XML constraints is handled by the ORDB engine. Secondly, non-redundant data is stored in linked XML documents; thus the problems of data inconsistency and low performance caused by data redundancy are solved. Thirdly, joins of XML documents are converted to joins of tables in SQL. Fourthly, fields or tables involved in regular path expressions can be resolved quickly by using mapping data. Finally, a recursive function is translated into SQL commands equipped with a programming capability.
APA, Harvard, Vancouver, ISO, and other styles
2

Yang, Lei. "Querying Graph Structured Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=case1410434109.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Al-Wasil, Fahad M. "Querying distributed heterogeneous structured and semi-structured data sources." Thesis, Cardiff University, 2007. http://orca.cf.ac.uk/56144/.

Full text
Abstract:
The continuing growth and widespread popularity of the internet means that the collection of useful data available for public access is rapidly increasing both in number and size. These data are spread over distributed heterogeneous data sources like traditional databases or sources of various forms containing unstructured and semi-structured data. Obviously, the value of these data sources would in many cases be greatly enhanced if the data they contain could be combined and queried in a uniform manner. The research work reported in this dissertation is concerned with querying and integrating a multiplicity of distributed heterogeneous structured data residing in relational databases and semi-structured data held in well-formed XML documents produced by internet applications or coded by hand. In particular, we have addressed the problems of: (1) specifying the mappings between a global schema and the local data sources' schemas, and resolving the heterogeneity which can occur between data models, schemas or schema concepts; (2) processing queries that are expressed on a global schema into local queries. We have proposed an approach to combine and query the data sources through a mediation layer. Such a layer is intended to establish and evolve an XML Metadata Knowledge Base (XMKB) incrementally which assists the Query Processor in mediating between user queries posed over the global schema and the queries on the underlying distributed heterogeneous data sources. It translates such queries into sub-queries, called local queries, which are appropriate to each local data source. The XMKB is built in a bottom-up fashion by extracting and merging incrementally the metadata of the data sources. It holds the data sources' information (names, types and locations), descriptions of the mappings between the global schema and the participating data source schemas, and function names for handling semantic and structural discrepancies between the representations. To demonstrate our research, we have designed and implemented a prototype system called SISSD (System to Integrate Structured and Semi-structured Databases). The system automatically creates a GUI tool for meta-users (who do the metadata integration) which they use to describe mappings between the global schema and local data source schemas. These mappings are used to produce the XMKB. The SISSD allows the translation of user queries into sub-queries fitting each participating data source, by exploiting the mapping information stored in the XMKB. The major results of the thesis are: (1) an approach that facilitates building structured and semi-structured data integration systems; (2) a method for generating mappings between global and local schemas' paths, and resolving the conflicts caused by the heterogeneity of the data sources, such as naming, structural and semantic conflicts, which may occur between the schemas; (3) a method for translating queries in terms of a global schema into sub-queries in terms of local schemas. Hence, the presented approach shows that: (a) mapping of the schemas' paths can only be partially automated, since the logical heterogeneity problems need to be resolved by human judgment based on the application requirements; (b) querying distributed heterogeneous structured and semi-structured data sources is possible.
APA, Harvard, Vancouver, ISO, and other styles
4

Su, Wei. "Motif Mining On Structured And Semi-structured Biological Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1365089538.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Tripney, Brian Grieve. "Data value storage for compressed semi-structured data." Thesis, University of Strathclyde, 2012. http://oleg.lib.strath.ac.uk:80/R/?func=dbin-jump-full&object_id=18962.

Full text
Abstract:
Growing user expectations of anywhere, anytime access to information require new types of data transfer to be considered. While semi-structured data is a common data exchange format, its verbose nature makes files of this type too large to be transferred quickly, especially where only a small part of that data is required by the user. There is consequently a need to develop new models of data storage to support the sharing of small segments of semi-structured data, as existing XML compressors require the transfer of the entire compressed structure as a whole. This thesis examines the potential for bisimilarity-based partitioning (i.e. the grouping of items with similar structural patterns) to be combined with dictionary compression methods to produce a data storage model that remains directly accessible for query processing whilst facilitating the sharing of individual data segments. The use of dictionary compression is shown to compare favourably against Huffman-type compression, especially with regard to real-world data sets, while a study of the effects of differing types of bisimilarity upon the storage of data values identified the use of both forwards and backwards bisimilarity as the most promising basis for a dictionary-compressed structure. Having employed the above in a combined storage model, a query strategy is detailed which takes advantage of the compressed structure to reduce the number of data segments that must be accessed (and therefore transferred) to answer a query. A method to remove redundancy within the data dictionaries is also described and shown to have a positive effect in terms of disk space usage.
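To make the value-dictionary idea above concrete, here is a minimal Python sketch (illustrative only, not code from the thesis) of dictionary compression for the values in one structural partition: each distinct value is stored once and the leaf data are replaced by small integer codes.

```python
# Illustrative sketch only -- not code from the cited thesis.
# Dictionary compression of the data values in one structural partition:
# each distinct value is stored once and replaced by a small integer code.

def dictionary_encode(values):
    """Return (dictionary, codes) for a list of leaf values."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Recover the original value sequence from the dictionary and codes."""
    return [dictionary[c] for c in codes]

if __name__ == "__main__":
    partition = ["red", "blue", "red", "red", "green", "blue"]
    d, c = dictionary_encode(partition)
    assert dictionary_decode(d, c) == partition
    print(d, c)   # ['red', 'blue', 'green'] [0, 1, 0, 0, 2, 1]
```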
APA, Harvard, Vancouver, ISO, and other styles
6

Mintram, Robert C. "Vector representations of structured data." Thesis, Southampton Solent University, 2002. http://ssudl.solent.ac.uk/624/.

Full text
Abstract:
The connectionist approach to creating vector representations (VREPs) of structured data is usually implemented by artificial neural network (ANN) architectures. ANNs are trained on a representative corpus and can then demonstrate some degree of generalization to novel data. In this context, structured data are typically trees, the leaf nodes of which are assigned some n-element (often binary) vector representation. The strategy used to encode the leaf data and the width of the consequent vectors can have an impact on the encoding performance of the ANN architecture. In this thesis the architecture of principal interest is called simplified recursive auto-associative memory, (S)RAAM, which was devised to provide a theoretical model for another architecture called recursive auto-associative memory, RAAM. Research continues in RAAMs in terms of improving their learning ability, understanding the features that are encoded and improving generalization. (S)RAAM is a mathematical model that lends itself more readily to addressing these issues. Usually ANNs designed to encode structured data will, as a result of training, simultaneously create an encoder function to transform the data into vectors and a decoder function to perform the reverse transformation. (S)RAAM as a model of this process was designed to follow this paradigm. It is shown that this is not strictly necessary and that encoder and decoder functions can be created at separate times, their connection being maintained by the data upon which they operate. This leads to a new, more versatile model called, in this thesis, the General Encoder Decoder, GED. The GED, like (S)RAAM, is implemented as an algorithm rather than a neural network architecture. The thesis contends that the broad scope of the GED model makes it a versatile experimental vehicle supporting research into key properties of VREPs. In particular these properties include the strategy used to encode the leaf tokens within tree structures and the features of these structures that are preferentially encoded.
APA, Harvard, Vancouver, ISO, and other styles
7

Zhang, Chiyuan Ph D. Massachusetts Institute of Technology. "Deep learning and structured data." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/115643.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 135-150).
In recent years deep learning has seen successful applications in many different domains such as visual object recognition, detection and segmentation, automatic speech recognition, natural language processing, and reinforcement learning. In this thesis, we investigate deep learning from a spectrum of different perspectives. First of all, we study the question of generalization, which is one of the most fundamental notions in machine learning theory. We show how, in the regime of deep learning, the characterization of generalization becomes different from the conventional one, and propose alternative ways to approach it. Moving from theory to more practical perspectives, we show two different applications of deep learning. One originates from a real-world problem of automatic geophysical feature detection from seismic recordings to help oil and gas exploration; the other is motivated by a computational neuroscientific modeling and study of the human auditory system. More specifically, we show how deep learning can be adapted to work well with the unique structures associated with the problems from different domains. Lastly, we move to the computer system design perspective, and present our efforts in building better deep learning systems to allow efficient and flexible computation in both academic and industrial settings.
by Chiyuan Zhang.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
8

Pan, Jiajun. "Metric learning for structured data." Thesis, Nantes, 2019. http://www.theses.fr/2019NANT4076.

Full text
Abstract:
Metric learning is a branch of representation learning in machine learning. We summarize the development and current state of metric learning algorithms for both flat and non-flat databases. For the series of algorithms based on the Mahalanobis distance for flat databases, which fail to make full use of the intersection of three or more dimensions, we propose a metric learning algorithm based on the submodular function. To address the lack of metric learning algorithms for relational databases among non-flat databases, we propose LSCS (Relational Link-Strength Constraints Selection), which selects constraints for metric learning algorithms with side information, and MRML (Multi-Relation Metric Learning), which sums the loss from relationship constraints and label constraints. Experiments and verification on real databases show that the proposed algorithms outperform current algorithms.
APA, Harvard, Vancouver, ISO, and other styles
9

Qiao, Shi. "QUERYING GRAPH STRUCTURED RDF DATA." Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1447198654.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Fok, Lordique (Lordique S.). "Techniques for structured data discovery." Thesis, Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/121671.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 63-64).
The discovery of structured data, or data that is tagged by key-value pairs, is a problem that can be subdivided into two issues: how best to structure information architecture and user interaction for discovery; and how to intelligently display data in a way that optimizes the discovery of "useful" (i.e. relevant and helpful for a user's current use case) data. In this thesis, I investigate multiple methods of addressing both issues, and the results of evaluating these methods qualitatively and quantitatively. Specifically, I implement and evaluate: a novel interface design which combines different aspects of existing interfaces, two methods of diversifying data subsets given a search query, three methods of incorporating relevance in data subsets given a search query and information about the user's historic queries, a novel method of visualizing structured data, and two methods of inducing hierarchy on structured data in the presence of a partial data schema. These implementations and evaluations are shown to be effective in structuring information architecture and user interaction for structured data discovery, but are only partially effective in intelligently displaying data to optimize discovery of useful structured data.
by Lordique Fok.
M. Eng.
M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
11

Miner, Andrew S. "Data structures for the analysis of large structured Markov models." W&M ScholarWorks, 2000. https://scholarworks.wm.edu/etd/1539623985.

Full text
Abstract:
High-level modeling formalisms are increasingly popular tools for studying complex systems. Given a high-level model, we can automatically verify certain system properties or compute performance measures about the system. In the general case, measures must be computed using discrete-event simulations. In certain cases, exact numerical analysis is possible by constructing and analyzing the underlying stochastic process of the system, which is a continuous-time Markov chain (CTMC) in our case. Unfortunately, the number of states in the underlying CTMC can be extremely large, even if the high-level model is "small". In this thesis, we develop data structures and techniques that can tolerate these large numbers of states. First, we present a multi-level data structure for storing the set of reachable states of a model. We then introduce the concept of event "locality", which considers the components of the model that an event may affect. We show how a state generation algorithm using our multi-level structure can exploit event locality to reduce CPU requirements. Then, we present a symbolic generation technique based on our multi-level structure and our concept of event locality, in which operations are applied to sets of states. The extremely compact data structure and efficient manipulation routines we present allow for the examination of much larger systems than was previously possible. The transition rate matrix of the underlying CTMC can be represented with Kronecker algebra under certain conditions. However, the use of Kronecker algebra introduces several sources of CPU overhead during numerical solution. We present data structures, including our new data structure called matrix diagrams, that can reduce this CPU overhead. Using our techniques, we can compute measures for large systems in a fraction of the time required by current state-of-the-art techniques. Finally, we present a technique for approximating stationary measures using aggregations of the underlying CTMC. Our technique utilizes exact knowledge of the underlying CTMC using our compact data structure for the reachable states and a Kronecker representation for the transition rates. We prove that the approximation is exact for models possessing a product-form solution.
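For readers unfamiliar with the Kronecker representation mentioned above, the following toy sketch (an assumed example, not the thesis's implementation) shows how the generator of two independent submodels is the Kronecker sum of the local generators, so the global matrix never needs to be stored explicitly.

```python
# Illustrative sketch only (hypothetical example, not from the cited thesis).
# The generator of a CTMC composed of two independent submodels is the
# Kronecker sum of the local generators: Q = kron(Q1, I2) + kron(I1, Q2).
import numpy as np

# Local generators of two tiny 2-state submodels (each row sums to zero).
Q1 = np.array([[-1.0,  1.0],
               [ 2.0, -2.0]])
Q2 = np.array([[-0.5,  0.5],
               [ 3.0, -3.0]])

I1 = np.eye(Q1.shape[0])
I2 = np.eye(Q2.shape[0])

# Global 4x4 generator, expanded here only for demonstration; in practice the
# Kronecker factors are kept and applied to vectors on the fly.
Q = np.kron(Q1, I2) + np.kron(I1, Q2)

print(Q)
print(Q.sum(axis=1))   # all rows of the combined generator still sum to zero
```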
APA, Harvard, Vancouver, ISO, and other styles
12

Schönauer, Stefan. "Efficient similarity search in structured data." [S.l.] : [s.n.], 2004. http://edoc.ub.uni-muenchen.de/archive/00001802.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Schönauer, Stefan. "Efficient Similarity Search in Structured Data." Diss., lmu, 2004. http://nbn-resolving.de/urn:nbn:de:bvb:19-18022.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Wackersreuther, Bianca. "Efficient Knowledge Extraction from Structured Data." Diss., lmu, 2011. http://nbn-resolving.de/urn:nbn:de:bvb:19-138079.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Kashima, Hisashi. "Machine learning approaches for structured data." 京都大学 (Kyoto University), 2007. http://hdl.handle.net/2433/135953.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Maksimovic, Gordana. "Query Languages for Semi-structured Data." Thesis, Blekinge Tekniska Högskola, Institutionen för programvaruteknik och datavetenskap, 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-4332.

Full text
Abstract:
Semi-structured data is defined as irregular data with structure that may change rapidly or unpredictably. An example of such data can be found inside the World-Wide Web. Since the data is irregular, the user may not know the complete structure of the database. Thus, querying such data becomes a difficult issue. In order to write meaningful queries on semi-structured data, there is a need for a query language that will support the features that are presented by this data. Standard query languages, such as SQL for relational databases and OQL for object databases, are too constraining for querying semi-structured data, because they require data to conform to a fixed schema before any data is stored into the database. This paper introduces Lorel, a query language developed particularly for querying semi-structured data. Furthermore, it investigates if the standardised query languages support any of the criteria presented for semi-structured data. The result is an evaluation of three query languages, SQL, OQL and Lorel against these criteria.
APA, Harvard, Vancouver, ISO, and other styles
17

Ng, Kee Siong. "Learning Comprehensible Theories from Structured Data." The Australian National University, Research School of Information Sciences and Engineering, 2005. http://thesis.anu.edu.au./public/adt-ANU20051031.105726.

Full text
Abstract:
This thesis is concerned with the problem of learning comprehensible theories from structured data and covers primarily classification and regression learning. The basic knowledge representation language is set around a polymorphically-typed, higher-order logic. The general setup is closely related to the learning from propositionalized knowledge and learning from interpretations settings in Inductive Logic Programming. Individuals (also called instances) are represented as terms in the logic. A grammar-like construct called a predicate rewrite system is used to define features in the form of predicates that individuals may or may not satisfy. For learning, decision-tree algorithms of various kinds are adopted. The scope of the thesis spans both theory and practice. On the theoretical side, I study in this thesis: 1. the representational power of different function classes and relationships between them; 2. the sample complexity of some commonly-used predicate classes, particularly those involving sets and multisets; 3. the computational complexity of various optimization problems associated with learning and algorithms for solving them; and 4. the (efficient) learnability of different function classes in the PAC and agnostic PAC models. On the practical side, the usefulness of the learning system developed is demonstrated with applications in two important domains: bioinformatics and intelligent agents. Specifically, the following are covered in this thesis: 1. a solution to a benchmark multiple-instance learning problem and some useful lessons that can be drawn from it; 2. a successful attempt on a knowledge discovery problem in predictive toxicology, one that can serve as another proof-of-concept that real chemical knowledge can be obtained using symbolic learning; 3. a reworking of an exercise in relational reinforcement learning and some new insights and techniques we learned for this interesting problem; and 4. a general approach for personalizing user agents that takes full advantage of symbolic learning.
APA, Harvard, Vancouver, ISO, and other styles
18

Thomson, Susan Elizabeth. "A storage service for structured data." Thesis, University of Cambridge, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.385486.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

NUNES, BERNARDO PEREIRA. "AUTOMATIC CLASSIFICATION OF SEMI-STRUCTURED DATA." PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2009. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=14382@1.

Full text
Abstract:
The problem of data classification goes back to the creation of taxonomies intended to cover areas of knowledge. With the advent of the Web, the amount of available data has increased by several orders of magnitude, making manual data classification practically impossible. This dissertation proposes a method to automatically organize semi-structured data, represented by frames, without any previous class structure. It introduces an algorithm, based on K-Medoid, capable of organizing a set of frames into classes structured as a strict hierarchy. The classification of the frames is based on a closeness criterion that takes into account the attributes and values each frame possesses.
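For illustration only, a minimal Python sketch of a K-Medoid-style grouping of frames. Representing frames as dicts and measuring closeness by the Jaccard overlap of attribute-value pairs are assumptions of this example, not necessarily the dissertation's actual choices.

```python
# Illustrative sketch only; frames-as-dicts and Jaccard overlap are assumptions
# of this example, not the dissertation's representation or criterion.

def dissimilarity(f1, f2):
    """1 - Jaccard overlap of the frames' attribute-value pairs."""
    a, b = set(f1.items()), set(f2.items())
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

def k_medoids(frames, k, iters=20):
    """Tiny PAM-style loop: greedy farthest-point seeding, then alternate
    between assigning frames to the nearest medoid and re-centring clusters."""
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(len(frames)),
                           key=lambda i: min(dissimilarity(frames[i], frames[m])
                                             for m in medoids)))
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i, f in enumerate(frames):
            clusters[min(medoids, key=lambda m: dissimilarity(f, frames[m]))].append(i)
        new_medoids = [min(members, key=lambda c: sum(dissimilarity(frames[c], frames[j])
                                                      for j in members)) if members else m
                       for m, members in clusters.items()]
        if set(new_medoids) == set(medoids):
            return clusters
        medoids = new_medoids
    return clusters

frames = [
    {"genus": "Quercus", "habit": "tree", "leaf": "lobed"},
    {"genus": "Quercus", "habit": "tree"},
    {"format": "XML", "schema": "DTD"},
    {"format": "XML", "schema": "none"},
]
print(k_medoids(frames, k=2))   # {0: [0, 1], 2: [2, 3]} -- plant frames vs. XML frames
```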
APA, Harvard, Vancouver, ISO, and other styles
20

Evanco, Kathleen L. (Kathleen Lee). "Customized data visualization using structured video." Thesis, Massachusetts Institute of Technology, 1996. http://hdl.handle.net/1721.1/29106.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Lee, John Boaz T. "Deep Learning on Graph-structured Data." Digital WPI, 2019. https://digitalcommons.wpi.edu/etd-dissertations/570.

Full text
Abstract:
In recent years, deep learning has made a significant impact in various fields – helping to push the state-of-the-art forward in many application domains. Convolutional Neural Networks (CNN) have been applied successfully to tasks such as visual object detection, image super-resolution, and video action recognition while Long Short-term Memory (LSTM) and Transformer networks have been used to solve a variety of challenging tasks in natural language processing. However, these popular deep learning architectures (i.e., CNNs, LSTMs, and Transformers) can only handle data that can be represented as grids or sequences. Due to this limitation, many existing deep learning approaches do not generalize to problem domains where the data is represented as graphs – social networks in social network analysis or molecular graphs in chemoinformatics, for instance. The goal of this thesis is to help bridge the gap by studying deep learning solutions that can handle graph data naturally. In particular, we explore deep learning-based approaches in the following areas. 1. Graph Attention. In the real-world, graphs can be both large – with many complex patterns – and noisy which can pose a problem for effective graph mining. An effective way to deal with this issue is to use an attention-based deep learning model. An attention mechanism allows the model to focus on task-relevant parts of the graph which helps the model make better decisions. We introduce a model for graph classification which uses an attention-guided walk to bias exploration towards more task-relevant parts of the graph. For the task of node classification, we study a different model – one with an attention mechanism which allows each node to select the most task-relevant neighborhood to integrate information from. 2. Graph Representation Learning. Graph representation learning seeks to learn a mapping that embeds nodes, and even entire graphs, as points in a low-dimensional continuous space. The function is optimized such that the geometric distance between objects in the embedding space reflect some sort of similarity based on the structure of the original graph(s). We study the problem of learning time-respecting embeddings for nodes in a dynamic network. 3. Brain Network Discovery. One of the fundamental tasks in functional brain analysis is the task of brain network discovery. The brain is a complex structure which is made up of various brain regions, many of which interact with each other. The objective of brain network discovery is two-fold. First, we wish to partition voxels – from a functional Magnetic Resonance Imaging scan – into functionally and spatially cohesive regions (i.e., nodes). Second, we want to identify the relationships (i.e., edges) between the discovered regions. We introduce a deep learning model which learns to construct a group-cohesive partition of voxels from the scans of multiple individuals in the same group. We then introduce a second model which can recover a hierarchical set of brain regions, allowing us to examine the functional organization of the brain at different levels of granularity. Finally, we propose a model for the problem of unified and group-contrasting edge discovery which aims to discover discriminative brain networks that can help us to better distinguish between samples from different classes.
APA, Harvard, Vancouver, ISO, and other styles
22

Bandyopadhyay, Bortik. "Querying Structured Data via Informative Representations." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1595447189545086.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Folkesson, Carl. "Anonymization of directory-structured sensitive data." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-160952.

Full text
Abstract:
Data anonymization is a relevant and important field within data privacy, which tries to find a good balance between utility and privacy in data. The field is especially relevant since the GDPR came into force, because the GDPR does not regulate anonymous data. This thesis focuses on anonymization of directory-structured data, which means data structured into a tree of directories. In the thesis, four of the most common models for anonymization of tabular data, k-anonymity, ℓ-diversity, t-closeness and differential privacy, are adapted for anonymization of directory-structured data. This adaptation is done by creating three different approaches for anonymizing directory-structured data: SingleTable, DirectoryWise and RecursiveDirectoryWise. These models and approaches are compared and evaluated using five metrics and three attack scenarios. The results show that there is always a trade-off between utility and privacy when anonymizing data. In particular, it was concluded that the differential privacy model with the RecursiveDirectoryWise approach gives the highest privacy, but also the highest information loss. Conversely, the k-anonymity model with the SingleTable approach or the t-closeness model with the DirectoryWise approach gives the lowest information loss, but also the lowest privacy. The differential privacy model and the RecursiveDirectoryWise approach were also shown to give the best protection against the chosen attacks. Finally, it was concluded that the differential privacy model with the RecursiveDirectoryWise approach was the most suitable combination for complying with the GDPR when anonymizing directory-structured data.
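As a hedged illustration of two of the models named above, the sketch below shows a k-anonymity check over quasi-identifiers and a Laplace-noise count in the style of differential privacy. The record fields, quasi-identifiers and epsilon are invented for the example; this is not the thesis's SingleTable, DirectoryWise or RecursiveDirectoryWise implementation.

```python
# Illustrative sketch only; fields and parameters are made-up examples.
import math
import random
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs >= k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def dp_count(true_count, epsilon, rng=random):
    """Differentially private count: add Laplace(1/epsilon) noise (sensitivity 1)."""
    u = rng.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

records = [
    {"dir_depth": 2, "owner_age": "30-39", "size_kb": 12},
    {"dir_depth": 2, "owner_age": "30-39", "size_kb": 90},
    {"dir_depth": 3, "owner_age": "40-49", "size_kb": 7},
]
print(is_k_anonymous(records, ["dir_depth", "owner_age"], k=2))  # False: last row is unique
print(dp_count(len(records), epsilon=0.5))                       # noisy record count
```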
APA, Harvard, Vancouver, ISO, and other styles
24

Ng, Kee Siong. "Learning comprehensible theories from structured data /." View thesis entry in Australian Digital Theses Program, 2005. http://thesis.anu.edu.au/public/adt-ANU20051031.105726/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Da San Martino, Giovanni <1979>. "Kernel Methods for Tree Structured Data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2009. http://amsdottorato.unibo.it/1400/1/thesis.pdf.

Full text
Abstract:
Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real-world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in the learning and classification phases. Such complexity can sometimes prevent the application of kernels in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. A convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
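To illustrate the convolution-kernel idea described above, here is a deliberately naive sketch that counts the complete rooted subtrees two trees share; the subtree and subset tree kernels studied in the thesis are more refined than this toy version, but the principle of comparing objects through the similarity of their parts is the same.

```python
# Illustrative sketch only: a naive kernel counting complete rooted subtrees
# shared by two trees, not the kernels developed in the cited thesis.
from collections import Counter

def serialize(tree):
    """Canonical string of the complete subtree rooted at this node."""
    label, children = tree
    if not children:
        return label
    return label + "(" + ",".join(serialize(c) for c in children) + ")"

def all_subtrees(tree):
    """Serializations of the complete subtree rooted at every node."""
    label, children = tree
    out = [serialize(tree)]
    for c in children:
        out.extend(all_subtrees(c))
    return out

def subtree_kernel(t1, t2):
    """K(t1, t2) = number of pairs of identical complete subtrees."""
    c1, c2 = Counter(all_subtrees(t1)), Counter(all_subtrees(t2))
    return sum(c1[s] * c2[s] for s in c1 if s in c2)

# Trees as (label, [children]) tuples.
t1 = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])])
t2 = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", []), ("NP", [("N", [])])])])
print(subtree_kernel(t1, t2))   # 5 matches: NP(D,N), D, V, and N twice
```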
APA, Harvard, Vancouver, ISO, and other styles
26

Da San Martino, Giovanni <1979>. "Kernel Methods for Tree Structured Data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2009. http://amsdottorato.unibo.it/1400/.

Full text
Abstract:
Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real-world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in the learning and classification phases. Such complexity can sometimes prevent the application of kernels in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. A convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
APA, Harvard, Vancouver, ISO, and other styles
27

Tatikonda, Shirish. "Towards Efficient Data Analysis and Management of Semi-structured Data." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1275414859.

Full text
APA, Harvard, Vancouver, ISO, and other styles
28

Eichhorn, Jan. "Applications of kernel machines to structured data." [S.l.] : [s.n.], 2006. http://opus.kobv.de/tuberlin/volltexte/2007/1507.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Lipton, Zachary C. "Learning from Temporally-Structured Human Activities Data." Thesis, University of California, San Diego, 2018. http://pqdtopen.proquest.com/#viewpdf?dispub=10683703.

Full text
Abstract:

Despite the extraordinary success of deep learning on diverse problems, these triumphs are too often confined to large, clean datasets and well-defined objectives. Face recognition systems train on millions of perfectly annotated images. Commercial speech recognition systems train on thousands of hours of painstakingly-annotated data. But for applications addressing human activity, data can be noisy, expensive to collect, and plagued by missing values. In electronic health records, for example, each attribute might be observed on a different time scale. Complicating matters further, deciding precisely what objective warrants optimization requires critical consideration of both algorithms and the application domain. Moreover, deploying human-interacting systems requires careful consideration of societal demands such as safety, interpretability, and fairness.

The aim of this thesis is to address the obstacles to mining temporal patterns in human activity data. The primary contributions are: (1) the first application of RNNs to multivariate clinical time series data, with several techniques for bridging long-term dependencies and modeling missing data; (2) a neural network algorithm for forecasting surgery duration while simultaneously modeling heteroscedasticity; (3) an approach to quantitative investing that uses RNNs to forecast company fundamentals; (4) an exploration strategy for deep reinforcement learners that significantly speeds up dialogue policy learning; (5) an algorithm to minimize the number of catastrophic mistakes made by a reinforcement learner; (6) critical works addressing model interpretability and fairness in algorithmic decision-making.

APA, Harvard, Vancouver, ISO, and other styles
30

Paaßen, Benjamin [author]. "Metric Learning for Structured Data / Benjamin Paaßen." Bielefeld : Universitätsbibliothek Bielefeld, 2019. http://d-nb.info/1186887818/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Blampied, Paul Alexander. "Structured recursion for non-uniform data-types." Thesis, University of Nottingham, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.342028.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Cottee, Michaela J. "The graphical representation of structured multivariate data." Thesis, Open University, 1996. http://oro.open.ac.uk/57616/.

Full text
Abstract:
During the past two decades or so, graphical representations have been used increasingly for the examination, summarisation and communication of statistical data. Many graphical techniques exist for exploratory data analysis (i.e. for deciding which model it is appropriate to fit to the data) and a number of graphical diagnostic techniques exist for checking the appropriateness of a fitted model. However, very few techniques exist for the representation of the fitted model itself. This thesis is concerned with the development of some new and existing graphical representation techniques for the communication and interpretation of fitted statistical models. The first part of this thesis takes the form of a general overview of the use in statistics of graphical representations for exploratory data analysis and diagnostic model checking. In relation to the concern of this thesis, particular consideration is given to the few graphical techniques which already exist for the representation of fitted models. A number of novel two-dimensional approaches are then proposed which go partway towards providing a graphical representation of the main effects and interaction terms for fitted models. This leads on to a description of conditional independence graphs, and consideration of the suitability of conditional independence graphs as a technique for the representation of fitted models. Conditional independence graphs are then developed further in accordance with the research aims. Since it becomes apparent that it is not possible to use any of the approaches taken in order to develop a simple two-dimensional pen-and-paper technique for the unambiguous graphical representation of all fitted statistical models, an interactive computer package based on the conditional independence graph approach is developed for the construction, communication and interpretation of graphical representations for fitted statistical models. This package, called the "Conditional Independence Graph Enhancer" (CIGE), does provide unambiguous graphical representations for all fitted statistical models considered.
APA, Harvard, Vancouver, ISO, and other styles
33

Sun, Yizhi. "Statistical Analysis of Structured High-dimensional Data." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/97505.

Full text
Abstract:
High-dimensional data such as multi-modal neuroimaging data and large-scale networks carry an excessive amount of information, and can be used to test various scientific hypotheses or discover important patterns in complicated systems. While considerable efforts have been made to analyze high-dimensional data, existing approaches often rely on simple summaries which could miss important information, and many challenges in modeling complex structures in data remain unaddressed. In this proposal, we focus on analyzing structured high-dimensional data, including functional data with important local regions and network data with community structures. The first part of this dissertation concerns the detection of "important" regions in functional data. We propose a novel Bayesian approach that enables region selection in the functional data regression framework. The selection of regions is achieved through encouraging sparse estimation of the regression coefficient, where nonzero regions correspond to regions that are selected. To achieve sparse estimation, we adopt a compactly supported and potentially over-complete basis to capture local features of the regression coefficient function, and assume a spike-and-slab prior on the coefficients of the basis functions. To encourage continuous shrinkage of nearby regions, we assume an Ising hyper-prior which takes into account the neighboring structure of the basis functions. This neighboring structure is represented by an undirected graph. We perform posterior sampling through Markov chain Monte Carlo algorithms. The practical performance of the proposed approach is demonstrated through simulations as well as near-infrared and sonar data. The second part of this dissertation focuses on constructing diversified portfolios using stock return data in the Center for Research in Security Prices (CRSP) database maintained by the University of Chicago. Diversification is a risk management strategy that involves mixing a variety of financial assets in a portfolio. This strategy helps reduce the overall risk of the investment and improve performance of the portfolio. To construct portfolios that effectively diversify risks, we first construct a co-movement network using the correlations between stock returns over a training time period. Correlation characterizes the synchrony among stock returns and thus helps us understand whether two or more stocks have common risk attributes. Based on the co-movement network, we apply multiple network community detection algorithms to detect groups of stocks with common co-movement patterns. Stocks within the same community tend to be highly correlated, while stocks across different communities tend to be less correlated. A portfolio is then constructed by selecting stocks from different communities. The average return of the constructed portfolio over a testing time period is finally compared with the S&P 500 market index. Our constructed portfolios demonstrate outstanding performance during a non-crisis period (2004-2006) and good performance during a financial crisis period (2008-2010).
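For readers unfamiliar with the priors named above, a generic textbook formulation of a spike-and-slab prior with an Ising hyper-prior over the selection indicators is sketched below; the notation is ours for illustration, not the dissertation's.

```latex
% Generic notation, not the dissertation's:
% beta_j : coefficient of basis function j
% gamma_j in {0,1} : indicator that the region of basis function j is selected
% G = (V, E) : undirected graph encoding which basis functions are neighbours
\begin{align*}
  \beta_j \mid \gamma_j &\;\sim\; \gamma_j \,\mathcal{N}(0,\tau^2) \;+\; (1-\gamma_j)\,\delta_0
     && \text{(spike-and-slab: slab if selected, point mass at 0 otherwise)} \\
  p(\gamma) &\;\propto\; \exp\!\Big( a \sum_{j \in V} \gamma_j \;+\; b \sum_{(j,k) \in E} \gamma_j \gamma_k \Big)
     && \text{(Ising hyper-prior favouring joint selection of neighbouring regions)}
\end{align*}
```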
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
34

Tu, Ying. "Focus-based Interactive Visualization for Structured Data." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1366198735.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Morris, Henry. "Sparse nonlinear methods for predicting structured data." Thesis, Imperial College London, 2012. http://hdl.handle.net/10044/1/9548.

Full text
Abstract:
Gaussian processes are now widely used to perform key machine learning tasks such as nonlinear regression and classification. An attractive feature of Gaussian process models is the behaviour of the error bars, which grow in regions away from observations where there is high uncertainty about the interpolating function. The complexity of these models scales as O(N³) with sample size, which causes difficulties with large data sets. The goals of this work are to develop nonlinear, nonparametric modelling techniques for structure learning and prediction problems in which there are structured dependencies among the observed data, and to equip our models with sparse representations which serve both to handle prior sparse connectivity assumptions and to reduce computational complexity. We present Kernel Dynamical Structure Learning, a Bayesian method for learning the structure of interactions between variables in multivariate time-series. We design a mutual information kernel to handle time-series trajectories, and show that prior knowledge about network sparsity can be incorporated using heavy-tailed priors over parameters. We evaluate the feasibility of our method on synthetic data, and extend the inference methodology to the handling of uncertain input data. Next, we tackle the problem of belief propagation in Bayesian networks with nonlinear node relations. We propose an exact moment-matching approach for nonlinear belief propagation in any tree-structured graph. We call this Gaussian Process Belief Propagation. We extend this approach by the addition of hidden variables which allow nodes sharing common influences to be conditionally independent. This constitutes a novel approach to multi-output regression on bivariate graph structures, and we call this Dependent Gaussian Process Belief Propagation. We describe sparse inference methods for both models, which reduce computational cost by learning compact parameterisations of the available training data. We then apply our method to the real-world systems biology problem of protein inference in transcriptional networks.
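As background for the O(N³) scaling and the growing error bars mentioned above, here is a standard textbook sketch of Gaussian process regression in NumPy (generic formulas, not the models developed in the thesis).

```python
# Illustrative textbook GP regression, not the thesis's models. The Cholesky
# factorisation of the N x N kernel matrix is the O(N^3) step mentioned above.
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    L = np.linalg.cholesky(K)                              # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    K_s = rbf(x_train, x_test)
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(rbf(x_test, x_test)) - np.sum(v ** 2, axis=0)
    return mean, var                                       # var grows away from the data

x = np.array([-2.0, -1.0, 0.0, 1.0])
y = np.sin(x)
xs = np.linspace(-4, 4, 9)
mu, var = gp_predict(x, y, xs)
print(np.round(mu, 2))
print(np.round(var, 2))    # largest near -4 and 4, far from the observations
```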
APA, Harvard, Vancouver, ISO, and other styles
36

Bala, Saimir. "Mining Projects from Structured and Unstructured Data." Jens Gulden, Selmin Nurcan, Iris Reinhartz-Berger, Widet Guédria, Palash Bera, Sérgio Guerreiro, Michael Fellman, Matthias Weidlich, 2017. http://epub.wu.ac.at/7205/1/ProjecMining%2DCamera%2DReady.pdf.

Full text
Abstract:
Companies working on safety-critical projects must adhere to strict rules imposed by the domain, especially when human safety is involved. These projects need to be compliant to standard norms and regulations. Thus, all the process steps must be clearly documented in order to be verifiable for compliance in a later stage by an auditor. Nevertheless, documentation often comes in the form of manually written textual documents in different formats. Moreover, the project members use diverse proprietary tools. This makes it difficult for auditors to understand how the actual project was conducted. My research addresses the project mining problem by exploiting logs from project-generated artifacts, which come from software repositories used by the project team.
APA, Harvard, Vancouver, ISO, and other styles
37

King, Michael Allen. "Ensemble Learning Techniques for Structured and Unstructured Data." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51667.

Full text
Abstract:
This research provides an integrated approach of applying innovative ensemble learning techniques that has the potential to increase the overall accuracy of classification models. Actual structured and unstructured data sets from industry are utilized during the research process, analysis and subsequent model evaluations. The first research section addresses the consumer demand forecasting and daily capacity management requirements of a nationally recognized alpine ski resort in the state of Utah, in the United States of America. A basic econometric model is developed and three classic predictive models are evaluated for effectiveness. These predictive models were subsequently used as input for four ensemble modeling techniques. Ensemble learning techniques are shown to be effective. The second research section discusses the opportunities and challenges faced by a leading firm providing sponsored search marketing services. The goal for sponsored search marketing campaigns is to create advertising campaigns that better attract and motivate a target market to purchase. This research develops a method for classifying profitable campaigns and maximizing overall campaign portfolio profits. Four traditional classifiers are utilized, along with four ensemble learning techniques, to build classifier models to identify profitable pay-per-click campaigns. A MetaCost ensemble configuration, having the ability to integrate unequal classification costs, produced the highest campaign portfolio profit. The third research section addresses the management challenges of online consumer reviews encountered by service industries and addresses how these textual reviews can be used for service improvements. A service improvement framework is introduced that integrates traditional text mining techniques and second order feature derivation with ensemble learning techniques. The concept of GLOW and SMOKE words is introduced and is shown to be an objective text analytic source of service defects or service accolades.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
38

Sundaravadivelu, Rathinasabapathy. "Interoperability between heterogeneous and distributed biodiversity data sources in structured data networks." Thesis, Cardiff University, 2010. http://orca.cf.ac.uk/18086/.

Full text
Abstract:
The extensive capturing of biodiversity data and storing them in heterogeneous information systems that are accessible on the internet across the globe has created many interoperability problems. One is that the data providers are independent of others and they can run systems which were developed on different platforms at different times using different software products to respond to different needs of information. A second arises from the data modelling used to convert the real-world data into a computerised data structure which is not conditioned by a universal standard. Most importantly, interoperation between these disparate data sources is needed to get accurate and useful information for further analysis and decision making. The software representation of a universal or a single data definition structure for depicting a biodiversity entity is ideal. But this is not necessarily possible when integrating data from independently developed systems. The different perspectives of the real-world entity when being modelled by independent teams will result in the use of different terminologies, definition and representation of attributes and operations for the same real-world entity. The research in this thesis is concerned with designing and developing an interoperable flexible framework that allows data integration between various distributed and heterogeneous biodiversity data sources that adopt XML standards for data communication. In particular the problems of scope and representational heterogeneity among the various XML data schemas are addressed. To demonstrate this research a prototype system called BUFFIE (Biodiversity Users' Flexible Framework for Interoperability Experiments) was designed using a hybrid of Object-oriented and Functional design principles. This system accepts the query information from the user in a web form, and designs an XML query. This request query is enriched and is made more specific to data providers using the data provider information stored in a repository. These requests are sent to the different heterogeneous data resources across the internet using the HTTP protocol. The responses received are in varied XML formats which are integrated using knowledge mapping rules defined in XSLT & XML. The XML mappings are derived from a biodiversity domain knowledgebase defined for schema mappings of different data exchange protocols. The integrated results are presented to users or client programs for further analysis. The main results of this thesis are: (1) a framework model that allows interoperation between the heterogeneous data source systems; (2) enriched querying that improves the accuracy of responses by finding the correct information existing among autonomous, distributed and heterogeneous data resources; (3) a methodology that provides a foundation for extensibility, as any new network data standards in XML can be added to the existing protocols. The presented approach shows that (1) semi-automated mapping and integration of datasets from the heterogeneous and autonomous data providers is feasible, and (2) query enriching and integrating the data allows the querying and harvesting of useful data from various data providers for helpful analysis.
APA, Harvard, Vancouver, ISO, and other styles
39

Otaki, Keisuke. "Algorithmic Approaches to Pattern Mining from Structured Data." 京都大学 (Kyoto University), 2016. http://hdl.handle.net/2433/215673.

Full text
Abstract:
The contents of Chapter 6 are based on work published in IPSJ Transactions on Mathematical Modeling and Its Applications, vol.9(1), pp.32-42, 2016.
Kyoto University (京都大学)
0048
New degree system, doctoral programme
Doctor of Informatics
甲第19846号
情博第597号
新制||情||104(附属図書館)
32882
Kyoto University, Graduate School of Informatics, Department of Intelligence Science and Technology
(Chief examiner) Professor Akihiro Yamamoto; Professor Hisashi Kashima; Professor Tatsuya Akutsu
Qualified under Article 4, Paragraph 1 of the Degree Regulations
APA, Harvard, Vancouver, ISO, and other styles
40

Zhao, Xiaoyan 1966. "Trie methods for structured data on secondary storage." Thesis, McGill University, 2000. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=36855.

Full text
Abstract:
This thesis presents trie organizations for one-dimensional and multidimensional structured data on secondary storage. The new trie structures have several distinctive features: (1) they provide significant storage compression by sharing common paths near the root; (2) they are partitioned into pages and are suitable for secondary storage; (3) they are capable of dynamic insertions and deletions of records; (4) they support efficient multidimensional variable-resolution queries by storing the most significant bits near the root.
We apply the trie structures to indexing, storing and querying structured data on secondary storage. We are interested in the storage compactness, the I/O efficiency, the order-preserving properties, the general orthogonal range queries and the exact match queries for very large files and databases. We also apply the trie structures to relational joins (set operations).
We compare trie structures to various data structures on secondary storage: multipaging and grid files in the direct access method category, R-trees/R*-trees and X-trees in the logarithmic access cost category, as well as some representative join algorithms for performing join operations. Our results show that range queries using the trie method are superior to these competitors in search cost when queries return more than a few records, and are competitive with direct access methods for exact match queries. Furthermore, because the trie structure compresses data, it is the winner in terms of storage compared to all the other methods mentioned above.
We also present a new tidy function for order-preserving key-to-address transformation. Our tidy function is easy to construct and cheaper in access time and storage cost compared to its closest competitor.
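The following minimal Python sketch illustrates the general principle of a most-significant-bit-first trie: keys sharing a long common prefix share a path near the root, and a range query can prune whole subtrees. The 16-bit key width and dict-based in-memory nodes are illustrative simplifications, not the paged secondary-storage structures developed in the thesis.

    class BitTrie:
        # Minimal binary trie over fixed-width integer keys, most significant bit first.

        def __init__(self, bits=16):
            self.bits = bits
            self.root = {}

        def insert(self, key: int):
            node = self.root
            for i in range(self.bits - 1, -1, -1):
                node = node.setdefault((key >> i) & 1, {})
            node["leaf"] = key

        def range(self, lo: int, hi: int):
            # Yield stored keys in [lo, hi], pruning subtrees outside the range.
            def walk(node, prefix, depth):
                if depth == 0:
                    if "leaf" in node and lo <= node["leaf"] <= hi:
                        yield node["leaf"]
                    return
                for b in (0, 1):
                    if b in node:
                        lo_sub = (prefix << 1 | b) << (depth - 1)
                        hi_sub = lo_sub | ((1 << (depth - 1)) - 1)
                        if hi_sub >= lo and lo_sub <= hi:   # subtree overlaps the range
                            yield from walk(node[b], prefix << 1 | b, depth - 1)
            yield from walk(self.root, 0, self.bits)

    t = BitTrie()
    for k in (3, 7, 1024, 1030, 40000):
        t.insert(k)
    print(sorted(t.range(0, 2000)))     # -> [3, 7, 1024, 1030]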
APA, Harvard, Vancouver, ISO, and other styles
41

Nunes, Ian Monteiro. "Clustering Text Structured Data Based on Text Similarity." Pontifícia Universidade Católica do Rio de Janeiro, 2008. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=25796@1.

Full text
Abstract:
PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
PROGRAMA DE EXCELENCIA ACADEMICA
This work reports our findings on a set of text clustering experiments in which a large number of models and algorithms were applied. The objective of these experiments is to determine the most feasible strategies for processing the large volumes of information generated by growing data quality demands in many sectors of the economy. The deduplication process was accelerated by dividing the data set into subsets of similar items. In the best-case scenario, each subset contains every duplicate occurrence of a record, reducing the clustering error to zero; in practice, a tolerance of 5 percent after clustering was established. The experiments show that processing time is significantly lower and that accuracy reaches up to 98.92 percent. The best accuracy/performance trade-off is achieved with the K-Means algorithm using a trigram-based model.
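A minimal Python sketch of the reported best-performing combination, K-Means clustering over a character-trigram representation, used as a blocking step that groups likely duplicates before pairwise comparison. The sample records, the tf-idf weighting and the number of clusters are illustrative assumptions (scikit-learn is assumed to be available).

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical customer-name records containing near-duplicates.
    records = ["Joao da Silva", "João da Silva", "J. da Silva",
               "Maria Souza", "Maria de Souza", "Pedro Alves"]

    # Character-trigram representation; tf-idf weighting is an assumption here.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    X = vectorizer.fit_transform(records)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    for cluster in range(3):
        print(cluster, [r for r, l in zip(records, labels) if l == cluster])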
APA, Harvard, Vancouver, ISO, and other styles
42

Gianniotis, Nikolaos. "Visualisation of structured data through generative probabilistic modeling." Thesis, University of Birmingham, 2008. http://etheses.bham.ac.uk//id/eprint/4803/.

Full text
Abstract:
This thesis is concerned with the construction of topographic maps of structured data. A probabilistic generative model-based approach is taken, inspired by the GTM algorithm. Depending on the data at hand, the form of a probabilistic generative model is specified that is appropriate for modelling the probability density of the data. A mixture of such models is formulated which is topographically constrained on a low-dimensional latent space. By constrained, we mean that each point in the latent space determines the parameters of one model via a smooth non-linear mapping; by topographic, we mean that neighbouring latent points generate similar parameters which address statistically similar models. The constrained mixture is trained to model the density of the structured data. A map is constructed by projecting each data item to a location on the latent space where the local latent points are associated with models that express a high probability of having generated the particular data item. We present three formulations for constructing topographic maps of structured data. Two of them are concerned with tree-structured data and employ hidden Markov trees and Markov trees as probabilistic generative models. The third approach is concerned with astronomical light curves from eclipsing binary stars and employs a physics-based model. The formulation of all three models is accompanied by experiments and analysis of the resulting topographic maps.
APA, Harvard, Vancouver, ISO, and other styles
43

Stachowiak, Maciej 1976. "Automated extraction of structured data from HTML documents." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/9896.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (leaf 45).
by Maciej Stachowiak.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
44

Zhang, Peng. "Structured sensing for estimation of high-dimensional data." Thesis, Imperial College London, 2016. http://hdl.handle.net/10044/1/49415.

Full text
Abstract:
Efficient estimation and processing of high-dimensional data is important in many scientific and engineering domains. In this thesis, we explore structured sensing methods for high-dimensional signals from three different perspectives: structured random matrices for compressed sensing and corrupted sensing, atomic norm regularization for massive multiple-input multiple-output (MIMO) systems, and variable density sampling for random fields. Designing efficient sensing systems for high-dimensional data by appealing to the prior knowledge that their intrinsic information is usually small has become popular in recent years. As a starting point, compressed sensing has proven to be feasible for estimating sparse signals when the number of measurements is far less than the dimensionality of the signals. Besides fully random sensing matrices, many structured sensing matrices have been designed to reduce computation and storage cost. We propose a unified structured sensing framework and prove the associated restricted isometry property. We demonstrate that the proposed framework encompasses many existing designs. In addition, we construct new structured sensing models based on the proposed framework. Furthermore, we consider a generalized problem where the compressive measurements are affected by both dense noise and sparse corruption. We show that in some cases the proposed framework can still guarantee faithful recovery of both the sparse signal and the corruption. The next part of the thesis is concerned with channel estimation and faulty antenna detection in massive MIMO systems. By leveraging the intrinsic information of the channel matrix through the atomic norm, we propose new algorithms and demonstrate their performance for both channel estimation and faulty antenna detection. In the last part, we propose a variable density sampling method for the estimation of a high-dimensional random field. While conventional uniform sampling requires a number of samples increasing exponentially with the dimension, we show that faithful recovery can be guaranteed with a polynomial number of random samples.
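To give a flavour of structured sensing, the sketch below (Python with NumPy and SciPy) builds one common structured measurement operator, a row-subsampled DCT matrix with random column signs, and recovers a sparse vector with plain ISTA. The dimensions, regularisation weight and iteration count are arbitrary illustrative choices, not the constructions or guarantees analysed in the thesis.

    import numpy as np
    from scipy.fft import dct

    rng = np.random.default_rng(0)
    n, m, k = 256, 80, 8                     # signal length, measurements, sparsity

    # Structured sensing operator: subsampled rows of an orthonormal DCT matrix
    # with random column signs (one illustrative member of the structured family).
    signs = rng.choice([-1.0, 1.0], size=n)
    rows = rng.choice(n, size=m, replace=False)
    D = dct(np.eye(n), norm="ortho")
    A = D[rows] * signs

    x_true = np.zeros(n)
    x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
    y = A @ x_true

    # ISTA for the lasso: min 0.5*||y - Ax||^2 + lam*||x||_1
    lam, L = 0.01, np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for _ in range(500):
        z = x + (A.T @ (y - A @ x)) / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

    print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))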
APA, Harvard, Vancouver, ISO, and other styles
45

Forshaw, Gareth William. "Semi-automatic matching of semi-structured data updates." Master's thesis, University of Cape Town, 2014. http://hdl.handle.net/11427/12930.

Full text
Abstract:
Includes bibliographical references.
Data matching, also referred to as data linkage or field matching, is a technique used to combine multiple data sources into one data set. Data matching is used for data integration in a number of sectors and industries; from politics and health care to scientific applications. The motivation for this study was the observation of the day-to-day struggles of a large non-governmental organisation (NGO) in managing their membership database. With a membership base of close to 2.4 million, the challenges they face with regard to the capturing and processing of the semi-structured membership updates are monumental. Updates arrive from the field in a multitude of formats, often incomplete and unstructured, and expert knowledge is geographically localised. These issues are compounded by an extremely complex organisational hierarchy and a general lack of data validation processes. An online system was proposed for pre-processing input and then matching it against the membership database. Termed the Data Pre-Processing and Matching System (DPPMS), it allows for single or bulk updates. Based on the success of the DPPMS with the NGO’s membership database, it was subsequently used for pre-processing and data matching of semi-structured patient and financial customer data. Using the semi-automated DPPMS rather than a clerical data matching system, true positive matches increased by 21% while false negative matches decreased by 20%. The Recall, Precision and F-Measure values all improved and the risk of false positives diminished. The DPPMS was unable to match approximately 8% of provided records; this was largely due to human error during initial data capture. While the DPPMS greatly diminished the reliance on experts, their role remained pivotal during the final stage of the process.
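A minimal Python sketch of weighted fuzzy field matching with a manual-review threshold, in the spirit of the semi-automated matching described above; the field names, weights, threshold and sample records are illustrative assumptions, not the DPPMS schema or parameters.

    from difflib import SequenceMatcher

    def field_similarity(a, b):
        # Normalised similarity between two field values (0 = disjoint, 1 = identical).
        return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

    def match_update(update, members, weights, threshold=0.85):
        # Return (best_member, score) when the weighted similarity clears the
        # threshold, else (None, score) so the record can go to manual review.
        def score(member):
            total = sum(weights.values())
            return sum(w * field_similarity(update.get(f, ""), member.get(f, ""))
                       for f, w in weights.items()) / total
        best = max(members, key=score)
        s = score(best)
        return (best, s) if s >= threshold else (None, s)

    members = [{"name": "Thandi Ngcobo", "branch": "Khayelitsha"},
               {"name": "T. Ngcobo", "branch": "Gugulethu"}]
    update = {"name": "Thandi Ncgobo", "branch": "Khayelitsha"}   # note the typo
    print(match_update(update, members, weights={"name": 0.7, "branch": 0.3}))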
APA, Harvard, Vancouver, ISO, and other styles
46

Ni, Weizeng. "Ontology-based Feature Construction on Non-structured Data." University of Cincinnati / OhioLINK, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1439309340.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Herrmann, Kai, Hannes Voigt, and Wolfgang Lehner. "Cinderella - Adaptive Online Partitioning of Irregularly Structured Data." IEEE, 2014. https://tud.qucosa.de/id/qucosa%3A75273.

Full text
Abstract:
In an increasing number of use cases, databases face the challenge of managing irregularly structured data. Irregularly structured data is characterized by a quickly evolving variety of entities without a common set of attributes. These entities do not show enough regularity to be captured in a traditional database schema. A common solution is to centralize the diverse entities in a universal table. Usually, this leads to a very sparse table. Although today's techniques allow efficient storage of sparse universal tables, query efficiency is still a problem. Queries that reference only a subset of attributes have to read the whole universal table, including many irrelevant entities. One possible solution is to partition the table, which allows partitions of irrelevant entities to be pruned before they are touched. Creating and maintaining such a partitioning manually is very laborious or even infeasible, due to the enormous complexity. Thus, an autonomous solution is desirable. In this paper, we define the Online Partitioning Problem for irregularly structured data and present Cinderella. Cinderella is an autonomous online algorithm for horizontal partitioning of irregularly structured entities in universal tables. It is designed to keep its overhead low by incrementally assigning entities to partitions while they are touched anyway during modifications. The achieved partitioning allows queries that retrieve only entities with a subset of attributes to easily prune partitions of irrelevant entities. Cinderella increases the locality of queries and reduces query execution cost.
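The toy Python sketch below captures the flavour of incremental horizontal partitioning by attribute overlap: each incoming entity is routed to the partition whose attribute signature it overlaps most, so queries over a few attributes can skip the other partitions. The Jaccard-overlap heuristic and threshold are illustrative simplifications, not the actual Cinderella algorithm.

    class OnlinePartitioner:
        # Toy online horizontal partitioner for irregularly structured entities.

        def __init__(self, min_overlap=0.5):
            self.min_overlap = min_overlap
            self.partitions = []            # list of (attribute_signature, [entities])

        def insert(self, entity):
            attrs = set(entity)
            best, best_score = None, 0.0
            for sig, rows in self.partitions:
                score = len(attrs & sig) / len(attrs | sig)    # Jaccard overlap
                if score > best_score:
                    best, best_score = (sig, rows), score
            if best is None or best_score < self.min_overlap:
                self.partitions.append((attrs, [entity]))      # open a new partition
            else:
                best[0].update(attrs)                          # widen the signature
                best[1].append(entity)

        def query(self, needed_attrs):
            # Scan only partitions whose signature covers the requested attributes.
            needed = set(needed_attrs)
            for sig, rows in self.partitions:
                if needed <= sig:
                    yield from (r for r in rows if needed <= r.keys())

    p = OnlinePartitioner()
    for e in [{"title": "a", "pages": 10}, {"title": "b", "pages": 12},
              {"artist": "x", "bpm": 120}]:
        p.insert(e)
    print(list(p.query(["artist"])))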
APA, Harvard, Vancouver, ISO, and other styles
48

Weng, Daiyue. "Extracting structured data from Web query result pages." Thesis, Queen's University Belfast, 2016. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.709858.

Full text
Abstract:
A rapidly increasing number of Web databases have now become accessible only via their HTML form-based query interfaces. Comparing various services or products from a number of web sites in a specific domain is time-consuming and tedious. There is a demand for value-added Web applications that integrate data from multiple sources. To facilitate the development of such applications, we need to develop techniques for automating the process of providing integrated access to a multitude of database-driven Web sites, and integrating data from their underlying databases. This presents three challenges, namely query form extraction, query form matching and translation, and Web query result extraction. In this thesis, I focus on Web query result extraction, which aims to extract structured data encoded in semi-structured HTML pages and return the extracted data in relational tables. I begin by reviewing the existing approaches for Web query result extraction. I categorize them based on their degree of automation, i.e. manual, semi-automatic and fully automatic approaches. For each category, every approach is described in terms of its technical features, followed by an analysis listing the advantages and limitations of the approach. The literature review leads to my proposed approaches, which address the Web data extraction problem, i.e. Web data record extraction, Web data alignment and Web data annotation. Each approach is presented in a chapter which includes the methodology, experiments and related work. The last chapter concludes the thesis.
APA, Harvard, Vancouver, ISO, and other styles
49

Doan, AnHai. "Learning to map between structured representations of data." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6968.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Bui, Dang Bach. "Mining complex structured data: Enhanced methods and applications." Thesis, Curtin University, 2015. http://hdl.handle.net/20.500.11937/480.

Full text
Abstract:
Conventional approaches to analysing complex business data typically rely on process models, which are difficult to construct and use. This thesis addresses this issue by converting semi-structured event logs to a simpler flat representation without any loss of information, which then enables direct applications of classical data mining methods. The thesis also proposes an effective and scalable classification method which can identify distinct characteristics of a business process for further improvements.
APA, Harvard, Vancouver, ISO, and other styles