
Dissertations / Theses on the topic 'Data extractions'



Consult the top 50 dissertations / theses for your research on the topic 'Data extractions.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Minh, Tuan Pham, Tomohiro Yoshikawa, Takeshi Furuhashi, and Kaita Tachibana. "Robust feature extractions from geometric data using geometric algebra." IEEE, 2009. http://hdl.handle.net/2237/13896.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Dou, Lixin. "Applications of Bayesian inference methods to time series data analysis and hyperfine parameter extractions in Mössbauer spectroscopy." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape9/PQDD_0020/NQ45170.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Dou, Lixin. "Applications of Bayesian inference methods to time series data analysis and hyperfine parameter extractions in Mossbauer spectroscopy." Thesis, University of Ottawa (Canada), 1999. http://hdl.handle.net/10393/8483.

Full text
Abstract:
The Bayesian statistical inference theory is studied and applied to two problems in applied physics: spectral analysis and parameter estimation in time series data, and hyperfine parameter extraction in Mossbauer spectroscopy. The applications to spectral analysis and parameter estimation for both single- and multiple-frequency signals are presented in detail. Specifically, the marginal posterior probabilities for the amplitudes and frequencies of the signals are obtained by using Gibbs sampling without performing the integration, whether or not the variance of the noise is known. The best estimates of the parameters can be inferred from these probabilities together with the corresponding variances. When the variance of the noise is unknown, an estimate of the noise variance can also be made. Comparisons of our results have been made with results obtained using the Fast Fourier Transform (FFT) method as well as Bretthorst's method. The same numerical approach is applied to more complicated models and conditions, such as periodic but non-harmonic signals, signals with decay, and signals with chirp. Results demonstrate that even under these complicated conditions, Bayesian inference and Gibbs sampling can still give very accurate results with respect to the true values. Through the use of Bayesian inference methods it is also possible to choose the most probable model within an assumed model space, based on known prior information about the data. The Bayesian inference theory is applied to hyperfine parameter extraction in Mossbauer spectroscopy for the first time. The method is a free-form model extraction approach and gives a full error analysis of the hyperfine parameter distributions. Two applications to quadrupole splitting distribution analysis in Fe-57 Mossbauer spectroscopy are presented. One involves a single Fe3+ site and the other involves two sites, for Fe3+ and Fe2+. In each case the method gives a unique solution for distributions of arbitrary shape and is not sensitive to the elemental doublet parameters. The Bayesian inference theory is also applied to hyperfine field distribution extraction. Because of the complexity of the elemental lineshape, all other extraction methods can only use the first-order perturbation sextet as the lineshape function. We use Blaes' exact lineshape model to extract the hyperfine field distribution, which is possible because Bayesian inference is a free-form model extraction method. By using Blaes' lineshape function, different orientations between the electric field gradient principal axis directions and the magnetic hyperfine field can be studied without making any approximations. As an example, the ground-state hyperfine field distribution of Fe65Ni35 Invar is studied extensively using the method, and some very interesting features of the hyperfine field distribution are identified.
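For readers unfamiliar with Bayesian spectral analysis, the sketch below illustrates the simplest version of the idea the abstract builds on: for a single-frequency model with the amplitudes and the noise variance marginalised out, the frequency posterior reduces to a function of the Schuster periodogram (Bretthorst's classic single-frequency result, which the thesis uses as a point of comparison). This is a grid evaluation on simulated data, not the thesis's Gibbs-sampling procedure; all signal parameters are made up.

```python
import numpy as np

# Minimal sketch (not the thesis's Gibbs sampler): grid evaluation of the
# Bretthorst-style posterior for d(t) = A*cos(w*t) + B*sin(w*t) + noise,
# with amplitudes and noise variance marginalised out.
rng = np.random.default_rng(0)
N = 200
t = np.arange(N)
true_w = 0.37
d = 1.5 * np.cos(true_w * t) + 0.8 * np.sin(true_w * t) + rng.normal(0.0, 1.0, N)

omegas = np.linspace(0.01, np.pi, 4000)
d2_bar = np.mean(d ** 2)
log_post = np.empty_like(omegas)
for k, w in enumerate(omegas):
    R = np.sum(d * np.cos(w * t))
    I = np.sum(d * np.sin(w * t))
    C_bar = (R ** 2 + I ** 2) / N                      # Schuster periodogram
    residual_fraction = max(1e-12, 1.0 - 2.0 * C_bar / (N * d2_bar))
    log_post[k] = (2 - N) / 2.0 * np.log(residual_fraction)

post = np.exp(log_post - log_post.max())
post /= np.trapz(post, omegas)                          # normalise the posterior
print("posterior mode:", omegas[np.argmax(post)], "true frequency:", true_w)
```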
APA, Harvard, Vancouver, ISO, and other styles
4

Shakir, Amer, Muhammad Hammad, and Muhammad Kamran. "Comparative Analysis & Study of Android/iOS MobileForensics Tools." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-44797.

Full text
Abstract:
This report draws a comparison between two commercial mobile forensics and recovery tools, Magnet AXIOM and MOBILedit. A review of previous studies helped identify which aspects of the data extractions must be compared and which areas are the most important to focus on. The work examines how the data extracted by one tool compares with the other and provides comprehensive extractions under different scenarios, circumstances, and aspects. The performance of both tools is compared against various benchmarks and criteria. The study establishes that MOBILedit outperforms Magnet AXIOM on more data extraction and recovery aspects and is therefore the comparatively better tool.
APA, Harvard, Vancouver, ISO, and other styles
5

Sottovia, Paolo. "Information Extraction from data." Doctoral thesis, Università degli studi di Trento, 2019. http://hdl.handle.net/11572/242992.

Full text
Abstract:
Data analysis is the process of inspecting, cleaning, extracting, and modeling data with the intention of extracting useful information to support users in their decisions. With the advent of Big Data, data analysis has become more complicated due to the volume and variety of the data. The process begins with the acquisition of the data and the selection of the data that is useful for the desired analysis. With such amounts of data, even expert users are unable to inspect the data and understand whether a dataset is suitable for their purposes. In this dissertation, we focus on five problems in the broad data analysis process that help users find insights in data they do not know well. First, we analyze the data description problem, where the user is looking for a description of the input dataset. We introduce data descriptions: compact, readable and insightful formulas of boolean predicates that represent a set of data records. Finding the best description for a dataset is computationally expensive and task-specific; we therefore introduce a set of metrics and heuristics for generating meaningful descriptions at interactive performance. Secondly, we look at the problem of order dependency discovery, which uncovers another kind of metadata that may help the user understand the characteristics of a dataset. Our approach leverages the observation that discovering order dependencies can be guided by the discovery of a more specific form of dependencies called order compatibility dependencies. Thirdly, textual data encodes much hidden information. To allow this data to reach its full potential, there has been increasing interest in extracting structural information from it. In this regard, we propose a novel approach for extracting events based on temporal co-reference among entities. We consider an event to be a set of entities that collectively experience relationships between them in a specific period of time. We developed a distributed strategy that is able to scale to the largest online encyclopedia available, Wikipedia. We then deal with the evolving nature of the data by focusing on the problem of finding synonymous attributes in evolving Wikipedia infoboxes. Over time, several attributes have been used to indicate the same characteristic of an entity, which raises several issues when analyzing the content of different time periods. To solve this, we propose a clustering strategy that combines two contrasting distance metrics. We developed an approximate solution that we assess over 13 years of Wikipedia history, demonstrating its flexibility and accuracy. Finally, we tackle the problem of identifying movements of attributes in evolving datasets. In an evolving environment, entities not only change their characteristics but sometimes exchange them over time. We propose a strategy that is able to discover those cases, and we test it on real datasets. We formally present the five problems, validate them in terms of both theoretical results and experimental evaluation, and demonstrate that the proposed approaches scale efficiently to large amounts of data.
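As a rough illustration of the "data description" idea above (a readable boolean formula that captures a set of records), here is a toy greedy search over attribute=value predicates. It is not the metrics-and-heuristics algorithm of the dissertation; the records and the scoring rule are invented for the example.

```python
# Toy illustration (not the thesis's algorithm): greedily build a disjunction of
# attribute=value predicates that covers a target set of records while touching
# as few non-target records as possible.
records = [
    {"id": 1, "country": "IT", "segment": "retail"},
    {"id": 2, "country": "IT", "segment": "corporate"},
    {"id": 3, "country": "FR", "segment": "retail"},
    {"id": 4, "country": "DE", "segment": "corporate"},
]
target_ids = {1, 2}                      # records the description should capture

def candidate_predicates(rows):
    """All attribute=value predicates occurring in the data (except the key)."""
    return {(attr, val) for row in rows for attr, val in row.items() if attr != "id"}

def covers(pred, row):
    attr, val = pred
    return row.get(attr) == val

description, uncovered = [], set(target_ids)
while uncovered:
    best, best_score = None, 0
    for pred in candidate_predicates(records):
        gained = {r["id"] for r in records if covers(pred, r)}
        score = len(gained & uncovered) - len(gained - target_ids)  # coverage minus noise
        if score > best_score:
            best, best_score = pred, score
    if best is None:
        break                            # no remaining predicate improves the description
    description.append(best)
    uncovered -= {r["id"] for r in records if covers(best, r)}

print(" OR ".join(f"{a} = '{v}'" for a, v in description))   # e.g. country = 'IT'
```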
APA, Harvard, Vancouver, ISO, and other styles
6

Raza, Ali. "Test Data Extraction and Comparison with Test Data Generation." DigitalCommons@USU, 2011. https://digitalcommons.usu.edu/etd/982.

Full text
Abstract:
Testing an integrated information system that relies on data from multiple sources can be a challenge, particularly when the data is confidential. This thesis describes a novel test data extraction approach, called semantic-based test data extraction for integrated systems (iSTDE) that solves many of the problems associated with creating realistic test data for integrated information systems containing confidential data. iSTDE reads a consistent cross-section of data from the production databases, manipulates that data to obscure individual identities while still preserving overall semantic data characteristics that are critical to thorough system testing, and then moves that test data to an external test environment. This thesis also presents a theoretical study that compares test-data extraction with a competing technique, named test-data generation. Specifically, this thesis a) describes a comparison method that includes a comprehensive list of characteristics essential for testing the database applications organized into seven different areas, b) presents an analysis of the relative strengths and weaknesses of the different test-data creation techniques, and c) reports a number of specific conclusions that will help testers make appropriate choices.
APA, Harvard, Vancouver, ISO, and other styles
7

Wackersreuther, Bianca. "Efficient Knowledge Extraction from Structured Data." Diss., lmu, 2011. http://nbn-resolving.de/urn:nbn:de:bvb:19-138079.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Thelen, Andrea. "Optimized surface extraction from holographic data." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=980418798.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Zhou, Yuanqiu. "Generating Data-Extraction Ontologies By Example." Diss., CLICK HERE for online access, 2005. http://contentdm.lib.byu.edu/ETD/image/etd1115.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Williams, Dean Ashley. "Combining data integration and information extraction." Thesis, Birkbeck (University of London), 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.499152.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Heys, Richard. "Extraction of anthropological data with ultrasound." Thesis, Brunel University, 2007. http://bura.brunel.ac.uk/handle/2438/7896.

Full text
Abstract:
Human body scanners used to extract anthropological data have a significant drawback: the subject is required to undress or wear tight-fitting clothing. This thesis demonstrates an ultrasonic alternative to the current optical systems that can potentially operate on a fully clothed subject. To validate the concept, several experiments were performed to determine the acoustic properties of multiple garments; the results indicated that such an approach is possible. Beamforming is introduced as a method by which the ultrasonic scanning area can be increased; the concept is thoroughly studied and a clear theoretical analysis is performed. In addition, Matlab has been used to demonstrate the results of this analysis graphically, providing an invaluable tool during the simulation, experimental, and results stages of the thesis. To evaluate beamforming as a component of ultrasonic body imaging, a hardware solution was necessary. During the concept phase, both FPGAs and digital signal processors were evaluated to determine their suitability for the role. An FPGA approach was finally chosen, which allows highly parallel operation, essential to the high acquisition speeds required by some beamforming methodologies. Analogue circuitry was also designed to provide an interface with the ultrasonic transducers, which included variable gain amplifiers, charge amplifiers, and signal conditioning. Finally, a digital acquisition card was used to transfer data between the FPGA and a desktop computer, on which the sampled data was processed and displayed in a coherent graphical manner. The beamforming results clearly demonstrate that imaging multiple layers in air with ultrasound is a viable technique for anthropological data collection. Furthermore, a wavelet-based method of improving the axial resolution is also proposed and demonstrated.
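The beamforming idea discussed above can be illustrated with a minimal delay-and-sum sketch for a uniform linear array: the received channels are time-aligned for a candidate steering angle and summed, and the angle that maximises output power is taken as the arrival direction. The element count, pitch, pulse and sampling rate below are placeholder values, not the parameters of the thesis hardware.

```python
import numpy as np

# Minimal delay-and-sum beamforming sketch for a uniform linear array of
# ultrasonic receivers (all numbers are illustrative assumptions).
c = 343.0            # speed of sound in air, m/s
fs = 200e3           # sampling rate, Hz
pitch = 5e-3         # element spacing, m
n_elem = 8
n_samp = 1024
t = np.arange(n_samp) / fs

# Simulate a 40 kHz pulse arriving from 20 degrees off broadside.
true_angle = np.deg2rad(20)
pulse = np.sin(2 * np.pi * 40e3 * t) * np.exp(-((t - 1e-3) / 1e-4) ** 2)
rx = np.zeros((n_elem, n_samp))
for m in range(n_elem):
    delay = m * pitch * np.sin(true_angle) / c          # arrival delay at element m
    rx[m] = np.interp(t - delay, t, pulse, left=0.0, right=0.0)

def delay_and_sum(rx, steer_angle):
    """Align the channels for a steering angle and sum them coherently."""
    out = np.zeros(n_samp)
    for m in range(n_elem):
        delay = m * pitch * np.sin(steer_angle) / c
        out += np.interp(t + delay, t, rx[m], left=0.0, right=0.0)
    return out / n_elem

angles = np.deg2rad(np.linspace(-60, 60, 121))
power = [np.sum(delay_and_sum(rx, a) ** 2) for a in angles]
print("peak response at", np.rad2deg(angles[int(np.argmax(power))]), "degrees")
```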
APA, Harvard, Vancouver, ISO, and other styles
12

Shunmugam, Nagarajan. "Operational data extraction using visual perception." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-292216.

Full text
Abstract:
The information era has led truck manufacturers and logistics solution providers to favour software-as-a-service (SaaS) based solutions. With advancements in software technologies such as artificial intelligence and deep learning, the domain of computer vision has achieved performance gains significant enough to compete with hardware-based solutions. Currently, data is collected from a large number of sensors, which increases production costs and the environmental carbon footprint. In addition, certain useful physical quantities and variables are impossible to measure or are very expensive to obtain. In this dissertation, we therefore investigate the feasibility of providing a similar solution using a single sensor (the dashboard camera) to measure multiple variables. This provides a sustainable solution even when scaled up to large fleets. The video frames collected from the visual perception of the truck (i.e. its on-board camera) are processed with deep learning techniques, and operational data is extracted from them. Techniques such as image classification and semantic segmentation were experimented with and show potential to replace costly hardware counterparts such as lidar- or radar-based solutions.
APA, Harvard, Vancouver, ISO, and other styles
13

Bigg, Daniel. "Unsupervised financial knowledge extraction." Available from the University of Aberdeen Library and Historic Collections Digital Resources. Online version available for University member only until Jan. 1, 2014, 2009. http://digitool.abdn.ac.uk:80/webclient/DeliveryManager?application=DIGITOOL-3&owner=resourcediscovery&custom_att_2=simple_viewer&pid=33589.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Jungbluth, Adolfo, and Jon Li Yeng. "Quality data extraction methodology based on the labeling of coffee leaves with nutritional deficiencies." Association for Computing Machinery, 2018. http://hdl.handle.net/10757/624685.

Full text
Abstract:
The full text of this work is not available in the UPC Academic Repository due to restrictions imposed by the publisher.
Nutritional deficiency detection for coffee leaves is a task that is often undertaken manually by field experts known as agronomists. The process they follow is based on observing the different characteristics of the coffee leaves while relying on their own experience. Visual fatigue and human error in this empirical approach cause leaves to be labelled incorrectly, affecting the quality of the data obtained. In this context, different crowdsourcing approaches can be applied to enhance the quality of the extracted data. These approaches separately propose the use of voting systems, association-rule filters, and evolutive learning. In this paper, we extend the use of association-rule filters and the evolutive approach by combining them in a methodology that enhances data quality while guiding users during the main stages of the data extraction tasks. Moreover, our methodology proposes a reward component to engage users and keep them motivated during the crowdsourcing tasks. Applying the proposed methodology in a case study on Peruvian coffee leaves yielded a dataset with 93.33% accuracy, with 30 instances collected by 8 experts and evaluated by 2 agronomic engineers with a background in coffee leaves. This accuracy was higher than that of independently implementing the evolutive feedback strategy or an empirical approach, which reached 86.67% and 70% respectively under the same conditions.
Peer reviewed
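Of the quality mechanisms the paper combines (voting, association-rule filters, evolutive feedback), the voting step is the easiest to illustrate. The sketch below keeps a crowd label only when a clear majority of annotators agrees; the labels, leaf identifiers and threshold are invented for the example and are not taken from the paper.

```python
from collections import Counter

# Minimal sketch of the voting idea only: accept a crowd label when a clear
# majority of annotators agrees, otherwise flag the item for review.
annotations = {
    "leaf_001": ["N deficiency", "N deficiency", "K deficiency"],
    "leaf_002": ["healthy", "healthy", "healthy"],
    "leaf_003": ["P deficiency", "K deficiency", "N deficiency"],
}

def majority_label(votes, threshold=0.6):
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= threshold else None   # None -> send back for review

for leaf, votes in annotations.items():
    print(leaf, "->", majority_label(votes))
```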
APA, Harvard, Vancouver, ISO, and other styles
15

Giess, Matthew. "Extracting information from manufacturing data using data mining methods." Thesis, University of Bath, 2006. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.432831.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Zhao, Zilong. "Extracting knowledge from macroeconomic data, images and unreliable data." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALT074.

Full text
Abstract:
System identification and machine learning are two similar concepts used independently in the control and computer science communities. System identification uses statistical methods to build mathematical models of dynamical systems from measured data. Machine learning algorithms build a mathematical model based on sample data, known as "training data" (clean or not), in order to make predictions or decisions without being explicitly programmed to do so. Besides prediction accuracy, convergence speed and stability are two other key factors for evaluating the training process, especially in the online learning scenario, and these properties have already been well studied in control theory. This thesis therefore carries out interdisciplinary research on the following topics. 1) System identification and optimal control of macroeconomic data: We first model Chinese macroeconomic data with a Vector Auto-Regression (VAR) model, then identify the cointegration relation between variables and use a Vector Error Correction Model (VECM) to study the short-term fluctuations around the long-term equilibrium; Granger causality is also studied with the VECM. This work reveals the trend of China's economic growth transition from export-oriented to consumption-oriented. Due to limitations of the Chinese economic data, we turn to French macroeconomic data in the second study. We represent the model in state space and put it into a feedback control framework, with the controller designed by a Linear-Quadratic Regulator (LQR). The system can apply the control law to bring the system to a desired state. We can also impose perturbations on outputs and constraints on inputs, which emulates the real-world situation of an economic crisis. Economists can then observe the recovery trajectory of the economy, which gives meaningful implications for policy-making. 2) Using control theory to improve the online learning of deep neural networks: We propose a performance-based learning rate algorithm, E (Exponential)/PD (Proportional Derivative) feedback control, which considers the Convolutional Neural Network (CNN) as the plant, the learning rate as the control signal, and the loss value as the error signal. Results show that E/PD outperforms the state of the art in final accuracy, final loss, and convergence speed, and the results are also more stable. One observation from the E/PD experiments, however, is that the learning rate decreases while the loss continuously decreases; but a decreasing loss means the model is approaching the optimum, so the learning rate should not be decreased. To prevent this, we propose an event-based E/PD. Results show that it improves E/PD in final accuracy, final loss, and convergence speed. Another observation from the E/PD experiments is that online learning fixes a constant number of training epochs for each batch. Since E/PD converges quickly, the significant improvement comes only from the initial epochs. We therefore propose another event-based E/PD, which inspects the historical loss; when the progress of training falls below a certain threshold, we move on to the next batch. Results show that this can save up to 67% of the epochs on the CIFAR-10 dataset without degrading performance much. 3) Machine learning from unreliable data: We propose a generic framework, Robust Anomaly Detector (RAD). The data selection part of RAD is a two-layer framework, where the first layer is used to filter out suspicious data and the second layer detects anomaly patterns from the remaining data. We also derive three variations of RAD, namely voting, active learning and slim, which use additional information such as the opinions of conflicting classifiers and queries to oracles. We iteratively update the historically selected data to improve accumulated data quality. Results show that RAD can continuously improve the model's performance in the presence of noise on the labels. The three variations of RAD all improve on the original setting, and RAD Active Learning performs almost as well as the case where there is no label noise.
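To make the E/PD idea more concrete, here is a schematic learning-rate controller in which the current loss acts as the error signal of a PD controller. The gains, clipping bounds and the max(·, 0) derivative term are illustrative choices, not the exact E/PD rule or the event-based variants evaluated in the thesis.

```python
# Schematic sketch of a PD-style learning-rate controller driven by the training
# loss (gains and bounds are illustrative assumptions, not the thesis's E/PD rule).
def pd_learning_rate(loss_history, kp=0.8, kd=0.4, lr_min=1e-5, lr_max=0.5):
    """Treat the current loss as the error signal of a PD controller."""
    error = loss_history[-1]
    d_error = loss_history[-1] - loss_history[-2] if len(loss_history) > 1 else 0.0
    lr = kp * error + kd * max(d_error, 0.0)   # derivative term reacts when loss goes back up
    return min(max(lr, lr_min), lr_max)

# Usage inside a hypothetical training loop (the per-batch losses are made up):
loss_history = []
for batch_loss in [2.3, 1.7, 1.2, 1.3, 0.9]:
    loss_history.append(batch_loss)
    lr = pd_learning_rate(loss_history)
    print(f"loss={batch_loss:.2f} -> learning rate={lr:.4f}")
```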
APA, Harvard, Vancouver, ISO, and other styles
17

Lee, Seungkyu Liu Yanxi. "Symmetry group extraction from multidimensional real data." [University Park, Pa.] : Pennsylvania State University, 2009. http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-4720/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

King, Brent. "Automatic extraction of knowledge from design data." Thesis, University of Sunderland, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.307964.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Guo, Jinsong. "Reducing human effort in web data extraction." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:04bd39dd-bfec-4c07-91db-980fcbc745ba.

Full text
Abstract:
The human effort in large-scale web data extraction significantly affects both the extraction flexibility and the economic cost. Our work aims to reduce the human effort required by web data extraction tasks in three specific scenarios. (I) Data demand is unclear, and the user has to guide the wrapper induction by annotations. To maximally save the human effort in the annotation process, wrappers should be robust, i.e., immune to the webpage's change, to avoid the wrapper re-generation which requires a re-annotation process. Existing approaches primarily aim at generating accurate wrappers but barely generate robust wrappers. We prove that the XPATH wrapper induction problem is NP-hard, and propose an approximate solution estimating a set of top-k robust wrappers in polynomial time. Our method also meets one additional requirement that the induction process should be noise resistant, i.e., tolerate slightly erroneous examples. (II) Data demand is clear, and the user's guide should be avoided, i.e., the wrapper generation should be fully-unsupervised. Existing unsupervised methods purely relying on the repeated patterns of HTML structures/visual information are far from being practical. Partially supervised methods, such as the state-of-the-art system DIADEM, can work well for tasks involving only a small number of domains. However, the human effort in the annotator preparation process becomes a heavier burden when the domain number increases. We propose a new approach, called RED (abbreviation for 'redundancy'), an automatic approach exploiting content redundancy between the result page and its corresponding detail pages. RED requires no annotation (thus requires no human effort) and its wrapper accuracy is significantly higher than that of previous unsupervised methods. (III) Data quality is unknown, and the user's related decisions are blind. Without knowing the error types and the error number of each type in the extracted data, the extraction effort could be wasted on useless websites, and even worse, the human effort could be wasted on unnecessary or wrongly-targeted data cleaning process. Despite the importance of error estimation, no methods have addressed it sufficiently. We focus on two types of common errors in web data, namely duplicates and violations of integrity constraints. We propose a series of error estimation approaches by adapting, extending, and synthesizing some recent innovations in diverse areas such as active learning, classifier calibration, F-measure estimation, and interactive training.
APA, Harvard, Vancouver, ISO, and other styles
20

Yang, Hui. "Data extraction in holographic particle image velocimetry." Thesis, Loughborough University, 2004. https://dspace.lboro.ac.uk/2134/35012.

Full text
Abstract:
Holographic Particle Image Velocimetry (HPIV) is potentially the best technique to obtain instantaneous, three-dimensional flow field information. Several researchers have presented experimental results demonstrating the power of the HPIV technique. However, the challenge of finding an economical and automatic means to extract and process the immense amount of data from the holograms still remains. This thesis reports on the development of complex amplitude correlation as a means of data extraction. At the same time, three-dimensional quantitative measurement of micro-scale flows is of increasing importance in the design of microfluidic devices, so this thesis also reports an investigation of HPIV in micro-scale fluid flow. The author has re-examined complex amplitude correlation using a formulation of scalar diffraction in three-dimensional vector space.
APA, Harvard, Vancouver, ISO, and other styles
21

Rangaraj, Jithendra Kumar. "Knowledge-based Data Extraction Workbench for Eclipse." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1354290498.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Ouahid, Hicham. "Data extraction from the Web using XML." Thesis, University of Ottawa (Canada), 2001. http://hdl.handle.net/10393/9260.

Full text
Abstract:
This thesis presents a mechanism based on the eXtensible Markup Language (XML) to extract data from HTML-based Web pages and populate relational databases. This task is performed by a system called the XML-based Web Agent (XWA). The data extraction is done in three phases. First, the Web pages are converted to well-formed XML documents to facilitate their processing. Second, the data is extracted from the well-formed XML documents and formatted into valid XML documents. Finally, the valid XML documents are mapped into tables to be stored in a relational database. To extract specific data from the Web, the XWA requires information about the Web pages from which to extract the data, the location of the data within the Web pages, and how the extracted data should be formatted. This information is stored in Web Site Ontologies, which are built using a language called the Web Ontology Description Language (WONDEL). WONDEL is based on XML and the XML Pointer Language. It was defined as part of this work to allow users to specify the data they want and let the XWA work offline to extract it and store it in a database. This has the advantage of saving users the time spent waiting for Web pages to download, and of letting them benefit from the powerful query mechanisms offered by database management systems.
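The three-phase pipeline described above (tidy the HTML, pull out the data, emit structured records) can be sketched with off-the-shelf tools. The snippet below uses lxml to parse messy HTML, select values with XPath and re-serialise them as XML; the page, table id and field names are invented, and the WONDEL ontology layer of the XWA is not reproduced.

```python
from lxml import etree, html

# Small sketch of the general idea (tidy HTML -> XPath extraction -> XML rows);
# the sample page and field names are assumptions made for the example.
page = """
<html><body>
  <table id="books">
    <tr><td>Data Extraction 101</td><td>19.90</td></tr>
    <tr><td>XML in Practice</td><td>24.50</td></tr>
  </table>
</body></html>
"""

doc = html.fromstring(page)                       # tolerant parser yields a well-formed tree
rows = []
for tr in doc.xpath('//table[@id="books"]//tr'):
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    if len(cells) == 2:
        rows.append({"title": cells[0], "price": cells[1]})

# Re-serialise as valid XML, ready to be mapped onto relational tables.
root = etree.Element("books")
for r in rows:
    book = etree.SubElement(root, "book")
    etree.SubElement(book, "title").text = r["title"]
    etree.SubElement(book, "price").text = r["price"]
print(etree.tostring(root, pretty_print=True).decode())
```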
APA, Harvard, Vancouver, ISO, and other styles
23

Bródka, Piotr. "Key User Extraction Based on Telecommunication Data." Thesis, Blekinge Tekniska Högskola, Sektionen för datavetenskap och kommunikation, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-5863.

Full text
Abstract:
The number of systems that collect vast amounts of data about users has grown rapidly during the last few years. Many of these systems contain data not only about people's characteristics but also about their relationships with other system users. From this kind of data it is possible to extract a social network that reflects the connections between the system's users. Moreover, the analysis of such a social network makes it possible to investigate different characteristics of its users and their linkages. One type of such analysis is key user extraction. Key users are those who have the biggest impact on other network users as well as a large influence on network evolution. Knowledge about these users makes it possible to investigate and predict changes within the network, so it is very important for the people or companies who profit from the network, such as a telecommunication company. The second important issue is the ability to extract these users as quickly as possible, i.e. to develop an algorithm that is time-effective on large social networks whose numbers of nodes and edges reach a few million.
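A toy version of the key-user idea: build a directed, weighted graph from call records and rank users with a centrality measure. PageRank is used here purely as an example measure; the call records are invented, and the thesis is concerned with which measures to use and how to compute them efficiently on networks with millions of nodes and edges.

```python
import networkx as nx

# Sketch: derive a call graph from (caller, callee) records and rank users.
calls = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("dave", "alice"), ("erin", "alice"), ("erin", "bob"),
]

g = nx.DiGraph()
for caller, callee in calls:
    if g.has_edge(caller, callee):
        g[caller][callee]["weight"] += 1      # repeated calls strengthen the tie
    else:
        g.add_edge(caller, callee, weight=1)

scores = nx.pagerank(g, weight="weight")
key_users = sorted(scores, key=scores.get, reverse=True)[:3]
print("key users:", key_users)
```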
APA, Harvard, Vancouver, ISO, and other styles
24

Murdoch, S. J. T. "Extracting speed signatures from gait data." Thesis, University of Strathclyde, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.549423.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Menzel-Jones, Cian John. "Extracting molecular information from spectroscopic data." Thesis, University of British Columbia, 2014. http://hdl.handle.net/2429/51476.

Full text
Abstract:
This thesis explores new ways to utilize molecular spectroscopic data in both the time and frequency domains. Operating within the Born-Oppenheimer approximation (BOA), we show how to obtain the signs of transition-dipole amplitudes from fluorescence line intensities. Using the amplitudes thus obtained, we give a method to extract highly accurate excited-state potential(s) and the transition-dipole(s) as a function of the nuclear displacements. The procedure, illustrated here for diatomic and triatomic molecules, is in principle applicable to any polyatomic system. We also extend this approach beyond the BOA and demonstrate applications involving bound-continuum transitions and double-minimum potentials. Furthermore, by using as input the measured energy level positions and the transition dipole moments (TDMs), we derive a scheme that completely determines the non-adiabatic coupling matrix between potential energy surfaces and the coordinate dependence of the coupling functions. We demonstrate results in a diatomic system with two spin-orbit-coupled potentials, whereby experimentally measured information, along with TDMs computed from the two corresponding diabatic potentials to the fully spin-orbit-coupled set of eigenstates, is used to extract the diagonal and off-diagonal spin-orbit coupling functions. Using time-resolved spectra, we show that bi-chromatic coherent control (BCC) enables the determination of the amplitudes (magnitudes and phases) of individual transition-dipole matrix elements in these non-adiabatic coupling situations. The present use of BCC induces quantum interference using two external laser fields to coherently deplete the population of different pairs of excited energy eigenstates. The BCC-induced depletion is supplemented by computing the Fourier integral of the time-resolved fluorescence at the beat frequencies of the two states involved. The combination of BCC and the Fourier transform enables the determination of the complex expansion coefficients of the wave packet in a basis of vibrational energy eigenstates from simple spontaneous fluorescence data.
APA, Harvard, Vancouver, ISO, and other styles
26

Paidipally, Anoop Rao. "Dynamic Data Extraction and Data Visualization with Application to the Kentucky Mesonet." TopSCHOLAR®, 2012. http://digitalcommons.wku.edu/theses/1160.

Full text
Abstract:
There is a need to integrate large-scale databases, high-performance computing engines, and geographical information system technologies into a user-friendly web interface as a platform for data visualization and customized statistical analysis. We present concepts and design ideas regarding dynamic data storage and extraction that make use of open-source computing and mapping technologies. We applied our methods to the Kentucky Mesonet automated weather mapping workflow. The main components of the workflow include a web-based interface, a robust database, and a computing infrastructure designed for both general users and power users such as modelers and researchers.
APA, Harvard, Vancouver, ISO, and other styles
27

Seegmiller, Ray D., Greg C. Willden, Maria S. Araujo, Todd A. Newton, Ben A. Abbott, and William A. Malatesta. "Automation of Generalized Measurement Extraction from Telemetric Network Systems." International Foundation for Telemetering, 2012. http://hdl.handle.net/10150/581647.

Full text
Abstract:
ITC/USA 2012 Conference Proceedings / The Forty-Eighth Annual International Telemetering Conference and Technical Exhibition / October 22-25, 2012 / Town and Country Resort & Convention Center, San Diego, California
In telemetric network systems, data extraction is often an after-thought. The data description frequently changes throughout the program so that last minute modifications of the data extraction approach are often required. This paper presents an alternative approach in which automation of measurement extraction is supported. The central key is a formal declarative language that can be used to configure instrumentation devices as well as measurement extraction devices. The Metadata Description Language (MDL) defined by the integrated Network Enhanced Telemetry (iNET) program, augmented with a generalized measurement extraction approach, addresses this issue. This paper describes the TmNS Data Extractor Tool, as well as lessons learned from commercial systems, the iNET program and TMATS.
APA, Harvard, Vancouver, ISO, and other styles
28

Morsey, Mohamed. "Efficient Extraction and Query Benchmarking of Wikipedia Data." Doctoral thesis, Universitätsbibliothek Leipzig, 2014. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-130593.

Full text
Abstract:
Knowledge bases are playing an increasingly important role for integrating information between systems and over the Web. Today, most knowledge bases cover only specific domains, they are created by relatively small groups of knowledge engineers, and it is very cost intensive to keep them up-to-date as domains change. In parallel, Wikipedia has grown into one of the central knowledge sources of mankind and is maintained by thousands of contributors. The DBpedia (http://dbpedia.org) project makes use of this large collaboratively edited knowledge source by extracting structured content from it, interlinking it with other knowledge bases, and making the result publicly available. DBpedia had and has a great effect on the Web of Data and became a crystallization point for it. Furthermore, many companies and researchers use DBpedia and its public services to improve their applications and research approaches. However, the DBpedia release process is heavy-weight and the releases are sometimes based on several months old data. Hence, a strategy to keep DBpedia always in synchronization with Wikipedia is highly required. In this thesis we propose the DBpedia Live framework, which reads a continuous stream of updated Wikipedia articles, and processes it. DBpedia Live processes that stream on-the-fly to obtain RDF data and updates the DBpedia knowledge base with the newly extracted data. DBpedia Live also publishes the newly added/deleted facts in files, in order to enable synchronization between our DBpedia endpoint and other DBpedia mirrors. Moreover, the new DBpedia Live framework incorporates several significant features, e.g. abstract extraction, ontology changes, and changesets publication. Basically, knowledge bases, including DBpedia, are stored in triplestores in order to facilitate accessing and querying their respective data. Furthermore, the triplestores constitute the backbone of increasingly many Data Web applications. It is thus evident that the performance of those stores is mission critical for individual projects as well as for data integration on the Data Web in general. Consequently, it is of central importance during the implementation of any of these applications to have a clear picture of the weaknesses and strengths of current triplestore implementations. We introduce a generic SPARQL benchmark creation procedure, which we apply to the DBpedia knowledge base. Previous approaches often compared relational and triplestores and, thus, settled on measuring performance against a relational database which had been converted to RDF by using SQL-like queries. In contrast to those approaches, our benchmark is based on queries that were actually issued by humans and applications against existing RDF data not resembling a relational schema. Our generic procedure for benchmark creation is based on query-log mining, clustering and SPARQL feature analysis. We argue that a pure SPARQL benchmark is more useful to compare existing triplestores and provide results for the popular triplestore implementations Virtuoso, Sesame, Apache Jena-TDB, and BigOWLIM. The subsequent comparison of our results with other benchmark results indicates that the performance of triplestores is by far less homogeneous than suggested by previous benchmarks. Further, one of the crucial tasks when creating and maintaining knowledge bases is validating their facts and maintaining the quality of their inherent data. 
This task includes several subtasks, and in this thesis we address two of the major ones, namely fact validation and provenance, and data quality. The fact validation and provenance subtask aims at providing sources for facts in order to ensure the correctness and traceability of the provided knowledge. It is often addressed by human curators in a three-step process: issuing appropriate keyword queries for the statement to check using standard search engines, retrieving potentially relevant documents, and screening those documents for relevant content. The drawbacks of this process are manifold. Most importantly, it is very time-consuming, as the experts have to carry out several search processes and must often read several documents. We present DeFacto (Deep Fact Validation), an algorithm for validating facts by finding trustworthy sources for them on the Web. DeFacto aims to provide an effective way of validating facts by supplying the user with relevant excerpts of webpages as well as useful additional information, including a score for the confidence DeFacto has in the correctness of the input fact. The data quality subtask, on the other hand, aims at evaluating and continuously improving the quality of the data in knowledge bases. We present a methodology for assessing the quality of knowledge bases' data, which comprises a manual and a semi-automatic process. The first phase includes the detection of common quality problems and their representation in a quality problem taxonomy. In the manual process, the second phase comprises the evaluation of a large number of individual resources according to the quality problem taxonomy via crowdsourcing. This process is accompanied by a tool in which a user assesses an individual resource and evaluates each fact for correctness. The semi-automatic process involves the generation and verification of schema axioms. We report the results obtained by applying this methodology to DBpedia.
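Since the abstract centres on querying DBpedia-style knowledge bases through SPARQL, a minimal client example may help. It uses the SPARQLWrapper library against the public DBpedia endpoint; the endpoint URL and the query are illustrative only and are not part of the thesis's benchmark query set.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Minimal example of querying a DBpedia SPARQL endpoint (query is illustrative).
endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    SELECT ?city ?population WHERE {
        ?city a dbo:City ;
              dbo:country dbr:Germany ;
              dbo:populationTotal ?population .
    }
    ORDER BY DESC(?population)
    LIMIT 5
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["city"]["value"], binding["population"]["value"])
```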
APA, Harvard, Vancouver, ISO, and other styles
29

Lin, Qingfen. "Enhancement, Extraction, and Visualization of 3D Volume Data." Doctoral thesis, Linköping : Univ, 2003. http://www.bibl.liu.se/liupubl/disp/disp2003/tek824s.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Palmer, David Donald. "Modeling uncertainty for information extraction from speech data /." Thesis, Connect to this title online; UW restricted, 2001. http://hdl.handle.net/1773/5834.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Tao, Cui. "Schema Matching and Data Extraction over HTML Tables." Diss., CLICK HERE for online access, 2003. http://contentdm.lib.byu.edu/ETD/image/etd279.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Gottlieb, Matthew. "Understanding malware autostart techniques with web data extraction /." Online version of thesis, 2009. http://hdl.handle.net/1850/10632.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Laidlaw, David H. Barr Alan H. "Geometric model extraction from magnetic resonance volume data /." Diss., Pasadena, Calif. : California Institute of Technology, 1995. http://resolver.caltech.edu/CaltechETD:etd-10152007-132141.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Pham, Nam Wilamowski Bogdan M. "Data extraction from servers by the Internet Robot." Auburn, Ala, 2009. http://hdl.handle.net/10415/1781.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Cheung, Jarvis T. "Representation and extraction of trends from process data." Thesis, Massachusetts Institute of Technology, 1992. http://hdl.handle.net/1721.1/13186.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Stachowiak, Maciej 1976. "Automated extraction of structured data from HTML documents." Thesis, Massachusetts Institute of Technology, 1998. http://hdl.handle.net/1721.1/9896.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.
Includes bibliographical references (leaf 45).
by Maciej Stachowiak.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
37

Lazzarini, Nicola. "Knowledge extraction from biomedical data using machine learning." Thesis, University of Newcastle upon Tyne, 2017. http://hdl.handle.net/10443/3839.

Full text
Abstract:
Thanks to the breakthroughs in biotechnologies that have occurred during recent years, biomedical data is accumulating at a previously unseen pace. In the field of biomedicine, decades-old statistical methods are still commonly used to analyse such data. However, the simplicity of these approaches often limits the amount of useful information that can be extracted from the data. Machine learning methods represent an important alternative due to their ability to capture complex patterns within the data that are likely missed by simpler methods. This thesis focuses on the extraction of useful knowledge from biomedical data using machine learning. Within the biomedical context, the vast majority of machine learning applications focus their effort on the generation and validation of prediction models. Rarely are the inferred models used to discover meaningful biomedical knowledge. The work presented in this thesis goes beyond this scenario and devises new methodologies to mine machine learning models for the extraction of useful knowledge. The thesis targets two important and challenging biomedical analytic tasks: (1) the inference of biological networks and (2) the discovery of biomarkers. The first task aims to identify associations between different biological entities, while the second tries to discover sets of variables that are relevant for specific biomedical conditions. Successful solutions for both problems rely on the ability to recognise complex interactions within the data, hence the use of multivariate machine learning methods. The network inference problem is addressed with FuNeL: a protocol to generate networks based on the analysis of rule-based machine learning models. The second task, biomarker discovery, is studied with RGIFE, a heuristic that exploits the information extracted from machine learning models to guide its search for minimal subsets of variables. The extensive analysis conducted for this dissertation shows that the networks inferred with FuNeL capture relevant knowledge complementary to that extracted by standard inference methods. Furthermore, the associations defined by FuNeL are found to be more pertinent in a disease context. The biomarkers selected by RGIFE are found to be disease-relevant and to have high predictive power. When applied to osteoarthritis data, RGIFE confirmed the importance of previously identified biomarkers, whilst also extracting novel biomarkers with possible future clinical applications. Overall, the thesis shows new effective methods to leverage the information, often remaining buried, encapsulated within machine learning models, and to discover useful biomedical knowledge.
APA, Harvard, Vancouver, ISO, and other styles
38

Novelli, Noël. "Extraction de dépendances fonctionnelles : une approche Data Mining." Aix-Marseille 2, 2000. http://www.theses.fr/2000AIX22071.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Jiang, Ji Chu. "High Precision Deep Learning-Based Tabular Data Extraction." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/41699.

Full text
Abstract:
The advancements of AI methodologies and computing power enable automation and propel the Industry 4.0 phenomenon. Information and data are digitized more than ever, and millions of documents are being processed every day, fueled by the growth of institutions, organizations, and their supply chains. Processing documents is a time-consuming, laborious task, so automating data processing is highly important for optimizing supply chain efficiency across all industries. Document analysis for data extraction is an impactful field, and this thesis aims to deliver the vital steps of an ideal data extraction pipeline. Data is often stored in tables, since this is a structured format in which the user can easily associate values and attributes. Tables can contain vital information such as specifications, dimensions, cost, etc. Focusing on table analysis and recognition in documents is therefore a cornerstone of data extraction. This thesis applies deep learning methodologies to the two main problems within table analysis for data extraction: table detection and table structure detection. Table detection is identifying and localizing the boundaries of the table. The output of the table detection model is fed into the table structure detection model for structure format analysis, so the table detection model must have high localization performance; otherwise it would affect the rest of the data extraction pipeline. Our table detection improves bounding box localization performance by incorporating a Kullback-Leibler loss function that measures the divergence between the probability distributions of the ground-truth and predicted bounding boxes, and by adding a voting procedure to the non-maximum suppression step to produce better-localized merged bounding box proposals. This model improved the precision of table detection by 1.2% while achieving the same recall as other state-of-the-art models on the public ICDAR2013 dataset, and achieved state-of-the-art results of 99.8% precision on the ICDAR2017 dataset. Furthermore, our model showed large improvements especially at higher intersection-over-union (IoU) thresholds; at 95% IoU, an improvement of 10.9% can be seen for the ICDAR2013 dataset and an improvement of 8.4% for the ICDAR2017 dataset. Table structure detection is recognizing the internal layout of a table. Researchers often approach this by detecting the rows and columns. However, for correct mapping of each individual cell's data location in the semantic extraction step, the rows and columns would have to be combined to form a matrix, which introduces additional degrees of error. We instead propose a model that directly detects each individual cell. Our model is an ensemble of state-of-the-art components: Hybrid Task Cascade as the detector and dual ResNeXt101 backbones arranged in a CBNet architecture. There is a lack of quality labeled data for table cell structure detection, so we hand-labeled the ICDAR2013 dataset and wish to establish a strong baseline for it. Our model was compared with other state-of-the-art models that excel at table or table structure detection, and yielded a precision of 89.2% and a recall of 98.7% on the ICDAR2013 cell structure dataset.
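Since the reported gains are largest at strict IoU thresholds, a small helper for computing intersection-over-union and checking matches at several thresholds may clarify what is being measured. The boxes below are made-up [x1, y1, x2, y2] values; this is an evaluation helper only, not the detection model itself.

```python
# Intersection-over-union between axis-aligned boxes and a match check at
# strict IoU thresholds (sample boxes are invented for the illustration).
def iou(box_a, box_b):
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

predictions = [[48, 100, 410, 305], [60, 420, 400, 600]]
ground_truth = [[50, 102, 408, 300], [70, 430, 395, 590]]

for thr in (0.5, 0.75, 0.95):
    hits = sum(any(iou(p, g) >= thr for g in ground_truth) for p in predictions)
    print(f"IoU >= {thr}: {hits}/{len(predictions)} predictions matched")
```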
APA, Harvard, Vancouver, ISO, and other styles
40

Einstein, Noah. "SmartHub: Manual Wheelchair Data Extraction and Processing Device." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1555352793977171.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Müglich, Marcel. "Motion Feature Extraction of Video and Movie Data." Thesis, KTH, Numerisk analys, NA, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-214030.

Full text
Abstract:
Since the Video on Demand market is growing quickly in terms of both available content and user numbers, the task arises of matching personally relevant content to each individual user. This problem is tackled by implementing a recommendation system that finds relevant content by automatically detecting patterns in the individual user's behaviour. To find such patterns, either collaborative filtering, which evaluates patterns of user groups to draw conclusions about a single user's preferences, or content-based strategies can be applied. Content-based strategies analyze the movies watched by the individual user and extract quantifiable information from them. This information can be utilized to find relevant movies with similar features. The focus of this thesis lies on the extraction of motion features from movie and video data. Three feature extraction methods are presented and evaluated, which classify camera movement, estimate motion intensity, and detect film transitions.
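One of the three motion features, motion intensity, can be approximated with dense optical flow, as sketched below using OpenCV's Farneback estimator: the mean flow magnitude per frame serves as a simple intensity signal. The video filename is a placeholder, and this illustrates the general technique rather than the exact feature extractors evaluated in the thesis.

```python
import cv2
import numpy as np

# Sketch of motion-intensity estimation with dense optical flow
# (the file name is a placeholder; parameters are OpenCV defaults-style values).
cap = cv2.VideoCapture("movie.mp4")
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not open movie.mp4")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

intensities = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)       # per-pixel motion magnitude
    intensities.append(float(magnitude.mean()))    # one motion-intensity value per frame
    prev_gray = gray

cap.release()
print("mean motion intensity:", np.mean(intensities))
```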
APA, Harvard, Vancouver, ISO, and other styles
42

García-Martín, Eva. "Extraction and Energy Efficient Processing of Streaming Data." Licentiate thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-15532.

Full text
Abstract:
The interest in machine learning algorithms is increasing, in parallel with the advancements in hardware and software required to mine large-scale datasets. Machine learning algorithms account for a significant amount of the energy consumed in data centers, which impacts global energy consumption. However, machine learning algorithms are optimized towards predictive performance and scalability. Algorithms with low energy consumption are necessary for embedded systems and other resource-constrained devices, and desirable for platforms that require many computations, such as data centers. Data stream mining investigates how to process potentially infinite streams of data without the need to store all the data. This ability is particularly useful for companies that are generating data at a high rate, such as social networks. This thesis investigates algorithms in the data stream mining domain from an energy efficiency perspective. The thesis comprises two parts. The first part explores how to extract and analyze data from Twitter, with a pilot study that investigates a correlation between hashtags and followers. The second and main part investigates how energy is consumed and optimized in an online learning algorithm suitable for data stream mining tasks. This part focuses on analyzing, understanding, and reformulating the Very Fast Decision Tree (VFDT) algorithm, the original Hoeffding tree algorithm, into an energy efficient version. It presents three key contributions. First, it shows how energy varies in the VFDT from a high-level view by tuning different parameters. Second, it presents a methodology to identify energy bottlenecks in machine learning algorithms, by portraying the functions of the VFDT that consume the largest amount of energy. Third, it introduces dynamic parameter adaptation for Hoeffding trees, a method to dynamically adapt the parameters of Hoeffding trees to reduce their energy consumption. The results show an average energy reduction of 23% on the VFDT algorithm.
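For context, the bound that gives Hoeffding trees their name, and with which the VFDT's tuned parameters (such as the tie threshold) interact, can be sketched as follows; the function names and the simplified split rule are illustrative only, not the thesis's code:

    import math

    def hoeffding_bound(value_range, delta, n_samples):
        # With probability 1 - delta, the observed mean of n samples lies within
        # epsilon = sqrt(R^2 * ln(1/delta) / (2n)) of the true mean.
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n_samples))

    def should_split(gain_best, gain_second, value_range, delta, n, tau=0.05):
        # VFDT-style decision: split when the best attribute beats the runner-up
        # by more than the bound, or when the bound falls below the tie threshold tau.
        eps = hoeffding_bound(value_range, delta, n)
        return (gain_best - gain_second > eps) or (eps < tau)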
Scalable resource-efficient systems for big data analytics
APA, Harvard, Vancouver, ISO, and other styles
43

Nziga, Jean-Pierre. "Incremental Sparse-PCA Feature Extraction For Data Streams." NSUWorks, 2015. http://nsuworks.nova.edu/gscis_etd/365.

Full text
Abstract:
Intruders attempt to penetrate commercial systems daily and cause considerable financial losses for individuals and organizations. Intrusion detection systems monitor network events to detect computer security threats. An extensive amount of network data is devoted to detecting malicious activities. Storing, processing, and analyzing the massive volume of data is costly and indicates the need to find efficient methods to perform network data reduction that does not require the data to be first captured and stored. A better approach allows the extraction of useful variables from data streams in real time and in a single pass. The removal of irrelevant attributes reduces the data to be fed to the intrusion detection system (IDS) and shortens the analysis time while improving the classification accuracy. This dissertation introduces an online, real-time data processing method for knowledge extraction. This incremental feature extraction is based on two approaches. First, Chunk Incremental Principal Component Analysis (CIPCA) detects intrusion in data streams. Then, two novel incremental feature extraction methods, Incremental Structured Sparse PCA (ISSPCA) and Incremental Generalized Power Method Sparse PCA (IGSPCA), find malicious elements. Metrics helped compare the performance of all methods. The IGSPCA was found to perform as well as or better than CIPCA overall in terms of dimensionality reduction, classification accuracy, and learning time. ISSPCA yielded better results for higher chunk values and greater accumulation ratio thresholds. CIPCA and IGSPCA reduced the IDS dataset to 10 principal components as opposed to 14 eigenvectors for ISSPCA. ISSPCA is more expensive in terms of learning time in comparison to the other techniques. This dissertation presents new methods that perform feature extraction from continuous data streams to find the small number of features necessary to express the most data variance. Data subsets derived from a few important variables render their interpretation easier. Another goal of this dissertation was to propose incremental sparse PCA algorithms capable of processing data with concept drift and concept shift. Experiments using WaveForm and WaveFormNoise datasets confirmed this ability. Similar to CIPCA, the ISSPCA and IGSPCA updated eigen-axes as a function of the accumulation ratio value, forming an informative eigenspace with few eigenvectors.
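The dissertation's CIPCA, ISSPCA and IGSPCA variants add sparsity and accumulation-ratio control, which the sketch below does not attempt; it only illustrates the underlying chunk-incremental PCA idea using scikit-learn, with placeholder names:

    from sklearn.decomposition import IncrementalPCA

    # Plain chunk-incremental PCA: each chunk (at least n_components rows) updates
    # the eigenspace and is then projected, so the stream is never stored in full.
    ipca = IncrementalPCA(n_components=10)

    def process_chunk(chunk):
        ipca.partial_fit(chunk)        # update the retained eigenvectors
        return ipca.transform(chunk)   # reduced features passed on to the IDS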
APA, Harvard, Vancouver, ISO, and other styles
44

Xiang, Deliang. "Urban Area Information Extraction From Polarimetric SAR Data." Doctoral thesis, KTH, Skolan för arkitektur och samhällsbyggnad (ABE), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-187951.

Full text
Abstract:
Polarimetric Synthetic Aperture Radar (PolSAR) has been used for various remote sensing applications since more information can be obtained from multiple polarizations. The overall objective of this thesis is to investigate urban area information extraction from PolSAR data with the following specific objectives: (1) to exploit polarimetric scattering model-based decomposition methods for urban areas, (2) to investigate effective methods for man-made target detection, (3) to develop edge detection and superpixel generation methods, and (4) to investigate urban area classification and segmentation. Paper 1 proposes a new scattering coherency matrix to model the cross-polarized scattering component from urban areas, which adaptively considers the polarization orientation angles of buildings. Thus, the HV scattering components from forests and from oriented urban areas can be modelled separately. Paper 2 presents two urban area decompositions using this scattering model. After the decomposition, urban scattering components can be effectively extracted. Paper 3 presents an improved man-made target detection method for PolSAR data based on nonstationarity and asymmetry. Reflection asymmetry was incorporated into the azimuth nonstationarity extraction method to improve the man-made target detection accuracy, i.e., removing the natural areas and detecting the small targets. In Paper 4, edge detection for PolSAR data was investigated using a SIRV model and a Gauss-shaped filter. This detector can locate edge pixels accurately with fewer omissions, which could be useful for speckle noise reduction, superpixel generation and other tasks. Paper 5 investigates an unsupervised classification method for PolSAR data in urban areas, in which ortho and oriented buildings can be discriminated very well. Paper 6 proposes an adaptive superpixel generation method for PolSAR images. The algorithm produces compact superpixels that adhere well to image boundaries in both natural and urban areas.
Polarimetric Synthetic Aperture Radar (PolSAR) has been used for various remote sensing applications, since more information can be obtained from multi-polarization data. The overall aim of this thesis is to investigate information extraction over urban areas from PolSAR data, with the following specific objectives: (1) to exploit polarimetric scattering-model-based decomposition methods for urban areas, (2) to investigate effective methods for detecting man-made objects, (3) to develop methods for edge detection and superpixel generation, and (4) to investigate classification and segmentation of urban areas. Paper 1 proposes a new scattering coherency matrix to model the cross-polarized scattering component from urban areas, which adaptively evaluates the polarization orientation angles of buildings. Paper 2 presents the decomposition technique over urban areas using this scattering model; after decomposition, the urban scattering components could be extracted effectively. Paper 3 presents an improved detection method for man-made targets in PolSAR data based on non-stationarity and asymmetry; reflection asymmetry was integrated into the non-stationarity method to improve the accuracy of man-made object detection, i.e., removing natural areas and detecting small objects. In Paper 4, edge detection in PolSAR data was investigated using a SIRV model and a Gaussian-shaped filter; this detector can locate edge pixels accurately with fewer omissions, which is useful for noise reduction, superpixel generation and more. Paper 5 explores an unsupervised classification method for PolSAR data over urban areas; ortho and oriented buildings can be distinguished very well. Building on Paper 4, Paper 6 proposes an adaptive superpixel generation method for PolSAR data; the algorithm produces compact superpixels that adhere well to image boundaries in both natural and urban areas.
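As background for the decompositions mentioned above, here is a minimal sketch (illustrative only, not the thesis's model) of how a per-pixel coherency matrix is formed from the scattering components that model-based decompositions then split into contributions:

    import numpy as np

    def coherency_matrix(s_hh, s_hv, s_vv):
        # Pauli scattering vector and 3x3 coherency matrix T = k k^H for one pixel;
        # model-based decompositions express T as a sum of surface, double-bounce,
        # volume and (here) cross-polarised urban scattering contributions.
        k = (1.0 / np.sqrt(2.0)) * np.array([s_hh + s_vv, s_hh - s_vv, 2.0 * s_hv])
        return np.outer(k, np.conj(k))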

APA, Harvard, Vancouver, ISO, and other styles
45

Alves, Ricardo João de Freitas. "Declarative approach to data extraction of web pages." Master's thesis, Faculdade de Ciências e Tecnologia, 2009. http://hdl.handle.net/10362/5822.

Full text
Abstract:
Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science
In the last few years, we have been witnessing a noticeable Web evolution with the introduction of significant improvements at the technological level, such as the emergence of XHTML, CSS, Javascript, and Web 2.0, just to name a few. This, combined with other factors such as the physical expansion of the Web and its low cost, has been a great motivator for organizations and the general public to join, with a consequent growth in the number of users, thus influencing the volume of the largest global data repository. In consequence, there is an increasing need for regular data acquisition from the Web which, because of its frequency, length or complexity, is only viable through automatic extractors. However, two main difficulties are inherent to automatic extractors. First, much of the Web's information is presented in visual formats mainly directed at human reading. Second, dynamic web pages are assembled in local memory from different sources, causing some pages not to have a source file. Therefore, this thesis proposes a new and more modern extractor, capable of supporting the Web's evolution, generic enough to be used in any situation, and capable of being extended and easily adapted to more particular uses. This project is an extension of an earlier one which was capable of extractions from semi-structured text files. It has evolved into a modular extraction system capable of extracting data from web pages and semi-structured text files, and it can be expanded to support other data source types. It also contains a more complete and generic validation system and a new data delivery system capable of performing the earlier deliveries as well as new generic ones. A graphical editor was also developed to support the extraction system features and to allow a domain expert without computer knowledge to create extractions with only a few simple and intuitive interactions on the rendered web page.
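A minimal sketch of the declarative style described, assuming static HTML and using a placeholder URL and CSS selectors (the thesis's own system is graphical, modular and handles dynamically rendered pages, which this does not):

    import requests
    from bs4 import BeautifulSoup

    # Declarative rule set: each field is described by a CSS selector rather than
    # by imperative parsing code (URL and selectors are placeholders).
    RULES = {"title": "h1.product-title", "price": "span.price"}

    def extract(url, rules=RULES):
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        return {field: [el.get_text(strip=True) for el in soup.select(selector)]
                for field, selector in rules.items()}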
APA, Harvard, Vancouver, ISO, and other styles
46

Wessman, Alan E. "A Framework for Extraction Plans and Heuristics in an Ontology-Based Data-Extraction System." Diss., CLICK HERE for online access, 2005. http://contentdm.lib.byu.edu/ETD/image/etd684.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Chartrand, Timothy Adam. "Ontology-Based Extraction of RDF Data from the World Wide Web." BYU ScholarsArchive, 2003. https://scholarsarchive.byu.edu/etd/56.

Full text
Abstract:
The simplicity and proliferation of the World Wide Web (WWW) has taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hindrance to the Semantic Web is the lack of existing semantically marked-up data. Until there is a critical mass of Semantic Web data, few people will develop and use Semantic Web applications. This project helps promote the Semantic Web by providing content. We apply existing information-extraction techniques, in particular the BYU ontology-based data-extraction system, to extract information from the WWW based on a Semantic Web ontology, producing Semantic Web data with respect to that ontology. As an example of how the generated Semantic Web data can be used, we provide an application to browse the extracted data and the source documents together. In this sense, the extracted data is superimposed over, or is an index over, the source documents. Our experiments with ontologies in four application domains show that our approach can indeed extract Semantic Web data from the WWW with precision and recall similar to that achieved by the underlying information extraction system, and make that data accessible to Semantic Web applications.
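As a small illustration of the output side only (the extraction step itself is not shown), a record produced by an ontology-based extractor could be serialized as RDF roughly as follows; the namespace, class and property names are invented for this sketch:

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/ontology#")   # invented ontology namespace

    def record_to_rdf(record_id, fields):
        # Turn one extracted record (a dict of field name -> value) into triples.
        g = Graph()
        subject = URIRef(f"http://example.org/data/{record_id}")
        g.add((subject, RDF.type, EX.ExtractedItem))
        for name, value in fields.items():
            g.add((subject, EX[name], Literal(value)))
        return g.serialize(format="turtle")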
APA, Harvard, Vancouver, ISO, and other styles
48

Selig, Henny. "Continuous Event Log Extraction for Process Mining." Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210710.

Full text
Abstract:
Process mining is the application of data science technologies on transactional business data to identify or monitor processes within an organization. The analyzed data often originates from process-unaware enterprise software, e.g. Enterprise Resource Planning (ERP) systems. The differences in data management between ERP and process mining systems result in a large fraction of ambiguous cases, affected by convergence and divergence. The consequence is a chasm between the process as interpreted by process mining, and the process as executed in the ERP system. In this thesis, a purchasing process of an SAP ERP system is used to demonstrate, how ERP data can be extracted and transformed into a process mining event log that expresses ambiguous cases as accurately as possible. As the content and structure of the event log already define the scope (i.e. which process) and granularity (i.e. activity types), the process mining results depend on the event log quality. The results of this thesis show how the consideration of case attributes, the notion of a case and the granularity of events can be used to manage the event log quality. The proposed solution supports continuous event extraction from the ERP system.
Process mining is the use of data science techniques on transactional data to identify or monitor processes within an organization. The analyzed data often originates from process-unaware enterprise software, such as SAP systems, which are centered on business documents. The differences in data management between Enterprise Resource Planning (ERP) and process mining systems result in a large share of ambiguous cases, affected by convergence and divergence. This results in a gap between the process as interpreted by process mining and the process as executed in the ERP system. In this thesis, a purchasing process in an SAP ERP system is used to show how ERP data can be extracted and transformed into a process-mining-oriented event log that expresses ambiguous cases as precisely as possible. Since the content and structure of the event log already define the scope (which process) and granularity (the activity types), the process mining result depends on the quality of the event log. The results of this thesis show how definitions of the notion of a case and the granularity of events can be used to improve that quality. The described solution supports continuous event log extraction from the ERP system.
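A minimal sketch of the transformation step described above, assuming the case notion is the purchase order number and using invented table and column names (it ignores the convergence and divergence handling that the thesis is actually concerned with):

    import pandas as pd

    def build_event_log(po_headers, goods_receipts, invoices):
        # Flatten three ERP document tables into one event log with the usual
        # case id / activity / timestamp columns (table and column names are
        # invented, not actual SAP table names).
        events = []
        for activity, table, ts_col in [("Create Purchase Order", po_headers, "created_at"),
                                        ("Receive Goods", goods_receipts, "posted_at"),
                                        ("Receive Invoice", invoices, "posted_at")]:
            part = table[["po_number", ts_col]].rename(columns={ts_col: "timestamp"})
            part["activity"] = activity
            events.append(part)
        log = pd.concat(events).rename(columns={"po_number": "case_id"})
        return log.sort_values(["case_id", "timestamp"]).reset_index(drop=True)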
APA, Harvard, Vancouver, ISO, and other styles
49

Lord, Dale, and Kurt Kosbar. "An Architecture for Sensor Data Fusion to Reduce Data Transmission Bandwidth." International Foundation for Telemetering, 2004. http://hdl.handle.net/10150/605790.

Full text
Abstract:
International Telemetering Conference Proceedings / October 18-21, 2004 / Town & Country Resort, San Diego, California
Sensor networks can demand large amounts of bandwidth if the raw sensor data is transferred to a central location. Feature recognition and sensor fusion algorithms can reduce this bandwidth. Unfortunately, the designers of the system, having not yet seen the data which will be collected, may not know which algorithms should be used at the time the system is first installed. This paper describes a flexible architecture which allows the deployment of data reduction algorithms throughout the network while the system is in service. The networked-sensor approach not only allows signal processing to be pushed closer to the sensor, but also helps accommodate extensions to the system in a very efficient and structured manner.
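As a toy example of the kind of data reduction algorithm such an architecture might deploy to a node after installation (the features and names are placeholders, not taken from the paper), a raw sample window can be replaced by a few summary features before transmission:

    import numpy as np

    def summarize_window(samples, sample_rate):
        # Replace a raw sample window with a few summary features before
        # transmission; the chosen features stand in for whatever recognition
        # or fusion algorithm is later deployed to the node.
        spectrum = np.abs(np.fft.rfft(samples))
        freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
        return {"mean": float(samples.mean()),
                "peak": float(samples.max()),
                "rms": float(np.sqrt(np.mean(samples ** 2))),
                "dominant_hz": float(freqs[spectrum.argmax()])}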
APA, Harvard, Vancouver, ISO, and other styles
50

Lord, Dale. "Relational Database for Visual Data Management." International Foundation for Telemetering, 2005. http://hdl.handle.net/10150/604893.

Full text
Abstract:
ITC/USA 2005 Conference Proceedings / The Forty-First Annual International Telemetering Conference and Technical Exhibition / October 24-27, 2005 / Riviera Hotel & Convention Center, Las Vegas, Nevada
Often it is necessary to retrieve segments of video with certain characteristics, or features, from a large archive of footage. This paper discusses how image processing algorithms can be used to automatically create a relational database which indexes the video archive. This feature extraction can be performed either upon acquisition or in post-processing. The database can then be queried to quickly locate and recover video segments with certain specified key features.
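A minimal sketch of such an index, with an invented schema and feature vocabulary (the paper does not prescribe a specific schema), might look like this:

    import sqlite3

    # Invented schema: one row per video segment, one row per detected feature,
    # so segments with a given feature can be located with a simple join.
    schema = """
    CREATE TABLE IF NOT EXISTS segment (
        id INTEGER PRIMARY KEY,
        source_file TEXT NOT NULL,
        start_frame INTEGER,
        end_frame INTEGER
    );
    CREATE TABLE IF NOT EXISTS segment_feature (
        segment_id INTEGER REFERENCES segment(id),
        feature TEXT,          -- e.g. 'camera_pan', 'object:vehicle'
        confidence REAL
    );
    """
    conn = sqlite3.connect("video_index.db")
    conn.executescript(schema)

    query = """SELECT s.source_file, s.start_frame, s.end_frame
               FROM segment s JOIN segment_feature f ON f.segment_id = s.id
               WHERE f.feature = ? AND f.confidence > 0.8"""
    segments = conn.execute(query, ("camera_pan",)).fetchall()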
APA, Harvard, Vancouver, ISO, and other styles