Theses on the topic "Metadata mining"

Follow this link to see other types of publications on the topic: Metadata mining.

Cite a source in APA, MLA, Chicago, Harvard, and many other styles.

Browse the top 22 dissertations (graduate and doctoral theses) for research on the topic "Metadata mining".

Next to every source in the reference list there is an "Add to bibliography" button. Press it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scientific publication as a .pdf file and read its abstract online, if it is included in the metadata.

Browse theses from many scientific disciplines and compile a correct bibliography.

1

Demšar, Urška. "Exploring geographical metadata by automatic and visual data mining". Licentiate thesis, KTH, Infrastructure, 2004. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-1779.

Full text
Abstract:

Metadata are data about data. They describe characteristics and content of an original piece of data. Geographical metadata describe geospatial data: maps, satellite images and other geographically referenced material. Such metadata have two characteristics, high dimensionality and diversity of attribute data types, which present a problem for traditional data mining algorithms.

Other problems that arise during the exploration of geographical metadata are linked to the expertise of the user performing the analysis. The large amounts of metadata and hundreds of possible attributes limit the exploration for a non-expert user, which results in a potential loss of information that is hidden in metadata.

In order to solve some of these problems, this thesis presents an approach for exploration of geographical metadata by a combination of automatic and visual data mining.

Visual data mining is a principle that involves the human in the data exploration by presenting the data in some visual form, allowing the human to get insight into the data and to recognise patterns. The main advantages of visual data exploration over automatic data mining are that the visual exploration allows a direct interaction with the user, that it is intuitive and does not require complex understanding of mathematical or statistical algorithms. As a result the user has a higher confidence in the resulting patterns than if they were produced by computer only.

In the thesis we present the Visual data mining tool (VDM tool), which was developed for exploration of geographical metadata for site planning. The tool provides five different visualisations: a histogram, a table, a pie chart, a parallel coordinates visualisation and a clustering visualisation. The visualisations are connected using the interactive selection principle called brushing and linking.

In the VDM tool the visual data mining concept is integrated with an automatic data mining method, clustering, which finds a hierarchical structure in the metadata, based on similarity of metadata items. In the thesis we present a visualisation of the hierarchical structure in the form of a snowflake graph.
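The clustering step described above lends itself to a compact illustration. The sketch below is not the VDM tool itself: it only shows hierarchical (agglomerative) clustering of a handful of metadata items over invented numeric features, the kind of structure the thesis then renders as a snowflake graph. Item names, feature values and the SciPy-based implementation are assumptions made for illustration.

```python
# A minimal sketch of hierarchical clustering of metadata items by similarity;
# the feature values and item names are purely hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy numeric features derived from geographical metadata records
# (e.g., scale, extent area, number of layers).
items = ["map_A", "map_B", "satellite_C", "satellite_D", "dem_E"]
features = np.array([
    [1.0, 0.20, 3.0],
    [1.1, 0.25, 3.0],
    [5.0, 4.00, 1.0],
    [5.2, 4.10, 1.0],
    [9.0, 0.50, 7.0],
])

# Agglomerative clustering produces the hierarchical structure that the
# thesis visualises as a snowflake graph.
Z = linkage(features, method="average", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")
for name, label in zip(items, labels):
    print(f"{name}: cluster {label}")
```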

Keywords: visualisation, data mining, clustering, tree drawing, geographical metadata.

APA, Harvard, Vancouver, ISO, and other styles
2

Tang, Yaobin. "Butterfly -- A model of provenance". Worcester, Mass. : Worcester Polytechnic Institute, 2009. http://www.wpi.edu/Pubs/ETD/Available/etd-031309-095511/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Ramakrishnan, Cartic. "Extracting, Representing and Mining Semantic Metadata from Text: Facilitating Knowledge Discovery in Biomedicine". Wright State University / OhioLINK, 2008. http://rave.ohiolink.edu/etdc/view?acc_num=wright1222021939.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Dong, Zheng. "Automated Extraction and Retrieval of Metadata by Data Mining : a Case Study of Mining Engine for National Land Survey Sweden". Thesis, University of Gävle, Department of Technology and Built Environment, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-6811.

Full text
Abstract:

Metadata is the important information describing geographical data resources and their key elements. It is used to guarantee the availability and accessibility of the data. ISO 19115 is a metadata standard for geographical information, making the geographical metadata shareable, retrievable, and understandable at the global level. In order to cope with the massive, high-dimensional and high-diversity nature of geographical data, data mining is an applicable method to discover the metadata.

This thesis develops and evaluates an automated mining method for extracting metadata from the data environment on the Local Area Network at the National Land Survey of Sweden (NLS). These metadata are prepared and provided across Europe according to the metadata implementing rules for the Infrastructure for Spatial Information in Europe (INSPIRE). The metadata elements are defined according to the numerical formats of four different data entities: document data, time-series data, webpage data, and spatial data. To evaluate the method for further improvement, a set of attributes and the corresponding metadata of geographical data files are extracted automatically as metadata records in testing and stored in a database. Based on the extracted metadata schema, a retrieval function is used to find the files that contain the keywords entered by the user. Overall, the average success rate of metadata extraction and retrieval is 90.0%.

The mining engine is developed in the C# programming language on top of a SQL Server 2005 database. Lucene.net is also integrated with Visual Studio 2005 to build an indexing framework for extracting and accessing the metadata in the database.
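As a rough illustration of the extract-index-retrieve cycle described in this abstract, here is a minimal Python sketch; the thesis itself implements the mining engine in C# on SQL Server 2005 with Lucene.net, so the directory scan, metadata fields and toy inverted index below are stand-ins, not the original design.

```python
# A minimal sketch: extract simple file metadata, index it, retrieve by keyword.
from pathlib import Path
from collections import defaultdict
from datetime import datetime, timezone

def extract_metadata(root: str):
    """Extract a simple metadata record for every file under `root`."""
    records = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            records.append({
                "name": path.name,
                "path": str(path),
                "size_bytes": stat.st_size,
                "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                "format": path.suffix.lstrip(".").lower(),
            })
    return records

def build_index(records):
    """Inverted index: keyword -> set of file paths (a stand-in for Lucene)."""
    index = defaultdict(set)
    for rec in records:
        for token in rec["name"].lower().replace("_", " ").replace("-", " ").split():
            index[token].add(rec["path"])
        index[rec["format"]].add(rec["path"])
    return index

def retrieve(index, keyword: str):
    """Return the files whose metadata contains the user's keyword."""
    return sorted(index.get(keyword.lower(), set()))

if __name__ == "__main__":
    recs = extract_metadata(".")   # scan the current directory
    idx = build_index(recs)
    print(retrieve(idx, "pdf"))    # e.g., all files indexed under 'pdf'
```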

APA, Harvard, Vancouver, ISO, and other styles
5

Al-Natsheh, Hussein. "Text Mining Approaches for Semantic Similarity Exploration and Metadata Enrichment of Scientific Digital Libraries". Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE2062.

Full text
Abstract:
For scientists and researchers, it is very critical to ensure knowledge is accessible for re-use and development. Moreover, the way we store and manage scientific articles and their metadata in digital libraries determines the number of relevant articles we can discover and access depending on what is actually meant in a search query. Yet, are we able to explore all semantically relevant scientific documents with the existing keyword-based search information retrieval systems? This is the primary question addressed in this thesis. Hence, the main purpose of our work is to broaden or expand the knowledge spectrum of researchers working in an interdisciplinary domain when they use the information retrieval systems of multidisciplinary digital libraries. However, the problem arises when such researchers use community-dependent search keywords while other scientific names given to relevant concepts are being used in a different research community. Towards proposing a solution to this semantic exploration task in multidisciplinary digital libraries, we applied several text mining approaches. First, we studied the semantic representation of words, sentences, paragraphs and documents for better semantic similarity estimation. In addition, we utilized the semantic information of words in lexical databases and knowledge graphs in order to enhance our semantic approach. Furthermore, the thesis presents a couple of use-case implementations of our proposed model.
APA, Harvard, Vancouver, ISO, and other styles
6

Petersson, Andreas. "Data mining file sharing metadata : A comparison between Random Forests Classificiation and Bayesian Networks". Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-11180.

Full text
Abstract:
In this comparative study based on experimentation it is demonstrated that the two evaluated machine learning techniques, Bayesian networks and random forests, have similar predictive power in the domain of classifying torrents on BitTorrent file sharing networks. This work was performed in two steps. First, a literature analysis was performed to gain insight into how the two techniques work and what types of attacks exist against BitTorrent file sharing networks. After the literature analysis, an experiment was performed to evaluate the accuracy of the two techniques. The results show no significant advantage of using one algorithm over the other when only considering accuracy. However, ease of use lies in Random forests’ favour because the technique requires little pre-processing of the data and still generates accurate results with few false positives.
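A small sketch of the kind of accuracy comparison the study describes is shown below: scikit-learn's GaussianNB stands in for a (very simple) Bayesian network, the data is synthetic rather than real BitTorrent metadata, and the feature set is invented, so the numbers it prints say nothing about the thesis's actual results.

```python
# A minimal sketch of comparing a random forest against a naive Bayes classifier
# on synthetic data standing in for torrent metadata features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for torrent metadata features (sizes, counts, etc.).
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6, random_state=0)

for name, clf in [("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("naive Bayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```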
APA, Harvard, Vancouver, ISO, and other styles
7

Petersson, Andreas. "Data mining file sharing metadata : A comparison between Random Forests Classification and Bayesian Networks". Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-11285.

Full text
Abstract:
In this comparative study based on experimentation it is demonstrated that the two evaluated machine learning techniques, Bayesian networks and random forests, have similar predictive power in the domain of classifying torrents on BitTorrent file sharing networks. This work was performed in two steps. First, a literature analysis was performed to gain insight into how the two techniques work and what types of attacks exist against BitTorrent file sharing networks. After the literature analysis, an experiment was performed to evaluate the accuracy of the two techniques. The results show no significant advantage of using one algorithm over the other when only considering accuracy. However, ease of use lies in Random forests’ favour because the technique requires little pre-processing of the data and still generates accurate results with few false positives.
APA, Harvard, Vancouver, ISO, and other styles
8

Ferrill, Paul. "REFERENCE DESIGN FOR A SQUADRON LEVEL DATA ARCHIVAL SYSTEM". International Foundation for Telemetering, 2006. http://hdl.handle.net/10150/604259.

Full text
Abstract:
ITC/USA 2006 Conference Proceedings / The Forty-Second Annual International Telemetering Conference and Technical Exhibition / October 23-26, 2006 / Town and Country Resort & Convention Center, San Diego, California
As more aircraft are fitted with solid state memory recording systems, the need for a large data archival storage system becomes increasingly important. In addition, there is a need to keep classified and unclassified data separate but available to the aircrews for training and debriefing along with some type of system for cataloging and searching for specific missions. This paper will present a novel approach along with a reference design for using commercially available hardware and software and a minimal amount of custom programming to help address these issues.
APA, Harvard, Vancouver, ISO, and other styles
9

Lockard, Michael T., R. Rajagopalan e James A. Garling. "MINING IRIG-106 CHAPTER 10 AND HDF-5 DATA". International Foundation for Telemetering, 2006. http://hdl.handle.net/10150/604264.

Full text
Abstract:
ITC/USA 2006 Conference Proceedings / The Forty-Second Annual International Telemetering Conference and Technical Exhibition / October 23-26, 2006 / Town and Country Resort & Convention Center, San Diego, California
Rapid access to ever-increasing amounts of test data is becoming a problem. The authors have developed a data-mining methodology to catalog test files, search metadata attributes to identify test data files of interest, and query test data measurements using a web-based engine that produces results in seconds. Generated graphs allow the user to visualize an overview of the entire test for a selected set of measurements, with areas highlighted where the query conditions were satisfied. The user can then zoom into areas of interest and export selected information.
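The catalog-search-query workflow sketched in the abstract can be outlined roughly as follows; the records, attribute names and measurement values are invented, and real IRIG-106 Chapter 10 or HDF-5 parsing is far more involved than this toy example.

```python
# A minimal sketch of cataloging test files, filtering by metadata attributes,
# and querying measurements against a threshold. All values are illustrative.
catalog = [
    {"file": "flight_001.ch10", "aircraft": "T-38", "date": "2006-05-01",
     "measurements": {"altitude_ft": [10000, 15200, 31000, 29500]}},
    {"file": "flight_002.ch10", "aircraft": "F-16", "date": "2006-05-03",
     "measurements": {"altitude_ft": [5000, 8000, 12000, 9000]}},
]

def search_metadata(catalog, **attrs):
    """Select test files whose metadata attributes match the query."""
    return [rec for rec in catalog
            if all(rec.get(k) == v for k, v in attrs.items())]

def query_measurement(rec, name, threshold):
    """Return sample indices where a measurement exceeds the threshold."""
    return [i for i, v in enumerate(rec["measurements"][name]) if v > threshold]

for rec in search_metadata(catalog, aircraft="T-38"):
    hits = query_measurement(rec, "altitude_ft", 30000)
    print(rec["file"], "samples above 30,000 ft:", hits)
```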
APA, Harvard, Vancouver, ISO, and other styles
10

Srinivasan, Uma (Computer Science & Engineering, Faculty of Engineering, UNSW). "A FRAMEWORK FOR CONCEPTUAL INTEGRATION OF HETEROGENEOUS DATABASES". Awarded by: University of New South Wales. School of Computer Science and Engineering, 1997. http://handle.unsw.edu.au/1959.4/33463.

Full text
Abstract:
Autonomy of operations combined with decentralised management of data has given rise to a number of heterogeneous databases or information systems within an enterprise. These systems are often incompatible in structure as well as content and hence difficult to integrate. This thesis investigates the problem of heterogeneous database integration, in order to meet the increasing demand for obtaining meaningful information from multiple databases without disturbing local autonomy. In spite of heterogeneity, the unity of overall purpose within a common application domain, nevertheless, provides a degree of semantic similarity which manifests itself in the form of similar data structures and common usage patterns of existing information systems. This work introduces a conceptual integration approach that exploits the similarity in meta level information in existing systems and performs metadata mining on database objects to discover a set of concepts common to heterogeneous databases within the same application domain. The conceptual integration approach proposed here utilises the background knowledge available in database structures and usage patterns and generates a set of concepts that serve as a domain abstraction and provide a conceptual layer above existing legacy systems. This conceptual layer is further utilised by an information re-engineering framework that customises and packages information to reflect the unique needs of different user groups within the application domain. The architecture of the information re-engineering framework is based on an object-oriented model that represents the discovered concepts as customised application objects for each distinct user group.
APA, Harvard, Vancouver, ISO, and other styles
11

Kamenieva, Iryna. "Research Ontology Data Models for Data and Metadata Exchange Repository". Thesis, Växjö University, School of Mathematics and Systems Engineering, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:vxu:diva-6351.

Full text
Abstract:

Research in data mining and machine learning requires the availability of a variety of input data sets, and researchers now build databases of such sets. Examples of such systems are the UCI Machine Learning Repository, the Data Envelopment Analysis Dataset Repository, the XMLData Repository and the Frequent Itemset Mining Dataset Repository. Alongside these statistical repositories, a whole range of storage solutions, from simple file stores to specialized repositories, can be used by researchers when solving applied tasks and investigating their own algorithms and scientific problems. At first sight, the only difficulty for the user would seem to be finding and understanding the structure of such scattered information stores. A closer study of these repositories, however, reveals deeper problems in the use of the data: in particular, the mismatch and rigidity of the data file structures with respect to SDMX (Statistical Data and Metadata Exchange), the standard and structure used by many European organizations; the impossibility of tailoring the data in advance to a concrete applied task; and the lack of a history of data usage for particular scientific and applied tasks.

There are now many data mining (DM) methods, as well as large quantities of data stored in various repositories. The repositories themselves, however, contain no DM methods, and the methods are not linked to application areas. An essential problem is linking the subject (problem) domain, the DM methods, and the datasets appropriate for a given method. This work therefore considers the problem of building ontological models of DM methods, describing the interaction between the methods and the corresponding data from the repositories, and designing intelligent agents that allow the user of a statistical repository to choose the method and data appropriate to the task at hand. The structure of such a system is proposed, and an intelligent search agent over the ontological model of DM methods, taking the user's personal queries into account, is implemented.

For the implementation of an intelligent data and metadata exchange repository, an agent-oriented approach was selected. The model uses a service-oriented architecture, built with the cross-platform programming language Java, the multi-agent platform Jadex, the database server Oracle Spatial 10g, and the ontology development environment Protégé 3.4.

APA, Harvard, Vancouver, ISO, and other styles
12

Šmerda, Vojtěch. "Grafický editor metadat pro OLAP". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2008. http://www.nusl.cz/ntk/nusl-235893.

Full text
Abstract:
This thesis describes OLAP and data mining technologies and the methods by which they communicate with users through dynamic tables. Key theoretical and technical information is also included. The next part focuses on the particular implementation of dynamic tables used in the Vema portal solution. The final parts cover the analysis and implementation of the metadata editor, which enables metadata to be designed effectively.
APA, Harvard, Vancouver, ISO, and other styles
13

Alserafi, Ayman. "Dataset proximity mining for supporting schema matching and data lake governance". Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/671540.

Full text
Abstract:
With the huge growth in the amount of data generated by information systems, it is common practice today to store datasets in their raw formats (i.e., without any data preprocessing or transformations) in large-scale data repositories called Data Lakes (DLs). Such repositories store datasets from heterogeneous subject-areas (covering many business topics) and with many different schemata. Therefore, it is a challenge for data scientists using the DL for data analysis to find relevant datasets for their analysis tasks without any support or data governance. The goal is to be able to extract metadata and information about datasets stored in the DL to support the data scientist in finding relevant sources. This shapes the main goal of this thesis, where we explore different techniques of data profiling, holistic schema matching and analysis recommendation to support the data scientist. We propose a novel framework based on supervised machine learning to automatically extract metadata describing datasets, including computation of their similarities and data overlaps using holistic schema matching techniques. We use the extracted relationships between datasets in automatically categorizing them to support the data scientist in finding relevant datasets with intersection between their data. This is done via a novel metadata-driven technique called proximity mining which consumes the extracted metadata via automated data mining algorithms in order to detect related datasets and to propose relevant categories for them. We focus on flat (tabular) datasets organised as rows of data instances and columns of attributes describing the instances. Our proposed framework uses the following four main techniques: (1) Instance-based schema matching for detecting relevant data items between heterogeneous datasets, (2) Dataset level metadata extraction and proximity mining for detecting related datasets, (3) Attribute level metadata extraction and proximity mining for detecting related datasets, and finally, (4) Automatic dataset categorization via supervised k-Nearest-Neighbour (kNN) techniques. We implement our proposed algorithms via a prototype that shows the feasibility of this framework. We apply the prototype in an experiment on a real-world DL scenario to prove the feasibility, effectiveness and efficiency of our approach, whereby we were able to achieve high recall rates and efficiency gains while improving the computational space and time consumption by two orders of magnitude via our proposed early-pruning and pre-filtering techniques in comparison to classical instance-based schema matching techniques. This proves the effectiveness of our proposed automatic methods in the early-pruning and pre-filtering tasks for holistic schema matching and the automatic dataset categorisation, while also demonstrating improvements over human-based data analysis for the same tasks.
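One ingredient of the framework, dataset-level metadata profiles fed to a supervised k-Nearest-Neighbour categorizer, can be sketched as below. The profile features, category labels and training data are hypothetical, and the thesis's actual proximity-mining features and matching pipeline are richer than this.

```python
# A minimal sketch of kNN categorization of datasets from dataset-level metadata.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def profile(dataset: np.ndarray) -> list:
    """Dataset-level metadata: row count, column count, share of 'wide' numeric values."""
    rows, cols = dataset.shape
    return [rows, cols, float(np.mean(np.abs(dataset) > 1.0))]

# Hypothetical training data: profiles of previously categorized datasets.
train_profiles = np.array([[1000, 5, 0.10], [1200, 6, 0.15], [50, 40, 0.90], [60, 35, 0.85]])
train_labels = ["transactions", "transactions", "sensor", "sensor"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_profiles, train_labels)

new_dataset = np.random.default_rng(0).normal(size=(900, 5))
print("suggested category:", knn.predict([profile(new_dataset)])[0])
```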
APA, Harvard, Vancouver, ISO, and other styles
14

Savalli, Antonino. "Tecniche analitiche per “Open Data”". Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17476/.

Full text
Abstract:
The last decade has made the concept of Open Government extremely popular: an open model of administration founded on the principles of transparency, participation and collaboration. In 2011 the project Dati.gov.it was launched, a portal that serves as the "national catalogue of the metadata on data released in open format by Italian public administrations". The goal of this thesis is to provide an effective tool to search, use and compare the information available on the Dati.gov.it portal, identifying similarities between datasets that can resolve and/or limit the heterogeneity of the data. The project covers three main areas of study: Open Data and Metadata standards, Record Linkage, and Data Fusion. Specifically, seven functions were implemented in a single library. The search function queries the dati.gov.it portal. The ext function extracts information from seven source formats: csv, json, xml, xls, rdf, pdf and txt. The pre-process function performs Data Cleaning. The find_claims function is the core of the project, as it contains the Text Mining algorithm that establishes a relation between datasets by identifying shared words that carry sufficient weight within the context. The header_linkage function finds similarity between the attribute names of two datasets and suggests which attributes to concatenate. Analogously, record_linkage finds similarity between the attribute values of two datasets and suggests which attributes to concatenate. Finally, the merge_keys function merges the results of header_linkage and record_linkage. The experimental results gave positive feedback on the behaviour of the main methods implemented for the syntactic similarity between two datasets.
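In the spirit of the header_linkage function described above, a minimal sketch might compare the attribute names of two datasets and suggest candidate columns; the similarity measure and the example headers here are assumptions, not the thesis's actual implementation.

```python
# A minimal sketch of header-name linkage between two datasets.
from difflib import SequenceMatcher

def header_linkage(headers_a, headers_b, threshold=0.6):
    """Return pairs of attribute names whose string similarity exceeds the threshold."""
    suggestions = []
    for a in headers_a:
        for b in headers_b:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                suggestions.append((a, b, round(score, 2)))
    return sorted(suggestions, key=lambda t: -t[2])

# Hypothetical column headers from two open-data CSV files.
print(header_linkage(["comune", "anno", "popolazione_residente"],
                     ["Comune", "Anno riferimento", "popolazione"]))
```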
APA, Harvard, Vancouver, ISO, and other styles
15

Bauckmann, Jana, Ziawasch Abedjan, Ulf Leser, Heiko Müller and Felix Naumann. "Covering or complete? : Discovering conditional inclusion dependencies". Universität Potsdam, 2012. http://opus.kobv.de/ubp/volltexte/2012/6208/.

Full text
Abstract:
Data dependencies, or integrity constraints, are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. In the last years conditional dependencies have been introduced to analyze and improve data quality. In short, a conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (CINDs). We generalize the definition of CINDs, distinguishing covering and completeness conditions. We present a new use case for such CINDs showing their value for solving complex data quality tasks. Further, we define quality measures for conditions inspired by precision and recall. We propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
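The covering/completeness intuition can be illustrated on toy data: given an inclusion dependency R[A] ⊆ S[B], a condition over another attribute of R selects a subset of tuples, and its quality can be scored with precision- and recall-like measures. The relations and the condition below are invented, and the paper's algorithms additionally choose the condition attributes automatically.

```python
# A toy illustration of precision/recall-like quality measures for a condition
# attached to an inclusion dependency R[A] ⊆ S[B]. Data is invented.
R = [  # (A value, condition attribute value)
    ("p1", "book"), ("p2", "book"), ("p3", "book"),
    ("p4", "dvd"), ("p9", "dvd"),
]
S_B = {"p1", "p2", "p3", "p4"}          # values of attribute B in relation S

def condition_quality(R, S_B, cond_value):
    selected = [a for a, c in R if c == cond_value]          # tuples the condition selects
    included = [a for a, _ in R if a in S_B]                 # tuples satisfying the IND
    tp = [a for a in selected if a in S_B]
    precision = len(tp) / len(selected) if selected else 0.0  # "covering" quality
    recall = len(tp) / len(included) if included else 0.0     # "completeness" quality
    return precision, recall

print(condition_quality(R, S_B, "book"))   # (1.0, 0.75): covering but not complete
```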
APA, Harvard, Vancouver, ISO, and other styles
16

Alves, Luiz Gustavo Pacola. "CollaboraTVware: uma infra-estrutura ciente de contexto para suporte a participação colaborativa no cenário da TV Digital Interativa". Universidade de São Paulo, 2008. http://www.teses.usp.br/teses/disponiveis/3/3141/tde-20072009-114734/.

Full text
Abstract:
The advent of Interactive Digital TV around the world is ultimately transforming the user experience of watching TV, making it richer mainly by enabling user interactivity. Users become pro-active and begin to interact in very different ways: building virtual communities, discussing content, sending messages and recommendations, etc. In this scenario, collaborative user participation assumes an important and essential role. Additionally, reception in Interactive Digital TV is done by devices that, due to digital convergence, are increasingly present in ubiquitous environments. Another preponderant issue to consider, resulting from this medium, is the growing number and diversity of programs and interactive services available, which increases the difficulty of selecting relevant content. Thus, the main objective of this work is to propose and implement a software infrastructure in an Interactive Digital Television environment, entitled CollaboraTVware, to guide users in a transparent way in the choice of programs and interactive services through the collaborative participation of other users with similar profiles and contexts. In the scope of this work, collaborative participation corresponds to the ratings given by users in order to express opinions about the content transmitted. The modeling of the user, the device used and the context of user interaction, essential for the development of CollaboraTVware, is represented by granular metadata standards used in the field of Interactive Digital TV (MPEG-7, MPEG-21 and TV-Anytime), and their necessary extensions. The CollaboraTVware architecture is composed of two subsystems: user device and service provider. The classification task, from the theory of data mining, is the approach adopted in the design of the infrastructure. The concept of participative usage profile is presented and discussed. To demonstrate the functionalities in a use scenario, an application (a collaborative EPG) was developed as a case study using CollaboraTVware.
APA, Harvard, Vancouver, ISO, and other styles
17

Raad, Elie. "Découverte des relations dans les réseaux sociaux". PhD thesis, Université de Bourgogne, 2011. http://tel.archives-ouvertes.fr/tel-00702269.

Full text
Abstract:
Social networks occupy an increasingly important place in our daily lives and account for a considerable share of activity on the web. This success is explained by the diversity of services/features each site offers (sharing of often multimedia data, tagging, blogging, contact suggestions, etc.), which encourages users to register on different sites and thus create several social networks for various reasons (professional, private, etc.). However, existing tools and sites offer limited features for identifying and organizing relationship types, which makes it impossible, among other things, to guarantee user privacy and to provide finer-grained data sharing. In particular, no current site offers a solution for automatically identifying relationship types while taking into account all personal and/or published data. In this study, we propose a new approach for identifying relationship types across one or several social networks. Our approach is based on a user-oriented framework that exploits several user profile attributes (name, age, address, photos, etc.). To this end, we use rules that apply at two levels of granularity: 1) within a single social network, to determine social relationships (colleagues, relatives, friends, etc.), mainly by exploiting the characteristics of photos and their metadata, and 2) across different social networks, to determine co-referent users (the same person on several social networks), taking all profile attributes into account and weighting them according to the user's profile and the content of the social network. At each level of granularity, we apply base rules and derived rules to identify different relationship types. We put forward two distinct methodologies for generating the base rules. For social relationships, the base rules are created from a photo dataset built using crowdsourcing. For co-referent relationships, using all attributes, the base rules are generated from pairs of profiles whose identifiers have the same values. For the derived rules, we use a data mining technique that takes each user's context into account by identifying the most frequently used base rules. We present our prototype, called RelTypeFinder, which we implemented to validate our approach. This prototype makes it possible to discover different relationship types, generate synthetic datasets, collect data from the web, and generate extraction rules. We describe the experiments conducted on real and synthetic datasets. The results show the effectiveness of our approach in discovering relationship types.
APA, Harvard, Vancouver, ISO, and other styles
18

Moreux, Jean-Philippe, and Guillaume Chiron. "Image Retrieval in Digital Libraries: A Large Scale Multicollection Experimentation of Machine Learning techniques". Sächsische Landesbibliothek - Staats- und Universitätsbibliothek Dresden, 2017. https://slub.qucosa.de/id/qucosa%3A16444.

Full text
Abstract:
While historically digital heritage libraries were first powered in image mode, they quickly took advantage of OCR technology to index printed collections and consequently improve the scope and performance of the information retrieval services offered to users. But access to iconographic resources has not progressed in the same way, and the latter remain in the shadows: incomplete and heterogeneous manual indexing, data silos by iconographic genre. Today, however, it would be possible to make better use of these resources, especially by exploiting the enormous volumes of OCR produced during the last two decades, and thus valorize these engravings, drawings, photographs, maps, etc. for their own value but also as an attractive entry point into the collections, supporting discovery and serendipity from document to document and from collection to collection. This article presents an ETL (extract-transform-load) approach to this need, which aims to: identify and extract iconography wherever it may be found, in image collections but also in printed materials (dailies, magazines, monographs); transform, harmonize and enrich the image descriptive metadata (in particular with machine learning classification tools); and load it all into a web app dedicated to image retrieval. The approach is pragmatically dual, since it involves leveraging existing digital resources and (virtually) off-the-shelf technologies.
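A minimal sketch of the extract-transform-load pattern the article describes might look like the following; the illustration records, the stub genre classifier and the in-memory "index" are assumptions, not the authors' actual pipeline.

```python
# A minimal ETL sketch: extract illustration records, enrich their metadata,
# load them into a stand-in retrieval index. All data is invented.
def extract():
    """Yield raw illustration records as they might come from OCR or image collections."""
    yield {"source": "daily_1905_03_12", "caption": "Vue du port de Marseille", "width": 640, "height": 480}
    yield {"source": "monograph_botany", "caption": "Planche III: Rosa canina", "width": 300, "height": 500}

def classify_genre(record):
    """Stand-in for a machine-learning classifier enriching the metadata."""
    return "engraving" if "Planche" in record["caption"] else "photograph"

def transform(records):
    for rec in records:
        rec["genre"] = classify_genre(rec)
        rec["orientation"] = "landscape" if rec["width"] >= rec["height"] else "portrait"
        yield rec

def load(records, store):
    store.extend(records)

image_index = []                 # stand-in for the image-retrieval web app's index
load(transform(extract()), image_index)
print(image_index)
```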
APA, Harvard, Vancouver, ISO, and other styles
19

Engvall, Tove. "Nyckeln till arkiven : En kritisk diskursanalytisk studie om interoperabilitet och kollektivt minne". Thesis, Mittuniversitetet, Avdelningen för arkiv- och datavetenskap, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-21171.

Full text
Abstract:
In the democratic process of discussion and decision making, there is a need for reliable and authentic information. Archives are authentic and reliable information and also provide long-term accessibility. But the public archives' potential isn't utilized enough at a societal level. The public archives are organized in a decentralised manner, and there is no common access point at a national level. In the thesis this issue of accessibility and use at a societal level is discussed in terms of collective memory. In a digital environment, these organizational limits could probably be overcome, but there is a need for new goals, perspectives and frameworks for the management of public archives. In an e-government context, interoperability is often mentioned in the discussion of accessibility. Interoperability can be understood as the ability of diverse organizations to interact towards common goals, and it includes technological, semantic, organizational, legal and political aspects. The hypothesis-like assumption of the thesis is that interoperability could contribute to making the archives a more significant part of the collective memory. The thesis uses a case study methodology and a critical discourse textual analysis. The Records Continuum Model and the archival perspective on collective memory, particularly Jimerson's distinction of collective memory from other types of memory, are used as a theoretical frame for the analysis. The case is the project e-archives and e-diarium, a Swedish E-delegation project driven by the National Archives. Central documents from the project are analysed, as are important documents for the work on e-government, for contextual understanding. The results indicate that interoperability may contribute to making the archives a more significant part of the collective memory, both practically and discursively. Practically, it provides conditions to share information and removes barriers to interaction. Discursively, it contributes to an overall perspective of public administration and switches the view from each single organization to the citizens and society as a whole, where the information is seen as a common societal resource. Interoperability is also an important factor in the development of a common information architecture for the whole public administration. This change in perspective could, when the archives are included in e-government, shift the focus from the archives' creators to the end users and society at large, and put more effort into making the archives accessible in the collective dimension.
APA, Harvard, Vancouver, ISO, and other styles
20

Langenberg, Tristan Matthias [Verfasser], Florentin [Akademischer Betreuer] Wörgötter, Florentin [Gutachter] Wörgötter, Carsten [Gutachter] Damm, Wolfgang [Gutachter] May, Jens [Gutachter] Grabowski, Stephan [Gutachter] Waack and Minija [Gutachter] Tamosiunaite. "Deep Learning Metadata Fusion for Traffic Light to Lane Assignment / Tristan Matthias Langenberg ; Gutachter: Florentin Wörgötter, Carsten Damm, Wolfgang May, Jens Grabowski, Stephan Waack, Minija Tamosiunaite ; Betreuer: Florentin Wörgötter". Göttingen : Niedersächsische Staats- und Universitätsbibliothek Göttingen, 2019. http://d-nb.info/1191989100/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Γιαννακούδη, Θεοδούλα. "Προηγμένες τεχνικές και αλγόριθμοι εξόρυξης γνώσης για την προσωποποίηση της πρόσβασης σε δικτυακούς τόπους". 2005. http://nemertes.lis.upatras.gr/jspui/handle/10889/151.

Full text
Abstract:
Web personalization is a domain which has gained great momentum not only in the research area, where many research units have addressed the problem from different perspectives, but also in the industrial area, where a variety of modules for the personalization process is available. The objective is to explore the information hidden in the web server log files in order to discover the interactions between web site visitors and web site pages. This information can be further exploited for web site optimization, ensuring more effective navigation for the user and, in the industrial case, client retention. A primary step before personalization is web usage mining, where the knowledge hidden in the log files is revealed. Web usage mining is the procedure in which the information stored in the web server logs is processed by applying statistical and data mining techniques, such as clustering, association rule discovery, classification, and sequential pattern discovery, in order to reveal useful patterns that can be further analyzed. Recently, there has been an effort to incorporate web content into the web usage mining process in order to enhance the effectiveness of personalization. The interest in this thesis is focused on the domain of knowledge mining for the usage of web sites and on how this procedure can benefit from the attributes of the semantic web. Initially, techniques and algorithms that have recently been proposed for web usage mining are presented. Then the role of content in this process is introduced and two relevant works are presented: a usage mining technique based on the PLSA model, which may integrate attributes of the site content, and a personalization system which uses the site content in order to enhance a recommendation engine. After the usage mining domain is analysed theoretically, a new system is proposed, ORGAN, which is named after Ontology-oRiented usaGe ANalysis. ORGAN concerns the stage of log file analysis and the mining of knowledge about web site usage based on the semantic attributes of the web site. These semantic attributes have been derived from the web site pages by applying data mining techniques and have been annotated with an OWL ontology. ORGAN provides an interface for submitting queries concerning the level of visitation and the semantics of the web site pages, exploiting the knowledge about the site as it is represented in the ontology. The design, the development and the experimental evaluation of the system are described in detail.
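As a rough illustration of the first step such systems share, the sketch below parses a few invented web server log lines (Common Log Format) and counts page visits; it is not the ORGAN system, which further links pages to semantic attributes annotated in an OWL ontology.

```python
# A minimal sketch of extracting page-visit counts from web server logs.
import re
from collections import Counter

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+'
)

log_lines = [
    '10.0.0.1 - - [10/Oct/2005:13:55:36 +0200] "GET /courses/db.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2005:13:56:01 +0200] "GET /courses/ai.html HTTP/1.1" 200 1042',
    '10.0.0.1 - - [10/Oct/2005:13:57:12 +0200] "GET /courses/db.html HTTP/1.1" 200 2326',
]

visits = Counter()
for line in log_lines:
    m = LOG_PATTERN.match(line)
    if m and m.group("status") == "200":
        visits[m.group("url")] += 1

for url, count in visits.most_common():
    print(f"{url}: {count} visits")
```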
APA, Harvard, Vancouver, ISO, and other styles
22

Schöneberg, Hendrik. "Semiautomatische Metadaten-Extraktion und Qualitätsmanagement in Workflow-Systemen zur Digitalisierung historischer Dokumente". Doctoral thesis, 2014. https://nbn-resolving.org/urn:nbn:de:bvb:20-opus-104878.

Full text
Abstract:
Performing Named Entity Recognition on ancient documents is a time-consuming, complex and error-prone manual task. It is a prerequisite, though, to being able to identify related documents and correlate named entities across distinct sources, helping to precisely recreate historic events. In order to reduce the manual effort, automated classification approaches can be leveraged. Classifying terms in ancient documents in an automated manner poses a difficult task due to the sources' challenging syntax and poor conservation states. This thesis introduces and evaluates approaches that can cope with complex syntactic environments by using statistical information derived from a term's context and combining it with domain-specific heuristic knowledge to perform a classification. Furthermore, this thesis demonstrates how metadata generated by these approaches can be used as error heuristics to greatly improve the performance of workflow systems for the digitization of early documents.
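The combination of contextual statistics with domain heuristics can be sketched roughly as follows; the tiny context-statistics table, the honorific heuristic and the example terms are invented for illustration and are not the thesis's actual classifier.

```python
# A minimal sketch: label a term by combining context-word statistics from a
# labelled sample with a simple domain heuristic.
from collections import Counter

# Context words counted next to known person / place names in a labelled sample.
context_stats = {
    "person": Counter({"herr": 5, "sohn": 3, "meister": 2}),
    "place":  Counter({"stadt": 4, "bei": 3, "dorf": 2}),
}

def classify(term, left_context):
    scores = {}
    for label, stats in context_stats.items():
        scores[label] = sum(stats[w.lower()] for w in left_context)
        # Heuristic: an honorific directly before the term strongly suggests a person.
        if label == "person" and left_context and left_context[-1].lower() in {"herr", "frau"}:
            scores[label] += 10
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(classify("Arnstein", ["die", "stadt"]))   # -> place
print(classify("Weygandt", ["der", "herr"]))    # -> person
```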
APA, Harvard, Vancouver, ISO, and other styles