Scientific literature on the topic "Inter-tagger agreement"

Create a correct reference in APA, MLA, Chicago, Harvard, and several other styles

Select a source:

Consult the thematic lists of journal articles, books, theses, conference proceedings, and other scholarly sources on the topic "Inter-tagger agreement."

Next to every source in the list of references there is an "Add to bibliography" button. Click this button, and we will automatically generate the bibliographic reference for the chosen source in your preferred citation style: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever this information is included in the metadata.

Journal articles on the topic "Inter-tagger agreement"

1

Mohamed Hanum, Haslizatul, Nur Atiqah Sia Abdullah, and Zainab Abu Bakar. "Instruction Task for Malay Phrase Boundary Annotation." International Journal of Engineering & Technology 7, no. 3.15 (13 August 2018): 137. http://dx.doi.org/10.14419/ijet.v7i3.15.17517.

Full text
Abstract:
The paper presents a refined instruction task to assist the evaluation of prosodic phrase (PPh) boundaries by naive listeners. The results from the perceptual experiments were compared to the boundaries produced by an online automatic tagger. The Kappa evaluation shows an average inter-rater agreement of 85%. More than 60% of the boundaries detected by the automatic tagger matched the reference boundaries, showing that the refined instruction task can be used to evaluate the perception of phrase boundaries in continuous speech.
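Inter-rater agreement figures like the one above are conventionally measured with Cohen's kappa, which discounts the agreement two annotators would reach by chance. A minimal sketch of the computation, with invented boundary judgments ("B" = boundary, "N" = no boundary) purely for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical per-position boundary judgments from two listeners
rater1 = ["B", "N", "B", "B", "N", "N", "B", "N"]
rater2 = ["B", "N", "N", "B", "N", "N", "B", "B"]
print(cohens_kappa(rater1, rater2))  # 0.5
```

Here raw agreement is 75%, but kappa drops to 0.5 once the 50% chance agreement on two balanced labels is factored out, which is why kappa is preferred over percent agreement.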
2

NAVIGLI, ROBERTO. "A structural approach to the automatic adjudication of word sense disagreements." Natural Language Engineering 14, no. 4 (October 2008): 547–73. http://dx.doi.org/10.1017/s1351324908004749.

Full text
Abstract:
The semantic annotation of texts with senses from a computational lexicon is a complex and often subjective task. As a matter of fact, the fine granularity of the WordNet sense inventory [Fellbaum, Christiane (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press], a de facto standard within the research community, is one of the main causes of a low inter-tagger agreement, ranging between 70% and 80%, and of the disappointing performance of automated fine-grained disambiguation systems (around 65% state of the art in the Senseval-3 English all-words task). In order to improve the performance of both manual and automated sense taggers, either we change the sense inventory (e.g. by adopting a new dictionary or clustering WordNet senses) or we aim at resolving the disagreements between annotators by dealing with the fineness of sense distinctions. The former approach is not viable in the short term, as wide-coverage resources are not publicly available and no large-scale reliable clustering of WordNet senses has been released to date. The latter approach requires the ability to distinguish between subtle or misleading sense distinctions. In this paper, we propose the use of structural semantic interconnections, a specific kind of lexical chains, for the adjudication of disagreed sense assignments to words in context. The approach relies on the exploitation of the lexicon structure as a support to smooth possible divergences between sense annotators and foster coherent choices. We perform a twofold experimental evaluation of the approach, applied to manual annotations from the SemCor corpus and automatic annotations from the Senseval-3 English all-words competition.
Both sets of experiments and results are entirely novel: structural adjudication improves the state-of-the-art performance in all-words disambiguation by 3.3 points (achieving a 68.5% F1-score) and attains figures of around 80% precision and 60% recall in the adjudication of disagreements between human annotators.
3

Mundotiya, Rajesh Kumar, Manish Kumar Singh, Rahul Kapur, Swasti Mishra, and Anil Kumar Singh. "Linguistic Resources for Bhojpuri, Magahi, and Maithili: Statistics about Them, Their Similarity Estimates, and Baselines for Three Applications." ACM Transactions on Asian and Low-Resource Language Information Processing 20, no. 6 (30 November 2021): 1–37. http://dx.doi.org/10.1145/3458250.

Full text
Abstract:
Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and morpheme levels. These corpora were also annotated with parts-of-speech (POS) and chunk tags. The basic statistical measures were both absolute and relative and were expected to indicate linguistic properties, such as morphological, lexical, phonological, and syntactic complexities (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases it turned out that using the full corpus was better, even if sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively. 
The inter-annotator agreement for these annotations, using Cohen’s Kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These (annotated) corpora have been used for developing preliminary automated tools, which include POS tagger, Chunker, and Language Identifier. We have also developed the Bilingual dictionary (Purvanchal languages to Hindi) and a Synset (that can be integrated later in the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources for facilitating further language processing research for these languages, providing some quantitative measures about them and their similarities among themselves and with Hindi. For similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is providing baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
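The paper's language-similarity measure builds on an n-gram-based language identification algorithm. As a generic illustration of that family of techniques (not the authors' actual algorithm), a sketch that profiles each language by its character trigram frequencies and identifies a sample by cosine similarity; the toy corpora below are invented:

```python
def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = f" {text.strip()} "  # pad so word edges become n-grams too
    grams = {}
    for i in range(len(text) - n + 1):
        g = text[i:i + n]
        grams[g] = grams.get(g, 0) + 1
    return grams

def cosine_similarity(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q.get(g, 0) for g in p)
    norm = lambda v: sum(c * c for c in v.values()) ** 0.5
    return dot / (norm(p) * norm(q))

def identify(sample, profiles):
    """Return the language whose profile best matches the sample."""
    sp = char_ngrams(sample)
    return max(profiles, key=lambda lang: cosine_similarity(sp, profiles[lang]))

# Toy training corpora (invented) for two languages
profiles = {
    "en": char_ngrams("the cat sat on the mat with the hat"),
    "fr": char_ngrams("le chat est sur le tapis avec le chapeau"),
}
print(identify("the hat on the mat", profiles))
```

The same profile-similarity scores that drive identification can be read directly as a crude similarity estimate between closely related languages, which is the use the paper makes of such an algorithm.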
4

Althobaiti, Maha J. "Creation of annotated country-level dialectal Arabic resources: An unsupervised approach." Natural Language Engineering, 9 August 2021, 1–42. http://dx.doi.org/10.1017/s135132492100019x.

Full text
Abstract:
The wide usage of multiple spoken Arabic dialects on social networking sites stimulates increasing interest in Natural Language Processing (NLP) for dialectal Arabic (DA). Arabic dialects represent true linguistic diversity and differ from Modern Standard Arabic (MSA). In fact, the complexity and variety of these dialects make it insufficient to build one NLP system suitable for all of them. In comparison with MSA, the available datasets for the various dialects are generally limited in size, genre, and scope. In this article, we present a novel approach that automatically develops an annotated country-level dialectal Arabic corpus and builds lists of words covering 15 Arabic dialects. The algorithm uses an iterative procedure consisting of two main components: automatic creation of lists of dialectal words and automatic creation of an annotated Arabic dialect identification corpus. To our knowledge, our study is the first of its kind to examine and analyse the poor performance of an MSA part-of-speech tagger on dialectal Arabic content and to exploit that in order to extract dialectal words. The pointwise mutual information association measure and the geographical frequency of word occurrence online are used to classify dialectal words. The annotated dialectal Arabic corpus (Twt15DA), built using our algorithm, was collected from Twitter and consists of 311,785 tweets containing 3,858,459 words in total. We randomly selected a sample of 75 tweets per country, 1,125 tweets in total, and conducted a manual dialect identification task with native speakers. The results show an average inter-annotator agreement score of 64%, which reflects satisfactory agreement considering the overlapping features of the 15 Arabic dialects.
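The pointwise mutual information (PMI) measure mentioned in the abstract is a standard association score: the log ratio of how often a word and a dialect region actually co-occur to how often they would co-occur if independent. A minimal sketch with invented counts (the numbers are illustrative only, not from the paper):

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information, in bits, between events x and y.

    count_xy : co-occurrences of x and y
    count_x, count_y : marginal counts
    total : total number of observations
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: a candidate word appears in 40 of 1,000 tweets,
# 200 tweets come from the target region, 30 tweets have both.
print(round(pmi(30, 40, 200, 1000), 3))  # 1.907
```

A positive PMI (observed co-occurrence above the independence baseline, here 3.75x) suggests the word is associated with that region's dialect; values near zero suggest no association.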
5

Nguyen, Nhung, Roselyn Gabud, and Sophia Ananiadou. "COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature." Biodiversity Data Journal 7 (22 January 2019). http://dx.doi.org/10.3897/bdj.7.e29626.

Full text
Abstract:
Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.
Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.
Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes.
The corpus can be further used for relation extraction to locate species occurrences in the literature, a useful task for monitoring species distribution and preserving biodiversity.
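For span-based annotation tasks like the NER above, agreement is often reported as a pairwise F-score: one annotator's entities are treated as the reference and the other's as predictions, matched exactly on boundaries and label. A minimal sketch, with hypothetical spans in the abstract's entity categories:

```python
def f1_agreement(spans_a, spans_b):
    """Exact-match F1 between two annotators' entity sets.

    Each span is a (start, end, label) tuple; annotator A is
    arbitrarily treated as the reference.
    """
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # both found nothing: perfect agreement
    tp = len(a & b)  # spans identical in boundaries and label
    precision = tp / len(b) if b else 0.0
    recall = tp / len(a) if a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: the two annotators agree on two spans
# but assign different labels to the middle one.
ann1 = [(0, 5, "TAXON"), (10, 16, "HABITAT"), (20, 26, "PERSON")]
ann2 = [(0, 5, "TAXON"), (10, 16, "GEO"), (20, 26, "PERSON")]
print(round(f1_agreement(ann1, ann2), 3))  # 0.667
```

Unlike kappa over a fixed set of items, this formulation handles the open-ended case where annotators may mark different numbers of spans, which is why F-score is the usual agreement statistic for entity annotation.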

Dissertations / Theses on the topic "Inter-tagger agreement"

1

GAGLIARDI, GLORIA. "Validazione dell'ontologia dell'azione IMAGACT per lo studio e la diagnosi del Mild Cognitive Impairment (MCI)." Doctoral thesis, 2014. http://hdl.handle.net/2158/863903.

Full text
Abstract:
The volume is structured in two parts: the first, on semantics and computational linguistics, deals with the IMAGACT interlinguistic ontology of action (chapters 1, 2 and 3); the second is devoted to applications of the resource (chapters 4 and 5). After a brief preamble, chapter 1 presents the IMAGACT project, describing its methodological principles and overall workflow. The project produced an interlinguistic ontology that makes explicit the spectrum of pragmatic variation associated with medium- and high-frequency action predicates in Italian and English. The action classes that constitute the reference entities of the linguistic concepts, induced from spoken corpora by native-speaker linguists, are represented in this lexical resource in the form of prototypical scenes (Rosch, 1978, 1999). The methodology exploits the ability of the user/learner to find similarities between different images independently of language, replacing the traditional semantic definition, often underdetermined and language-specific, with the recognition and identification of action types. Chapter 2 is devoted to the annotation procedure, which allowed the bottom-up extraction of the action classes that make up the ontology. It presents the source linguistic data (the IMAGACT-IT corpus) as well as the software interface that made the various tasks possible. In parallel, it presents the guidelines for standardizing the occurrences, assigning them to types, and annotating their linguistic properties (thematic structure, argument alternations, Aktionsart). Chapter 3 addresses the inter-/intra-linguistic mapping procedure: after describing its theoretical foundations and identifying the issues that the input linguistic data pose for the architecture of the data structure, it illustrates the methodology that led to the production of database 1.0.
Chapter 4 proposes a validation procedure for the linguistic data, an essential step for making them applicable in domains other than the one in which they were produced. It also provides a broad bibliographic overview of the concept of "inter-rater agreement", covering the critical aspects of data evaluation and the most widely used statistical coefficients. For context, it further presents the results of an extensive literature survey of the main studies and major evaluation campaigns in the field of lexical semantics. The method is tested on a semantically coherent subset of lemmas. Finally, chapter 5 presents the SMAAV (Semantic Memory Assessment on Action Verb) test battery, created by the author from the IMAGACT data validated in chapter 4. The grammatical class of the verb, which usually lexicalizes actions in natural languages, has traditionally been studied and used less than that of the noun in psychometrics. The materials made available by the ontology are entirely new in this field: systematic knowledge of the action verb lexicon makes it possible to use a sample of lemmas that is both broad and controlled. Moreover, unlike the static materials commonly used in medical practice, the multimedia stimuli provided by IMAGACT offer a better, more "ecological" representation of actions. Although the test can be used as a general measure of lexical access and of the erosion of semantic memory, a specific diagnostic use is proposed: the study of Mild Cognitive Impairment, a research area that has received growing attention in recent years.
After presenting the characteristics of the disorder and the testing materials available to clinicians and researchers, the process of constructing and calibrating the battery is described. The section ends with a linguistic and gestural analysis of the answers given by the informants during the test sessions. A brief conclusion summarizes the results obtained and indicates some possibilities for future development.
