Journal articles on the topic 'METRICA CORPOREA'

To see the other types of publications on this topic, follow the link: METRICA CORPOREA.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'METRICA CORPOREA.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Karamanis, Nikiforos, Chris Mellish, Massimo Poesio, and Jon Oberlander. "Evaluating Centering for Information Ordering Using Corpora." Computational Linguistics 35, no. 1 (March 2009): 29–46. http://dx.doi.org/10.1162/coli.07-036-r2-06-22.

Full text
Abstract:
In this article we discuss several metrics of coherence defined using centering theory and investigate the usefulness of such metrics for information ordering in automatic text generation. We estimate empirically which is the most promising metric and how useful this metric is, using a general methodology applied to several corpora. Our main result is that the simplest metric (which relies exclusively on NOCB transitions) sets a robust baseline that cannot be outperformed by other metrics which make use of additional centering-based features. This baseline can be used for the development of both text-to-text and concept-to-text generation systems.
2

Franco, Manuel, Juana María Vivo, Manuel Quesada-Martínez, Astrid Duque-Ramos, and Jesualdo Tomás Fernández-Breis. "Evaluation of ontology structural metrics based on public repository data." Briefings in Bioinformatics 21, no. 2 (February 4, 2019): 473–85. http://dx.doi.org/10.1093/bib/bbz009.

Full text
Abstract:
The development and application of biological ontologies have increased significantly in recent years. These ontologies can be retrieved from different repositories, which do not provide much information about quality aspects of the ontologies. In the past years, some ontology structural metrics have been proposed, but their validity as a measurement instrument has not been sufficiently studied to date. In this work, we evaluate a set of reproducible and objective ontology structural metrics. Given the lack of standard methods for this purpose, we have applied an evaluation method based on the stability and goodness of the classifications of ontologies produced by each metric on an ontology corpus. The evaluation has been done using ontology repositories as corpora. More concretely, we have used 119 ontologies from the OBO Foundry repository and 78 ontologies from AgroPortal. First, we study the correlations between the metrics. Second, we study whether the clusters for a given metric are stable and have a good structure. The results show that the existing correlations are not biasing the evaluation, there are no metrics generating unstable clusterings, and all the metrics evaluated provide at least a reasonable clustering structure. Furthermore, our work permits us to review and suggest the most reliable ontology structural metrics in terms of stability and goodness of their classifications. Availability: http://sele.inf.um.es/ontology-metrics
3

Bohas, Georges, and Djamel Eddine Kouloughli. "Towards a systematic corpus analysis of Arabic poetry." Linguistic Approaches to Poetry 15 (December 31, 2001): 103–12. http://dx.doi.org/10.1075/bjl.15.08boh.

Full text
Abstract:
Recent work on Arabic metrics aims at developing a coherent research programme which relies on the systematic analysis of electronic corpora. The computer program XALIYL performs, for any line of ancient Arabic poetry, an automatic recognition of the metre used. This operation takes place whatever the length of the verses, and regardless of whether they are encoded in ordinary Arabic script (with the addition of vowels) or by means of the TRS system, which relates functionally to ordinary Arabic script. XALIYL produces a textual database that contains the syllabic decomposition for each hemistich of each line, as well as its metrical analysis. It can cope not only with the general problems linked to re-syllabification and sandhi, but also with problems of syllabification specific to Arabic metrics. Errors due to the metrical scanning or to the editing of poems can be located automatically. Moreover, by allowing a computerised search for formulae, XALIYL provides significant information on the “formulaic systems” of ancient Arabic poetry.
4

Kashefi, Omid, Mohsen Sharifi, and Behrooz Minaie. "A novel string distance metric for ranking Persian respelling suggestions." Natural Language Engineering 19, no. 2 (July 24, 2012): 259–84. http://dx.doi.org/10.1017/s1351324912000186.

Full text
Abstract:
Spelling errors in digital documents are often caused by operational and cognitive mistakes, or by the lack of full knowledge about the language of the written documents. Computer-assisted solutions can help to detect and suggest replacements. In this paper, we present a new string distance metric for the Persian language to rank respelling suggestions of a misspelled Persian word by considering the effects of keyboard layout on typographical spelling errors as well as the homomorphic and homophonic aspects of words for orthographical misspellings. We also consider the misspellings caused by disregarded diacritics. Since the proposed string distance metric is custom-designed for the Persian language, we present the spelling aspects of the Persian language such as homomorphs, homophones, and diacritics. We then present our statistical analysis of a set of large Persian corpora to identify the causes and the types of Persian spelling errors. We show that the proposed string distance metric has a higher mean average precision and a higher mean reciprocal rank in ranking respelling candidates of Persian misspellings in comparison with other metrics such as the Hamming, Levenshtein, Damerau–Levenshtein, Wagner–Fischer, and Jaro–Winkler metrics.
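For orientation, the classical edit-distance baselines named above are well defined independently of this paper; here is a minimal Python sketch of the restricted Damerau–Levenshtein distance (insertions, deletions, substitutions, plus adjacent transpositions). The paper's own metric additionally weights errors by keyboard adjacency and Persian homomorph/homophone relations, which this sketch does not attempt.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[m][n]

print(damerau_levenshtein("ca", "ac"))  # 1: a single transposition
```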
5

Periñán-Pascual, Carlos. "The underpinnings of a composite measure for automatic term extraction." Terminology 21, no. 2 (December 30, 2015): 151–79. http://dx.doi.org/10.1075/term.21.2.02per.

Full text
Abstract:
The corpus-based identification of those lexical units which serve to describe a given specialized domain usually becomes a complex task, where an analysis oriented to the frequency of words and the likelihood of lexical associations is often ineffective. The goal of this article is to demonstrate that a user-adjustable composite metric such as SRC can accommodate the diversity of domain-specific glossaries to be constructed from small- and medium-sized specialized corpora of non-structured texts. Unlike most research in automatic term extraction, where single metrics are usually combined indiscriminately to produce the best results, SRC is grounded on the theoretical principles of salience, relevance and cohesion, which have been rationally implemented in the three components of this metric.
6

DeCastro-Arrazola, Varuṇ. "Testing the robustness of final strictness in verse lines." Studia Metrica et Poetica 5, no. 2 (January 28, 2019): 55–76. http://dx.doi.org/10.12697/smp.2018.5.2.03.

Full text
Abstract:
In the field of metrics, it has long been observed that verse lines tend to be more regular or restricted towards the end (Arnold 1905). This has led to the Strict End Hypothesis [SEH], which proposes a general versification principle of universal scope (Hayes 1983). This paper argues that two main challenges hinder the substantiation of the SEH in a broad typological sample of unrelated verse corpora. First, the concept of strictness is too coarse and needs to be narrowed down to testable features or subcomponents. Second, explicit measures need to be developed which enable the systematic comparison of corpora, particularly when trying to capture potentially gradient features such as the relative faithfulness to a metrical template. This study showcases how to overcome these issues by analysing the entropy at different positions in the line for corpora in five languages (English, Dutch, Sanskrit, Estonian, Berber). Finally, I argue that, if the SEH is shown to be typologically robust, shared human cognitive features may provide a partial explanation for this puzzling asymmetry in verse lines.
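The line-position entropy at the heart of this study can be illustrated with a minimal sketch (toy data; the paper's corpora and feature coding are of course richer):

```python
import math
from collections import Counter

def entropy(symbols) -> float:
    """Shannon entropy (bits) of the empirical distribution of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Hypothetical corpus: each line codes syllables as S (strong) or w (weak).
lines = ["wSwSwSwS", "SwwSwSwS", "wSwSwwSS", "wwSwwSwS"]

# Entropy per metrical position; the Strict End Hypothesis predicts
# lower (more restricted) values towards the end of the line.
for pos in range(8):
    print(pos, round(entropy([line[pos] for line in lines]), 3))
```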
7

Pietquin, Olivier, and Helen Hastie. "A survey on metrics for the evaluation of user simulations." Knowledge Engineering Review 28, no. 1 (November 28, 2012): 59–73. http://dx.doi.org/10.1017/s0269888912000343.

Full text
Abstract:
User simulation is an important research area in the field of spoken dialogue systems (SDSs) because collecting and annotating real human–machine interactions is often expensive and time-consuming. However, such data are generally required for designing, training and assessing dialogue systems. User simulations are especially needed when using machine learning methods for optimizing dialogue management strategies such as Reinforcement Learning, where the amount of data necessary for training is larger than existing corpora. The quality of the user simulation is therefore of crucial importance because it dramatically influences the results in terms of SDS performance analysis and the learnt strategy. Assessment of the quality of simulated dialogues and user simulation methods is an open issue and, although assessment metrics are required, there is no commonly adopted metric. In this paper, we give a survey of User Simulations Metrics in the literature, propose some extensions and discuss these metrics in terms of a list of desired features.
8

Chee, Qian Wen, Keng Ji Chow, Winston D. Goh, and Melvin J. Yap. "LexiCAL: A calculator for lexical variables." PLOS ONE 16, no. 4 (April 30, 2021): e0250891. http://dx.doi.org/10.1371/journal.pone.0250891.

Full text
Abstract:
While a number of tools have been developed for researchers to compute the lexical characteristics of words, extant resources are limited in their usability and functionality. Specifically, some tools require users to have prior knowledge of certain aspects of the applications, and not all tools allow users to specify their own corpora. Additionally, current tools are also limited in terms of the range of metrics that they can compute. To address these methodological gaps, this article introduces LexiCAL, a fast, simple, and intuitive calculator for lexical variables. Specifically, LexiCAL is a standalone executable that provides options for users to calculate a range of theoretically influential surface, orthographic, phonological, and phonographic metrics for any alphabetic language, using any user-specified input, corpus file, and phonetic system. LexiCAL also comes with a set of well-documented Python scripts for each metric, that can be reproduced and/or modified for other research purposes.
9

Yan, Jianwei. "Morphology and word order in Slavic languages: Insights from annotated corpora." Voprosy Jazykoznanija, no. 4 (2021): 131. http://dx.doi.org/10.31857/0373-658x.2021.4.131-159.

Full text
Abstract:
Slavic languages are generally assumed to possess rich morphological features with free syntactic word order. Exploring this complexity trade-off can help us better understand the relationship between morphology and syntax within natural languages. However, few quantitative investigations have been carried out into this relationship within Slavic languages. Based on 34 annotated corpora from Universal Dependencies, this paper paid special attention to the correlations between morphology and syntax within Slavic languages by applying two metrics of morphological richness and two of word order freedom, respectively. Our findings are as follows. First, the quantitative metrics adopted can well capture the distributions of morphological richness and word order freedom of languages. Second, the metrics can corroborate the correlation between morphological richness and word order freedom. Within Slavic languages, this correlation is moderate and statistically significant. Precisely, the richer the morphology, the less strict the word order. Third, Slavic languages can be clustered into three subgroups based on classification models. Most importantly, ancient Slavic languages are characterized by richer morphology and more flexible word order than modern ones. Fourth, as two possible disturbing factors, corpus size does not greatly affect the results of the metrics, whereas corpus genre does play an important part in the measurements of word order freedom. Specifically, the word order of formal written genres tends to be more rigid than that of informal written and spoken ones. Overall, based on annotated corpora, the results verify the negative correlation between morphological richness and word order rigidity within Slavic languages, which might shed light on the dynamic relations between morphology and syntax of natural languages and provide quantitative instantiations of how languages encode lexical and syntactic information for the purpose of efficient communication.
10

Hardie, Andrew. "Part-of-speech ratios in English corpora." International Journal of Corpus Linguistics 12, no. 1 (March 16, 2007): 55–81. http://dx.doi.org/10.1075/ijcl.12.1.05har.

Full text
Abstract:
Using part-of-speech (POS) tagged corpora, Hudson (1994) reports that approximately 37% of English tokens are nouns, where ‘noun’ is a superordinate category including nouns, pronouns and other word-classes. It is argued here that difficulties relating to the boundaries of Hudson’s ‘noun’ category demonstrate that there is no uncontroversial way to derive such a superordinate category from POS tagging. Decisions regarding the boundary of the ‘noun’ category have small but statistically significant effects on the ratio that emerges for ‘nouns’ as a whole. Tokenisation and categorisation differences between tagging schemes make it problematic to compare the ratio of ‘nouns’ across different tagsets. The precise figures for POS ratios are therefore effectively artefacts of the tagset. However, these objections to the use of POS ratios do not apply to their use as a metric of variation for comparing datasets tagged with the same tagging scheme.
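The tagset-dependence argued for here is easy to see in a small sketch: the 'noun' ratio changes with where the category boundary is drawn. The tags below are CLAWS-like, but the example data is hypothetical:

```python
from collections import Counter

tagged = [("the", "AT0"), ("dog", "NN1"), ("it", "PNP"), ("barks", "VVZ"),
          ("loudly", "AV0"), ("dogs", "NN2")]

def noun_ratio(tokens, noun_tags) -> float:
    counts = Counter(tag for _, tag in tokens)
    return sum(counts[t] for t in noun_tags) / len(tokens)

print(noun_ratio(tagged, {"NN1", "NN2"}))         # common nouns only: 0.33
print(noun_ratio(tagged, {"NN1", "NN2", "PNP"}))  # superset including pronouns: 0.5
```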
11

Sakaguchi, Keisuke, Courtney Napoles, Matt Post, and Joel Tetreault. "Reassessing the Goals of Grammatical Error Correction: Fluency Instead of Grammaticality." Transactions of the Association for Computational Linguistics 4 (December 2016): 169–82. http://dx.doi.org/10.1162/tacl_a_00091.

Full text
Abstract:
The field of grammatical error correction (GEC) has grown substantially in recent years, with research directed at both evaluation metrics and improved system performance against those metrics. One unvisited assumption, however, is the reliance of GEC evaluation on error-coded corpora, which contain specific labeled corrections. We examine current practices and show that GEC’s reliance on such corpora unnaturally constrains annotation and automatic evaluation, resulting in (a) sentences that do not sound acceptable to native speakers and (b) system rankings that do not correlate with human judgments. In light of this, we propose an alternate approach that jettisons costly error coding in favor of unannotated, whole-sentence rewrites. We compare the performance of existing metrics over different gold-standard annotations, and show that automatic evaluation with our new annotation scheme has very strong correlation with expert rankings (ρ = 0.82). As a result, we advocate for a fundamental and necessary shift in the goal of GEC, from correcting small, labeled error types, to producing text that has native fluency.
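The reported ρ = 0.82 is a Spearman rank correlation between metric scores and expert rankings; a minimal sketch of such a check with toy numbers (not the paper's data):

```python
from scipy.stats import spearmanr

metric_scores = [0.61, 0.55, 0.72, 0.40, 0.66]  # hypothetical metric outputs per system
expert_ranks  = [3, 4, 1, 5, 2]                 # 1 = judged best by experts

# Negate ranks so that a higher score should align with a better (lower) rank.
rho, p = spearmanr(metric_scores, [-r for r in expert_ranks])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```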
12

Karyukin, Vladislav, Diana Rakhimova, Aidana Karibayeva, Aliya Turganbayeva, and Asem Turarbek. "The neural machine translation models for the low-resource Kazakh–English language pair." PeerJ Computer Science 9 (February 8, 2023): e1224. http://dx.doi.org/10.7717/peerj-cs.1224.

Full text
Abstract:
The development of the machine translation field was driven by people's need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become one of the most significant in recent years. This approach requires large parallel corpora not available for low-resource languages, such as the Kazakh language, which makes it difficult to achieve the high performance of neural machine translation models. This article explores the existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and improving the performance of the Kazakh–English machine translation models. These methods are called forward translation, backward translation, and transfer learning. Then the Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are considered for conducting experiments in training models on parallel corpora. The experimental part focuses on building translation models for the high-quality translation of formal social, political, and scientific texts with the synthetic parallel sentences from existing monolingual data in the Kazakh language using the forward translation approach and combining them with the parallel corpora parsed from the official government websites. The total corpus of 380,000 parallel Kazakh–English sentences is used to train the recurrent neural network, bidirectional recurrent neural network, and Transformer models of the OpenNMT framework. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are also analyzed. The RNN and BRNN models showed a more precise translation than the Transformer model. The Byte-Pair Encoding tokenization technique showed better metric scores and translations than the word tokenization technique. The bidirectional recurrent neural network with the Byte-Pair Encoding technique showed the best performance, with 0.49 BLEU, 0.51 WER, and 0.45 TER.
13

Goel, Anmol, and Ponnurangam Kumaraguru. "Detecting Lexical Semantic Change across Corpora with Smooth Manifolds (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 18 (May 18, 2021): 15783–84. http://dx.doi.org/10.1609/aaai.v35i18.17888.

Full text
Abstract:
Comparing two bodies of text and detecting words with significant lexical semantic shift between them is an important part of digital humanities. Traditional approaches have relied on aligning the different embeddings using the Orthogonal Procrustes problem in the Euclidean space. This study presents a geometric framework that leverages smooth Riemannian manifolds for corpus-specific orthogonal rotations and a corpus-independent scaling metric to project the different vector spaces into a shared latent space. This enables us to capture any affine relationship between the embedding spaces while utilising the rich geometry of smooth manifolds.
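The Euclidean baseline mentioned here, aligning two embedding spaces via the Orthogonal Procrustes problem, has a closed-form SVD solution; a minimal sketch (the paper's Riemannian-manifold generalization is not reproduced):

```python
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||X @ R - Y||_F for (n_words, dim) embeddings
    over a shared vocabulary: R = U @ Vt, where U, S, Vt = svd(X.T @ Y)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
R_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))  # a random orthogonal map
Y = X @ R_true
print(np.allclose(X @ procrustes_align(X, Y), Y))    # True: alignment recovered
```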
14

Iosif, Elias, and Alexandros Potamianos. "Similarity computation using semantic networks created from web-harvested data." Natural Language Engineering 21, no. 1 (July 26, 2013): 49–79. http://dx.doi.org/10.1017/s1351324913000144.

Full text
Abstract:
We investigate language-agnostic algorithms for the construction of unsupervised distributional semantic models using web-harvested corpora. Specifically, a corpus is created from web document snippets, and the relevant semantic similarity statistics are encoded in a semantic network. We propose the notion of semantic neighborhoods that are defined using co-occurrence or context similarity features. Three neighborhood-based similarity metrics are proposed, motivated by the hypotheses of attributional and maximum sense similarity. The proposed metrics are evaluated against human similarity ratings achieving state-of-the-art results.
15

Silveira, R., V. Furtado, and V. Pinheiro. "Learning keyphrases from corpora and knowledge models." Natural Language Engineering 26, no. 3 (September 10, 2019): 293–318. http://dx.doi.org/10.1017/s1351324919000342.

Full text
Abstract:
Keyphrase extraction systems traditionally use classification algorithms and do not consider the fact that part of the keyphrases may not be found in the text, reducing the accuracy of such algorithms a priori. In this work, we propose to improve the accuracy of these systems with inferential mechanisms that use a knowledge representation model, including symbolic models of knowledge bases and distributional semantics, to expand the set of keyphrase candidates to be submitted to the classification algorithm with terms that are not in the text (not-in-text terms). The basic assumption we have is that not-in-text terms have a semantic relationship with terms that are in the text. To represent this relationship, we have defined two new features to be represented as input to the classification algorithms. The first feature refers to the power of discrimination of the inferred not-in-text terms. The intuition behind this is that good candidates for a keyphrase are those that are deduced from various textual terms in a specific document and that are not often deduced in other documents. The other feature represents the descriptive strength of a not-in-text candidate. We argue that not-in-text keyphrases must have a strong semantic relationship with the text and that the power of this semantic relationship can be measured in a similar way as popular metrics like TFxIDF. The method proposed in this work was compared with state-of-the-art systems using five corpora and the results show that it has significantly improved automatic keyphrase extraction, dealing with the limitation of extracting keyphrases absent from the text.
16

Dziri, Nouha, Hannah Rashkin, Tal Linzen, and David Reitter. "Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark." Transactions of the Association for Computational Linguistics 10 (2022): 1066–83. http://dx.doi.org/10.1162/tacl_a_00506.

Full text
Abstract:
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (Begin), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use Begin to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make Begin publicly available at https://github.com/google/BEGIN-dataset.
17

Hashimoto, Tatsunori B., David Alvarez-Melis, and Tommi S. Jaakkola. "Word Embeddings as Metric Recovery in Semantic Spaces." Transactions of the Association for Computational Linguistics 4 (December 2016): 273–86. http://dx.doi.org/10.1162/tacl_a_00098.

Full text
Abstract:
Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with a Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.
18

Periñán-Pascual, Carlos. "DEXTER: A workbench for automatic term extraction with specialized corpora." Natural Language Engineering 24, no. 2 (October 5, 2017): 163–98. http://dx.doi.org/10.1017/s1351324917000365.

Full text
Abstract:
Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.
19

Moro, Gianluca, and Lorenzo Valgimigli. "Efficient Self-Supervised Metric Information Retrieval: A Bibliography Based Method Applied to COVID Literature." Sensors 21, no. 19 (September 26, 2021): 6430. http://dx.doi.org/10.3390/s21196430.

Full text
Abstract:
The literature on coronaviruses counts more than 300,000 publications. Finding relevant papers concerning arbitrary queries is essential to discovering helpful knowledge. Current best information retrieval (IR) systems use deep learning approaches and need supervised training sets with labeled data, namely to know a priori the queries and their corresponding relevant papers. Creating such labeled datasets is time-expensive and requires prominent experts' efforts, resources insufficiently available under a pandemic time pressure. We present a new self-supervised solution, called SUBLIMER, that does not require labels to learn to search corpora of scientific papers for those most relevant to arbitrary queries. SUBLIMER is a novel efficient IR engine trained on the unsupervised COVID-19 Open Research Dataset (CORD19), using deep metric learning. The core point of our self-supervised approach is that it uses no labels, but exploits the bibliography citations from papers to create a latent space where spatial proximity is a metric of semantic similarity; for this reason, it can also be applied to corpora of papers in other domains. SUBLIMER, despite being self-supervised, outperforms the Precision@5 (P@5) and Bpref of the state-of-the-art competitors on CORD19, which, differently from our approach, require both labeled datasets and a number of trainable parameters an order of magnitude higher than ours.
20

Kim, Jooyeon, Dongwoo Kim, and Alice Oh. "Joint Modeling of Topics, Citations, and Topical Authority in Academic Corpora." Transactions of the Association for Computational Linguistics 5 (December 2017): 191–204. http://dx.doi.org/10.1162/tacl_a_00055.

Full text
Abstract:
Much of scientific progress stems from previously published findings, but searching through the vast sea of scientific publications is difficult. We often rely on metrics of scholarly authority to find the prominent authors, but these authority indices do not differentiate authority based on research topics. We present Latent Topical-Authority Indexing (LTAI) for jointly modeling the topics, citations, and topical authority in a corpus of academic papers. Compared to previous models, LTAI differs in two main aspects. First, it explicitly models the generative process of the citations, rather than treating the citations as given. Second, it models each author's influence on citations of a paper based on the topics of the cited papers, as well as the citing papers. We fit LTAI to four academic corpora: CORA, Arxiv Physics, PNAS, and Citeseer. We compare the performance of LTAI against various baselines, starting with the latent Dirichlet allocation, to the more advanced models including author-link topic model and dynamic author citation topic model. The results show that LTAI achieves improved accuracy over other similar models when predicting words, citations and authors of publications.
21

Serban, Iulian Vlad, Ryan Lowe, Peter Henderson, Laurent Charlin, and Joelle Pineau. "A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version." Dialogue & Discourse 9, no. 1 (May 11, 2018): 1–49. http://dx.doi.org/10.5087/dad.2018.101.

Full text
Abstract:
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.
22

Mikhailova, Elena, Polina Diurdeva, and Dmitry Shalymov. "N-Gram Based Approach for Text Authorship Classification." International Journal of Embedded and Real-Time Communication Systems 8, no. 2 (July 2017): 24–39. http://dx.doi.org/10.4018/ijertcs.2017070102.

Full text
Abstract:
Automated authorship attribution identifies the author of anonymous texts, or of texts whose authorship is in doubt. It can be used in various applications including author verification, plagiarism detection, computer forensics and others. In this article, the authors investigate an approach based on letter frequency combinations for solving the task of classifying documents by authorship. This technique could also be used to identify the author of a computer program from a predefined set of possible authors. The effectiveness of this approach is significantly determined by the choice of metric. The research examines and compares four different distance measures between a text of unknown authorship and an author's profile: the L1 measure, Kullback-Leibler divergence, the base metric of the Common N-gram method (CNG), and a certain variation of the dissimilarity measure of the CNG method. The comparison outlines cases in which some metric outperforms the others with a specific parameter combination. Experiments are conducted on different Russian and English corpora.
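For illustration, a minimal sketch of the base CNG dissimilarity in its commonly cited relative-difference form (the authors' own variation is not reproduced here; the input files are hypothetical):

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3, size: int = 500) -> dict:
    """Normalized frequencies of the `size` most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}

def cng_dissimilarity(p1: dict, p2: dict) -> float:
    """Sum over the union of both profiles of (2(f1 - f2) / (f1 + f2))^2."""
    return sum((2 * (p1.get(g, 0.0) - p2.get(g, 0.0)) /
                (p1.get(g, 0.0) + p2.get(g, 0.0))) ** 2
               for g in set(p1) | set(p2))

# Hypothetical usage: compare an unknown text against an author's profile.
author = ngram_profile(open("author_corpus.txt").read())
unknown = ngram_profile(open("unknown_text.txt").read())
print(cng_dissimilarity(author, unknown))
```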
23

Walkden, George. "The HeliPaD." International Journal of Corpus Linguistics 21, no. 4 (November 28, 2016): 559–71. http://dx.doi.org/10.1075/ijcl.21.4.05wal.

Full text
Abstract:
This short paper introduces the HeliPaD, a new parsed corpus of Old Saxon (Old Low German). It is annotated according to the standards of the Penn Corpora of Historical English, enriched with lemmatization and additional morphological attributes as well as textual and metrical annotation. This paper provides an overview of its main features and compares it to existing resources such as the Deutsch Diachron Digital version of the Old Saxon Heliand as part of the Referenzkorpus Altdeutsch. It closes with a roadmap for planned future expansions.
24

Longhi, Julien. "Proposals for a Discourse Analysis Practice Integrated into Digital Humanities: Theoretical Issues, Practical Applications, and Methodological Consequences." Languages 5, no. 1 (January 20, 2020): 5. http://dx.doi.org/10.3390/languages5010005.

Full text
Abstract:
In this article, I put forward a linguistic analysis model for analyzing meaning which is based on a methodology that falls within the wider framework of the digital humanities and is equipped with digital tools that meet the theoretical requirements stated. First, I propose a conception of the digital humanities which favors a close relationship between digital technology and the humanities. This general framework will justify the use of a number of models embodied in a dynamic conception of language. This dynamism will then be reflected in the choice of metrics and textual analysis tools (developed in the field of textometry, especially the Iramuteq software). The semantic functioning of linguistic units will be described by using these tools within the identified methodological framework and will help to better understand the processes of variations, whether temporal or generic, within vast discursive corpora. I propose a way of analyzing corpora with specific tools, confronting the humanities with computing/numerical technology.
25

Wolk, Christoph, and Benedikt Szmrecsanyi. "Probabilistic corpus-based dialectometry." Journal of Linguistic Geography 6, no. 1 (April 2018): 56–75. http://dx.doi.org/10.1017/jlg.2018.6.

Full text
Abstract:
Researchers in dialectometry have begun to explore measurements based on fundamentally quantitative metrics, often sourced from dialect corpora, as an alternative to the traditional signals derived from dialect atlases. This change of data type amplifies an existing issue in the classical paradigm, namely that locations may vary in coverage and that this affects the distance measurements: pairs involving a location with lower coverage suffer from greater noise and therefore imprecision. We propose a method for increasing robustness using generalized additive modeling, a statistical technique that allows leveraging the spatial arrangement of the data. The technique is applied to data from the British English dialect corpus FRED; the results are evaluated regarding their interpretability and according to several quantitative metrics. We conclude that data availability is an influential covariate in corpus-based dialectometry and beyond, and recommend that researchers be aware of this issue and of methods to alleviate it.
26

Temperley, David. "Modeling Common-Practice Rhythm." Music Perception 27, no. 5 (June 1, 2010): 355–76. http://dx.doi.org/10.1525/mp.2010.27.5.355.

Full text
Abstract:
This study explores ways of modeling the compositional processes involved in common-practice rhythm (as represented by European classical music and folk music). Six probabilistic models of rhythm were evaluated using the method of cross-entropy: according to this method, the best model is the one that assigns the highest probability to the data. Two corpora were used: a corpus of European folk songs (the Essen Folksong Collection) and a corpus of Mozart and Haydn string quartets. The model achieving the lowest cross-entropy was the First-Order Metrical Duration Model, which chooses a metrical position for each note conditional on the position of the previous note. Second best was the Hierarchical Position Model, which decides at each beat whether or not to generate a note there, conditional on the note status of neighboring strong beats (i.e., whether or not they contain notes). When complexity (number of parameters) is also considered, it is argued that the Hierarchical Position Model is preferable overall.
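The selection criterion here is corpus cross-entropy, which in its usual formulation, for a model m and test data x_1, ..., x_N, is

\[ H(m) \;=\; -\frac{1}{N}\sum_{i=1}^{N}\log_2 m(x_i), \]

so the model assigning the highest probability to the data attains the lowest cross-entropy.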
27

Busso, Lucia. "An investigation of the lexico-grammatical profile of English legal- lay language." Language and Law=Linguagem e Direito 9, no. 1 (2022): 146–84. http://dx.doi.org/10.21747/21833745/lanlaw/9_1a7.

Full text
Abstract:
The article presents a study on the lexico-grammar of the genre of English legal-lay language (Tiersma 1999), using the English subcorpus of the CorIELLS corpus (Busso forthcoming). The study explores four grammatical constructions (in Goldberg 2006's Construction Grammar sense): nominalisations heading prepositional phrase attachments, modal verb constructions, participial reduced relative constructions, and passive constructions. Specifically, we use collostructional analysis (Stefanowitsch 2013), followed by a vocabulary analysis using English core vocabulary as a reference (Brezina and Gablasova 2015), and a comparative frequency analysis with corpora of legal language and general-domain written prose. Results of this first part of the study foreground how legal-lay language is quantitatively different from both neighbouring genres, suggesting that it might be considered a "blended" genre. We further explore the data in terms of accessibility for speakers, using readability metrics and a survey on English participants. Both methods show that legal-lay language is at an intermediate level of complexity between legal jargon and general-domain prose; however, we further note that readability metrics generally underestimate speakers' ability to comprehend legal-lay language.
28

Xu, Wei, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. "Optimizing Statistical Machine Translation for Text Simplification." Transactions of the Association for Computational Linguistics 4 (December 2016): 401–15. http://dx.doi.org/10.1162/tacl_a_00107.

Full text
Abstract:
Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.
29

Kokkinakis, Dimitrios. "PP-Attachment Disambiguation for Swedish: Combining Unsupervised and Supervised Training Data." Nordic Journal of Linguistics 23, no. 2 (December 2000): 191–213. http://dx.doi.org/10.1080/033258600750061518.

Full text
Abstract:
Structural ambiguity, particularly attachment of prepositional phrases, is a serious type of global ambiguity in Natural Language. The disambiguation becomes crucial when a syntactic analyzer must make the correct decision among at least two equally grammatical parse-trees for the same sentence. This paper attempts to find answers to the problem of how attachment ambiguity can be resolved by utilizing Machine Learning (ML) techniques. ML is founded on the assumption that the performance in cognitive tasks is based on the similarity of new situations (testing) to stored representations of earlier experiences (training). Therefore, a large amount of training data is an important prerequisite for providing a solution to the problem. A combination of unsupervised and restricted supervised acquisition of such data will be reported. Training is performed both on a subset of the content of the Gothenburg Lexical Database (GLDB), and on instances of large corpora annotated with coarse-grained semantic information. Testing is performed on corpora instances using a range of different algorithms and metrics. The application language is written Swedish.
30

Faria, Pablo. "Learning parts-of-speech through distributional analysis. Further results from Brazilian Portuguese." Diacrítica 33, no. 2 (December 16, 2019): 229–51. http://dx.doi.org/10.21814/diacritica.415.

Full text
Abstract:
A model of part-of-speech (or syntactic category) learning through distributional analysis – as a task in the language acquisition process – is presented here. It is based on Redington et al.'s (1998) model, but the input data used comes from publicly available corpora of both child-directed speech and speech between adults in Brazilian Portuguese. Results from four (out of nine) experiments are presented and discussed: experiments 2, 3, 4, and 7 of the original study. These experiments investigate variables such as the number of target and context words (2) and corpus size (4), and also evaluate the value of distributional information for different categories (3) and the learner's performance when functional categories are removed from the corpus (7). In general, our results support Redington et al.'s, although we find a few differences. We also evaluate the cosine metric, comparing it with the performance obtained with the Spearman rank correlation metric used in Redington et al.'s study. The latter seems to produce better results.
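Both similarity metrics compared in this study are standard; a minimal sketch of the two applied to hypothetical context-count vectors for two target words:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical co-occurrence counts of two target words with five shared context words.
w1 = np.array([10.0, 0.0, 3.0, 7.0, 1.0])
w2 = np.array([8.0, 1.0, 2.0, 9.0, 0.0])

rho, _ = spearmanr(w1, w2)
print("cosine:  ", round(cosine(w1, w2), 3))
print("spearman:", round(rho, 3))
```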
31

Liu, Kanglong, Rongguang Ye, Zhongzhu Liu, and Rongye Ye. "Entropy-based discrimination between translated Chinese and original Chinese using data mining techniques." PLOS ONE 17, no. 3 (March 24, 2022): e0265633. http://dx.doi.org/10.1371/journal.pone.0265633.

Full text
Abstract:
The present research reports on the use of data mining techniques for differentiating between translated and non-translated original Chinese based on monolingual comparable corpora. We operationalized seven entropy-based metrics including character, wordform unigram, wordform bigram and wordform trigram, POS (Part-of-speech) unigram, POS bigram and POS trigram entropy from two balanced Chinese comparable corpora (translated vs non-translated) for data mining and analysis. We then applied four data mining techniques including Support Vector Machines (SVMs), Linear discriminant analysis (LDA), Random Forest (RF) and Multilayer Perceptron (MLP) to distinguish translated Chinese from original Chinese based on these seven features. Our results show that SVMs is the most robust and effective classifier, yielding an AUC of 90.5% and an accuracy rate of 84.3%. Our results have affirmed the hypothesis that translational language is categorically different from original language. Our research demonstrates that combining information-theoretic indicator of Shannon’s entropy together with machine learning techniques can provide a novel approach for studying translation as a unique communicative activity. This study has yielded new insights for corpus-based studies on the translationese phenomenon in the field of translation studies.
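As an illustration of the feature set, a minimal sketch of one of the seven metrics, wordform-bigram entropy, on pre-segmented text (toy data; the study uses balanced, POS-tagged Chinese corpora):

```python
import math
from collections import Counter

def ngram_entropy(tokens, n: int = 2) -> float:
    """Shannon entropy (bits) of the n-gram distribution of a token sequence."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(grams.values())
    return -sum(c / total * math.log2(c / total) for c in grams.values())

translated = "他 说 这 是 一 个 很 好 的 例子 他 说 这 是".split()
original   = "他 说 这 个 例子 很 好 我们 都 同意 了".split()
print(ngram_entropy(translated), ngram_entropy(original))
```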
32

Day, Peter, and Asoke K. Nandi. "Genetic Programming for Robust Text Independent Speaker Verification." International Journal of Signs and Semiotic Systems 2, no. 2 (July 2012): 1–22. http://dx.doi.org/10.4018/ijsss.2012070101.

Full text
Abstract:
Robust Automatic Speaker Verification has become increasingly desirable in recent years with the growing trend toward remote security verification procedures for telephone banking, biometric security measures and similar applications. While many approaches have been applied to this problem, Genetic Programming offers inherent feature selection and solutions that can be meaningfully analyzed, making it well suited for this task. This article introduces a Genetic Programming system to evolve programs capable of speaker verification and evaluates its performance with the publicly available TIMIT corpora. Also presented are the effects of a simulated telephone network on classification results, which highlight the principal advantage, namely robustness to both additive and convolutive noise.
33

Goldsmith, John. "Unsupervised Learning of the Morphology of a Natural Language." Computational Linguistics 27, no. 2 (June 2001): 153–98. http://dx.doi.org/10.1162/089120101750300490.

Full text
Abstract:
This study reports the results of using minimum description length (MDL) analysis to model unsupervised learning of the morphological segmentation of European languages, using corpora ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly develop a probabilistic morphological grammar, and use MDL as our primary tool to determine whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar matches well the analysis that would be developed by a human morphologist. In the final section, we discuss the relationship of this style of MDL grammatical analysis to the notion of evaluation metric in early generative grammar.
34

Bendová, Klára. "Using a parallel corpus to adapt the Flesch Reading Ease formula to Czech." Journal of Linguistics/Jazykovedný casopis 72, no. 2 (December 1, 2021): 477–87. http://dx.doi.org/10.2478/jazcas-2021-0044.

Full text
Abstract:
Text readability metrics assess how much effort a reader must put into comprehending a given text. They are used, e.g., to choose appropriate readings for different student proficiency levels, or to make sure that crucial information is efficiently conveyed (e.g., in an emergency). Flesch Reading Ease is one such globally used formula; it is even integrated into the MS Word processor. However, its constants are language-dependent. The original formula was created for English, and so far it has been adapted to several European languages, Bangla, and Hindi. This paper describes the Czech adaptation, with the language-dependent constants optimized by a machine-learning algorithm working on parallel corpora of Czech and English, Russian, Italian, and French, respectively.
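For reference, the original English formula whose language-dependent constants (206.835, 1.015, 84.6) the authors re-fit for Czech is

\[ \mathrm{FRE} \;=\; 206.835 \;-\; 1.015\,\frac{\text{words}}{\text{sentences}} \;-\; 84.6\,\frac{\text{syllables}}{\text{words}}, \]

with higher scores indicating easier text.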
35

Ferreira Cruz, André, Gil Rocha, and Henrique Lopes Cardoso. "Coreference Resolution: Toward End-to-End and Cross-Lingual Systems." Information 11, no. 2 (January 30, 2020): 74. http://dx.doi.org/10.3390/info11020074.

Full text
Abstract:
The task of coreference resolution has attracted considerable attention in the literature due to its importance in deep language understanding and its potential as a subtask in a variety of complex natural language processing problems. In this study, we outline the field's terminology and describe existing metrics, their differences and shortcomings, as well as the available corpora and external resources. We analyze existing state-of-the-art models and approaches, and review recent advances and trends in the field, namely end-to-end systems that jointly model different subtasks of coreference resolution, and cross-lingual systems that aim to overcome the challenges of less-resourced languages. Finally, we discuss the main challenges and open issues faced by coreference resolution systems.
36

Ohriner, Mitchell. "Metric Ambiguity and Flow in Rap Music: A Corpus-Assisted Study of Outkast's 'Mainstream' (1996)." Empirical Musicology Review 11, no. 2 (January 10, 2017): 153. http://dx.doi.org/10.18061/emr.v11i2.4896.

Full text
Abstract:
Recent years have seen the rise of musical corpus studies, primarily detailing harmonic tendencies of tonal music. This article extends this scholarship by addressing a new genre (rap music) and a new parameter of focus (rhythm). More specifically, I use corpus methods to investigate the relation between metric ambivalence in the instrumental parts of a rap track (i.e., the beat) and an emcee's rap delivery (i.e., the flow). Unlike virtually every other rap track, the instrumental tracks of Outkast's "Mainstream" (1996) simultaneously afford hearing both a four-beat and a three-beat metric cycle. Because three-beat durations between rhymes, phrase endings, and reiterated rhythmic patterns are rare in rap music, an abundance of them within a verse of "Mainstream" suggests that an emcee highlights the three-beat cycle, especially if that emcee is not prone to such durations more generally. Through the construction of three corpora, one representative of the genre as a whole, and two that are artist specific, I show how the emcee T-Mo Goodie's expressive practice highlights the rare three-beat affordances of the track.
37

Yu, Zehao. "Two Improved Topic Word Detection Algorithms." International Journal of Software Engineering and Knowledge Engineering 30, no. 08 (August 2020): 1097–118. http://dx.doi.org/10.1142/s0218194020400173.

Full text
Abstract:
Topic word extraction is the task of identifying single or multi-word expressions that represent the main topics of a document. In this paper, two improved algorithms for extracting and discovering topic words are proposed: the Rapid Topic word Detection (RTD) algorithm and the CategoryTextRank (CTextRank) algorithm, which can effectively obtain information by extracting and filtering the topic words in a text. The algorithms overcome the shortcomings of traditional topic word discovery algorithms, which require deep linguistic knowledge or domain- or language-specific annotated corpora. Both proposed algorithms can process short as well as long texts. Their biggest advantage is that they are unsupervised machine learning algorithms: they need no training and process text directly to obtain topic words. Accuracy, recall, and F-measure improve greatly when using the two algorithms, and the results obtained compare favorably with previously published results on the Inspec and SemEval datasets. The first algorithm, RTD, improves these metrics compared to PositionRank and TextRank; the second algorithm, CTextRank, improves them compared to TextRank, SingleRank, and TF-IDF.
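Both proposed algorithms build on TextRank-style graph ranking; as background, a minimal sketch of the underlying idea, PageRank over a word co-occurrence graph (not the authors' RTD or CTextRank code):

```python
import networkx as nx

def textrank_keywords(tokens, window: int = 2, top_k: int = 5):
    """Rank words by PageRank over a sliding-window co-occurrence graph."""
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:i + 1 + window]:
            if u != w:
                g.add_edge(w, u)
    scores = nx.pagerank(g)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

tokens = ("topic word extraction identifies expressions that represent "
          "the main topics of a document").split()
print(textrank_keywords(tokens))
```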
38

Bonet-Solà, Daniel, and Rosa Ma Alsina-Pagès. "A Comparative Survey of Feature Extraction and Machine Learning Methods in Diverse Acoustic Environments." Sensors 21, no. 4 (February 11, 2021): 1274. http://dx.doi.org/10.3390/s21041274.

Full text
Abstract:
Acoustic event detection and analysis has been widely developed in the last few years for its valuable application in monitoring elderly or dependent people, for surveillance issues, for multimedia retrieval, or even for biodiversity metrics in natural environments. For this purpose, sound source identification is a key issue in giving a smart technological answer to all the aforementioned applications. Diverse types of sounds and varied environments, together with a number of challenges in terms of application, widen the choice of artificial intelligence algorithms. This paper presents a comparative study on combining several feature extraction algorithms (Mel Frequency Cepstrum Coefficients (MFCC), Gammatone Cepstrum Coefficients (GTCC), and Narrow Band (NB)) with a group of machine learning algorithms (k-Nearest Neighbor (kNN), Neural Networks (NN), and Gaussian Mixture Model (GMM)), tested over five different acoustic environments. This work has the goal of detailing a best-practice method and evaluating the reliability of this general-purpose approach for all the classes. Preliminary results show that most of the combinations of feature extraction and machine learning present acceptable results in most of the described corpora. Nevertheless, one combination outperforms the others: GTCC together with kNN; its results are further analyzed for all the corpora.
39

Li, Zhe, Mieradilijiang Maimaiti, Jiabao Sheng, Zunwang Ke, Wushour Silamu, Qinyong Wang, and Xiuhong Li. "An Empirical Study on Deep Neural Network Models for Chinese Dialogue Generation." Symmetry 12, no. 11 (October 23, 2020): 1756. http://dx.doi.org/10.3390/sym12111756.

Full text
Abstract:
The task of dialogue generation has attracted increasing attention due to its diverse downstream applications, such as question-answering systems and chatbots. Recently, deep neural network (DNN)-based dialogue generation models have achieved superior performance against conventional models utilizing statistical machine learning methods. However, although an enormous number of state-of-the-art DNN-based models have been proposed, there is a lack of detailed empirical comparative analysis of them on open Chinese corpora. As a result, relevant researchers and engineers might find it hard to get an intuitive understanding of the current research progress. To address this challenge, we conducted an empirical study of state-of-the-art DNN-based dialogue generation models on various Chinese corpora. Specifically, extensive experiments were performed on several well-known single-turn and multi-turn dialogue corpora, including KdConv, Weibo, and Douban, to evaluate a wide range of dialogue generation models based on the symmetrical architecture of Seq2Seq, RNNSearch, Transformer, generative adversarial nets, and reinforcement learning, respectively. Moreover, we paid special attention to the prevalent pre-trained model for the quality of dialogue generation. Their performances were evaluated by four widely used metrics in this area: BLEU, pseudo, distinct, and rouge. Finally, we report a case study to show example responses generated by these models separately.
40

Véliz C., Mauricio. "La Fonología del Foco Contrastivo en la variedad de inglés denominada RP y español de Chile." Literatura y Lingüística, no. 21 (June 26, 2015): 61. http://dx.doi.org/10.29344/0717621x.21.134.

Full text
Abstract:
This paper seeks to determine and compare the intonational mechanisms used to mark contrast in RP English and Chilean Spanish. To this end, corpora of spontaneous speech in Chilean Spanish and RP English were used. The contrastive utterances were subjected to acoustic analysis using specialized software and the Autosegmental-Metrical model of intonational phonology. The most salient conclusions are the following: (i) prosodic marking of contrast appears to be a far more predominant feature in RP English than in Chilean Spanish; (ii) Spanish presents two patterns that occur with some frequency, (H*+L) and (L+H*, L*+H), whereas in English the use of (H*) far outnumbers the other patterns also detected; and (iii) in English, (H*) is used in more than 50% of the detected cases. Keywords: focus, contrastive focus, peak accent, pitch accent, intonation pattern.
APA, Harvard, Vancouver, ISO, and other styles
41

Peng, Rachel X., and Ryan Yang Wang. "Understanding information needs during COVID-19: A comparison study between an online health community and a Q&A platform." Health Informatics Journal 28, no. 4 (October 2022): 146045822211424. http://dx.doi.org/10.1177/14604582221142443.

Full text
Abstract:
This paper aims at identifying users' information needs regarding Coronavirus, and the differences in those needs between the online health community MedHelp and the question-and-answer forum Quora, during the COVID-19 global pandemic. We obtained the posts in the sub-community Coronavirus on MedHelp (195 posts with 1627 answers) and under the topic COVID-19 (2019-2020) on Quora (263 posts with 8401 answers) via web scraping built on Selenium WebDriver. After preprocessing, we conducted topic modeling on both corpora and identified the best topic model for each corpus based on diagnostic metrics. Leveraging the improved sqrt-cosine similarity measurement, we further compared topic similarity between the two corpora. This study finds common information needs on both platforms about vaccination and the essential elements of the disease, including the onset symptoms, transmission routes, preventive measures, treatment, and control of COVID-19. Discussions unique to MedHelp concern psychological health and the therapeutic management of patients. Users on Quora show particular interest in information about the claimed association between vaccines and Luciferase, and in attacks on Fauci after his email trove was released. The work is beneficial for researchers who aim to provide accurate information assistance and build effective online emergency response programs during the pandemic.
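The cross-platform comparison hinges on the improved sqrt-cosine similarity between topic distributions. The hedged sketch below follows the Hellinger-kernel formulation ISC(x, y) = Σ√(x_i·y_i) / (√Σx_i · √Σy_i), one published variant; the paper's exact implementation may differ.

```python
import numpy as np

def improved_sqrt_cosine(x, y):
    # Assumes nonnegative inputs, e.g. rows of a document-topic matrix.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.sqrt(x * y)) / (np.sqrt(x.sum()) * np.sqrt(y.sum()))

# Two toy topic-probability vectors standing in for per-corpus topics.
print(improved_sqrt_cosine([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```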
APA, Harvard, Vancouver, ISO, and other styles
42

Mikhaylov, D., A. Kozlov, and G. Emelyanov. "An approach based on tf-idf metrics to extract the knowledge and relevant linguistic means on subject-oriented text sets." Computer Optics 39, no. 3 (2015): 429–38. http://dx.doi.org/10.18287/0134-2452-2015-39-3-429-438.

Full text
Abstract:
In this paper we look at the problem of extracting knowledge units from sets of subject-oriented texts. Each such text set is considered a corpus. The main practical goal is to find the most rational variant for expressing a knowledge fragment in a given natural language, for further reflection in the thesaurus and ontology of a subject area. The problem is important when constructing systems for the processing, analysis, estimation, and understanding of information represented, in particular, by images. By applying the TF-IDF metric to classify the words of an initial phrase relative to given text corpora, we address the task of selecting the phrases closest to the initial one in terms of the described fragment of actual knowledge or the forms of its expression in a given natural language.
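The approach rests on standard TF-IDF weighting, which is simple to reproduce. A minimal sketch with toy corpus texts (placeholders, not the paper's data), ranking one document's vocabulary by weight:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "image analysis and understanding of visual information",
    "ontology and thesaurus construction for a subject area",
    "estimation of knowledge fragments in natural language",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)

# Rank the first document's terms by TF-IDF weight, highest first.
terms = vec.get_feature_names_out()
row = tfidf[0].toarray().ravel()
print(sorted(zip(terms, row), key=lambda t: -t[1])[:5])
```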
APA, Harvard, Vancouver, ISO, and other styles
43

Wu, Danny TY, David A. Hanauer, Qiaozhu Mei, Patricia M. Clark, Lawrence C. An, Joshua Proulx, Qing T. Zeng, VG Vinod Vydiswaran, Kevyn Collins-Thompson, and Kai Zheng. "Assessing the readability of ClinicalTrials.gov." Journal of the American Medical Informatics Association 23, no. 2 (August 11, 2015): 269–75. http://dx.doi.org/10.1093/jamia/ocv062.

Full text
Abstract:
Objective: ClinicalTrials.gov serves the critical functions of disseminating trial information to the public and helping trials recruit participants. This study assessed the readability of trial descriptions at ClinicalTrials.gov using multiple quantitative measures. Materials and Methods: The analysis included all 165,988 trials registered at ClinicalTrials.gov as of April 30, 2014. To obtain benchmarks, the authors also analyzed 2 other medical corpora: (1) all 955 Health Topics articles from MedlinePlus and (2) a random sample of 100,000 clinician notes retrieved from an electronic health records system, intended to convey internal communication among medical professionals. The authors characterized each of the corpora using 4 surface metrics, and then applied 5 different scoring algorithms to assess their readability. The authors hypothesized that clinician notes would be most difficult to read, followed by trial descriptions and MedlinePlus Health Topics articles. Results: Trial descriptions have the longest average sentence length (26.1 words) across all corpora; 65% of the words they use are not covered by a basic medical English dictionary. In comparison, the average sentence length of MedlinePlus Health Topics articles is 61% shorter, the vocabulary size is 95% smaller, and dictionary coverage is 46% higher. All 5 scoring algorithms consistently rated ClinicalTrials.gov trial descriptions as the most difficult corpus to read, even harder than clinician notes. On average, it requires 18 years of education to properly understand these trial descriptions according to the results generated by the readability assessment algorithms. Discussion and Conclusion: Trial descriptions at ClinicalTrials.gov are extremely difficult to read. Significant work is warranted to improve their readability in order to achieve ClinicalTrials.gov's goal of facilitating information dissemination and subject recruitment.
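As a hedged sketch of the family of scores the study applies, the following computes the Flesch-Kincaid grade level, one widely used readability algorithm (the paper applied five; this may or may not be among them). The syllable counter is a crude vowel-group heuristic, not a study implementation.

```python
import re

def syllables(word):
    # Approximate syllables as runs of vowels; floor of one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59

print(round(fk_grade("Participants will receive a randomized intervention."), 1))
```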
APA, Harvard, Vancouver, ISO, and other styles
44

Raff, Edward, Charles Nicholas, and Mark McLean. "A New Burrows Wheeler Transform Markov Distance." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (April 3, 2020): 5444–53. http://dx.doi.org/10.1609/aaai.v34i04.5994.

Full text
Abstract:
Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering. BWMD is also more adaptable to other domains, which we demonstrate on malware classification tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.
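For readers unfamiliar with the underlying transform, here is a minimal sketch of the Burrows-Wheeler Transform itself via the classic rotation-sort construction; the BWMD's additional Markov modeling and fixed-length embedding are not shown.

```python
def bwt(s, terminator="$"):
    # Sort all cyclic rotations of s + terminator and take the last column.
    s = s + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # -> "annb$aa"
```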
APA, Harvard, Vancouver, ISO, and other styles
45

Manaris, Bill, David Johnson, and Yiorgos Vassilandonakis. "Harmonic Navigator: A Gesture-Driven, Corpus-Based Approach to Music Analysis, Composition, and Performance." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment 9, no. 5 (June 30, 2021): 67–74. http://dx.doi.org/10.1609/aiide.v9i5.12658.

Full text
Abstract:
We present a novel, real-time system for exploring harmonic spaces of musical styles, to generate music in collaboration with human performers utilizing gesture devices (such as the Kinect) together with MIDI and OSC instruments / controllers. This corpus-based environment incorporates statistical and evolutionary components for exploring potential flows through harmonic spaces, utilizing power-law (Zipf-based) metrics for fitness evaluation. It supports visual exploration and navigation of harmonic transition probabilities through interactive gesture control. These probabilities are computed from musical corpora (in MIDI format). Herein we utilize the Classical Music Archives 14,000+ MIDI corpus, among others. The user interface supports real-time exploration of the balance between predictability and surprise for musical composition and performance, and may be used in a variety of musical contexts and applications.
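The navigation described above rests on harmonic transition probabilities estimated from corpora. A minimal sketch of that estimation step, using toy Roman-numeral progressions in place of a MIDI corpus:

```python
from collections import Counter, defaultdict

def transition_probs(sequences):
    # First-order Markov estimates: P(next chord | current chord).
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

progressions = [["I", "IV", "V", "I"], ["I", "vi", "IV", "V"]]
print(transition_probs(progressions)["I"])  # {'IV': 0.5, 'vi': 0.5}
```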
APA, Harvard, Vancouver, ISO, and other styles
46

Adjeisah, Michael, Guohua Liu, Douglas Omwenga Nyabuga, Richard Nuetey Nortey, and Jinling Song. "Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation." Computational Intelligence and Neuroscience 2021 (April 11, 2021): 1–10. http://dx.doi.org/10.1155/2021/6682385.

Full text
Abstract:
Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain through a low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often difficult to establish what a good-quality corpus looks like in low-resource conditions, especially where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we made extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on a diverse set of available parallel corpora demonstrate that injecting a pseudo-parallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits substantial gains in BLEU and TER scores.
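A hedged sketch of the squared-Mahalanobis filtering idea: score each sentence-pair feature vector by its distance from the mean of the pool and keep the pairs below a threshold. The feature vectors and the 80% cutoff below are toy stand-ins for whatever pair representation and threshold the pipeline actually uses.

```python
import numpy as np

def squared_mahalanobis(X):
    # Squared Mahalanobis distance of each row from the sample mean.
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))          # one row per sentence pair
scores = squared_mahalanobis(X)
keep = scores < np.percentile(scores, 80)  # retain the most "parallel" 80%
print(keep.sum(), "pairs kept")
```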
APA, Harvard, Vancouver, ISO, and other styles
47

Hocutt, Daniel, Nupoor Ranade, and Gustav Verhulsdonck. "Localizing Content: The Roles of Technical & Professional Communicators and Machine Learning in Personalized Chatbot Responses." Technical Communication 69, no. 4 (November 1, 2022): 114–31. http://dx.doi.org/10.55177/tc148396.

Full text
Abstract:
Purpose: This study demonstrates that microcontent, a snippet of personalized content that responds to users' needs, is a form of localization reliant on a content ecology. In contributing to users' localized experiences, technical communicators should recognize their work as part of an assemblage in which users, content, and metrics augment each other to produce personalized content that can be consumed by and delivered through artificial intelligence (AI)-assisted technology.
Method: We use an exploratory case study of an AI-driven chatbot to demonstrate the assemblage of user, content, metrics, and AI. By understanding assemblage roles and the function of the different units used to build AI systems, technical and professional communicators can contribute to microcontent development. We define microcontent as a localized form of content deployed by AI and quickly consumed by a human user through online interfaces.
Results: We identify five insertion points where technical communicators can participate in localizing content:
• Creating structured content for bots to better meet user needs
• Training corpora for bots with data-informed user personas that can better address the specific needs of user groups
• Developing chatbot user interfaces that are more responsive to user needs
• Developing effective human-in-the-loop approaches by moderating content to refine future human-chatbot interactions
• Creating more ethical and user-centered data practices with different stakeholders
Conclusion: Technical communicators should teach, research, and practice competencies and skills to advocate for localized users in assemblages of user, content, metrics, and AI.
APA, Harvard, Vancouver, ISO, and other styles
48

Gong, Jufang. "Analysis and Application of the Business English Translation Query and Decision Model with Big Data Corpus." Security and Communication Networks 2022 (September 8, 2022): 1–10. http://dx.doi.org/10.1155/2022/2714079.

Full text
Abstract:
This paper builds an English translation query and decision support model using a big data corpus and applies it to business English translation. First, the existing convolutional network is improved using depthwise separable convolution, and input sentences are mapped into a deep feature space. Second, an attention mechanism is used to enhance the expressive ability of input sentences in the deep feature space. Then, to capture sequential relationships, a long short-term memory (LSTM) neural network is used as the decoder block to generate the translation corresponding to the input sentence. Finally, a nonparametric metric learning module is used to improve the model end to end. Extensive experiments on multiple corpora show that the proposed model offers better real-time performance while maintaining high precision in translation and query, and that it has practical application value.
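The first step, replacing standard convolutions with depthwise-separable ones, is a standard construction: a per-channel (depthwise) convolution followed by a 1x1 (pointwise) mixing convolution. A minimal PyTorch sketch of that block, illustrative rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch makes each input channel convolve independently.
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(2, 64, 30)                         # (batch, channels, seq len)
print(DepthwiseSeparableConv1d(64, 128)(x).shape)  # torch.Size([2, 128, 30])
```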
APA, Harvard, Vancouver, ISO, and other styles
49

Clayton, Martin, Simone Tarsitani, Richard Jankowsky, Luis Jure, Laura Leante, Rainer Polak, Adrian Poole, et al. "The Interpersonal Entrainment in Music Performance Data Collection." Empirical Musicology Review 16, no. 1 (December 10, 2021): 65–84. http://dx.doi.org/10.18061/emr.v16i1.7555.

Full text
Abstract:
The Interpersonal Entrainment in Music Performance Data Collection (IEMPDC) comprises six related corpora of music research materials: Cuban Son & Salsa (CSS), European String Quartet (ESQ), Malian Jembe (MJ), North Indian Raga (NIR), Tunisian Stambeli (TS), and Uruguayan Candombe (UC). The core data for each corpus comprises media files and computationally extracted event onset timing data. Annotation of metrical structure and code used in the preparation of the collection is also shared. The collection is unprecedented in size and level of detail and represents a significant new resource for empirical and computational research in music. In this article we introduce the main features of the data collection and the methods used in its preparation. Details of technical validation procedures and notes on data visualization are available as Appendices. We also contextualize the collection in relation to developments in Open Science and Open Data, discussing important distinctions between the two related concepts.
APA, Harvard, Vancouver, ISO, and other styles
50

Liu, Yang, Qun Liu, and Shouxun Lin. "Discriminative Word Alignment by Linear Modeling." Computational Linguistics 36, no. 3 (September 2010): 303–39. http://dx.doi.org/10.1162/coli_a_00001.

Full text
Abstract:
Word alignment plays an important role in many NLP tasks as it indicates the correspondence between words in a parallel text. Although widely used to align large bilingual corpora, generative models are hard to extend to incorporate arbitrary useful linguistic information. This article presents a discriminative framework for word alignment based on a linear model. Within this framework, all knowledge sources are treated as feature functions, which depend on a source language sentence, a target language sentence, and the alignment between them. We describe a number of features that could produce symmetric alignments. Our model is easy to extend and can be optimized with respect to evaluation metrics directly. The model achieves state-of-the-art alignment quality on three word alignment shared tasks for five language pairs with varying divergence and richness of resources. We further show that our approach improves translation performance for various statistical machine translation systems.
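The core of the framework, scoring an alignment as a weighted sum of feature functions over (source sentence, target sentence, alignment), can be sketched in a few lines. The two feature functions below are toy illustrations, not the paper's feature set, and the weights are fixed rather than tuned against an evaluation metric as the paper proposes.

```python
import numpy as np

def features(src, tgt, alignment):
    # Toy features: number of links, and a crude "diagonal" monotonicity
    # score rewarding links whose relative positions roughly match.
    links = len(alignment)
    diag = sum(1.0 - abs(i / len(src) - j / len(tgt)) for i, j in alignment)
    return np.array([links, diag])

def score(weights, src, tgt, alignment):
    # Linear model: dot product of weights with the feature vector.
    return weights @ features(src, tgt, alignment)

w = np.array([0.1, 1.0])
src, tgt = ["le", "chat"], ["the", "cat"]
candidates = [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]
print(max(candidates, key=lambda a: score(w, src, tgt, a)))
```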
APA, Harvard, Vancouver, ISO, and other styles