Journal articles on the topic 'Corpus compilation'

Consult the top 50 journal articles for your research on the topic 'Corpus compilation.'


1

Alfraidi, Tareq, Mohammad A. R. Abdeen, Ahmed Yatimi, Reyadh Alluhaibi, and Abdulmohsen Al-Thubaity. "The Saudi Novel Corpus: Design and Compilation." Applied Sciences 12, no. 13 (June 30, 2022): 6648. http://dx.doi.org/10.3390/app12136648.

Abstract:
Arabic has recently received significant attention from corpus compilers. This situation has led to the creation of many Arabic corpora that cover various genres, most notably the newswire genre. Yet, Arabic novels, and specifically those authored by Saudi writers, lack sufficient digital datasets that would enhance corpus linguistic and stylistic studies of these works. Thus, Arabic lags behind English and other European languages in this context. In this paper, we present the Saudi Novels Corpus, built to be a valuable resource for linguistic and stylistic research communities. We specifically present the procedures we followed and the decisions we made in creating the corpus. We describe and clarify the design criteria, data collection methods, process of annotation, and encoding. In addition, we present preliminary results that emerged from the analysis of the corpus content. We consider the work described in this paper an initial step toward bridging the existing gap between corpus linguistics and Arabic literary texts. Further work is planned to improve the quality of the corpus by adding advanced features.
2

Castillo Rodríguez, Cristina, José María Díaz Lage, and Beatriz Rubio Martínez. "Compiling and analyzing a tagged learner corpus: a corpus-based study of adjective uses." Círculo de Lingüística Aplicada a la Comunicación 81 (February 21, 2020): 115–36. http://dx.doi.org/10.5209/clac.67932.

Abstract:
A learner corpus (LC) is widely known as a rich source of information regarding the use of expressions and the errors made by students in their productions. In fact, we, as teachers, can profit from the compilation of their tasks so as to analyze in detail their way of writing. However, the mere compilation of texts does not guarantee successful exploitation, as more steps than saving texts must be involved in the whole process. Therefore, it seems essential to follow a protocolized methodology of compilation. In this paper we propose five phases for compiling an LC containing texts from the spontaneous written productions of undergraduate and postgraduate students. The results obtained from exploiting the LC reveal errors in students’ productions regarding the use of plural, comparative and superlative forms of adjectives, as well as other failures detected in the tagging phase, most of which are due to students’ misuses.
3

Kwon, Heokseung. "English learner corpora and research in Korea." Corpora 17, Supplement (October 2022): 5–22. http://dx.doi.org/10.3366/cor.2022.0244.

Abstract:
The interest in the exploitation of corpora in the study of Korean L2 learners’ use of English has risen dramatically over the past two decades, leading to the compilation of learner corpora and to numerous empirical investigations into Korean learners’ use of English. This paper will give an overview of the compilation and characteristics of English learner corpora in Korea and will also provide an analysis of the recent trends in learner corpus research. It was not until the mid-2000s that Korean academics started to compile English learner corpora, such as the SNU Korean-speaking English Learner Corpus (SKELC), the Yonsei English Learner Corpus (YELC), the Gachon Learner Corpus (GLC), the Neungyule Interlanguage Corpus of Korean Learners of English (NICKLE), the EFL Teacher Corpus (ETC), the Korean English Learners’ Spoken Corpus (KELSC) and the ETS Corpus of Non-native Written English (TOEFL11). There have also been a growing number of learner corpus-based studies that used the existing learner corpora as well as self-compiled corpus data. All the learner corpus-based research articles published in two Korean academic journals (English Teaching and Korean Journal of Applied Linguistics) will be reviewed and analysed in terms of research topics and areas, data types, analysis methods and corpus compilation practices. Finally, this paper will suggest some future directions for learner corpus compilation and research in Korea.
4

Llaurado, Anna, Maria Antònia Martí, and Liliana Tolchinsky. "Corpus CesCa." International Journal of Corpus Linguistics 17, no. 3 (December 31, 2012): 428–41. http://dx.doi.org/10.1075/ijcl.17.3.06lla.

Abstract:
This paper outlines the compilation of a corpus of Catalan written production. The CesCa corpus presents a picture of the Catalan written language throughout compulsory schooling. It contains two kinds of data: Vocabularies of five semantic fields comprising 242,404 lexical forms and Textual data of four different discourse genres consisting of 207,028 tokens. Both vocabularies and the textual data have been morphologically analyzed and lemmatized. The corpus is freely available. This paper will outline the main features of the corpus and make some suggestions as to the uses to which the corpus can be put.
5

Monaco, Leida Maria, and Luis Puente-Castelo. "‘A matter both of curioſity and uſefulneſs’: Compiling the Corpus of English Texts on Language." Research in Corpus Linguistics 7 (2019): 47–68. http://dx.doi.org/10.32714/ricl.07.03.

Abstract:
This paper describes the compilation of CETeL, the subcorpus on ‘Language and Linguistics’ in the Coruña Corpus of English Scientific Writing, and discusses the various challenges encountered during the process of selection and digitisation of material. CETeL includes forty-four samples of texts on Language, Languages, and Linguistics from the period 1700–1900, and on completion will contain around 400,000 words. The paper will examine the historical context of academic writing in that period and the way in which this context affects the process of compilation. Likewise, the criteria followed in the compilation of the Coruña Corpus will be discussed in order to show the extent to which these criteria have affected the compilation of CETeL, and how they contribute towards making the corpus representative of the disciplinary practices of the period. Finally, the corpus will also be described according to a series of parameters used to assure representativeness and balance, namely the date of publication of samples, their genre, and the sex and linguistic background of their authors.
6

Ó Meachair, Mícheál J., Brian Ó Raghallaigh, Úna Bhreathnach, Gearóid Ó Cleircín, and Kevin Scannell. "Tiomsú Corpais don Taighde Foclóireachta: Corpas Foclóireachta na Gaeilge (CFG2020)." TEANGA, the Journal of the Irish Association for Applied Linguistics 28 (December 9, 2021): 278–305. http://dx.doi.org/10.35903/teanga.v28i.726.

Abstract:
This paper sets out the steps followed in the compilation of Corpas Foclóireachta na Gaeilge 2020 (CFG2020), a monolingual 77.3 million word Irish-language corpus. The context and circumstances of the project are explained, along with the motivation for various decisions made. The compilation and processing stages are described in detail. The contents of the corpus are outlined and the resource created to query CFG2020 is presented, along with reference to the kinds of analysis and research which it enables. CFG2020 was created as a first step towards a proposed larger corpus project, and suggestions for improvement and expansion are therefore proposed.
7

Travis, Catherine E., and Rena Torres Cacoullos. "Making Voices Count: Corpus Compilation in Bilingual Communities." Australian Journal of Linguistics 33, no. 2 (May 2013): 170–94. http://dx.doi.org/10.1080/07268602.2013.814529.

8

Loureiro-Porto, Lucía. "ICE vs GloWbE: Big data and corpus compilation." World Englishes 36, no. 3 (September 2017): 448–70. http://dx.doi.org/10.1111/weng.12281.

9

Ling Lee, Joanna Chiew, Phoey Lee Teh, Sian Lun Lau, and Irina Pak. "Compilation of malay criminological terms from online news." Indonesian Journal of Electrical Engineering and Computer Science 15, no. 1 (July 1, 2019): 355. http://dx.doi.org/10.11591/ijeecs.v15.i1.pp355-364.

Abstract:
A Malay language corpus has been established by the Institute of Language and Literature (Dewan Bahasa dan Pustaka, DBP in Malaysia). Most of the past research on the Malay language corpus has focused on the description, lexicography and translation of the Malay language. However, in the existing literature, there is no list of Malay words that categorizes crime terminologies. This study aims to fill that linguistic gap. First, we aggregated the most frequently used crime terminology words from Malaysian online news sources. Five hundred crime-related words were compiled. No automated tools were used in the initial process, but they were subsequently used to verify the data. Four human coders were used to validate the data and ensure the originality of the semantic understanding of the Malay text. Finally, major crime terminologies were outlined from a set of keywords to serve as taggers in our solution. The ultimate goal of this study is to provide a corpus for forensic linguistics, police investigations, and general crime research. This study has established the first corpus of a criminological text in the Malay language.
10

Faya-Cerqueiro, Fátima, and Gema Alcaraz-Mármol. "The Toledo Teacher Trainees corpus (TTT): Bridging the gap between students’ narratives and corpus linguistics." Research in Corpus Linguistics 8 (2020): 147–63. http://dx.doi.org/10.32714/ricl.08.01.10.

Abstract:
In recent decades a few research methods have resorted to L2 learners in order to analyse several aspects aiming at methodological improvements. One of them is corpus linguistics, which has largely contributed to the study of language production from a quantitative perspective. A very different one has been the compilation of perceptions of the L2 learning process using ‘narrative inquiry’ and qualitative methods of analysis. However, scholars have not addressed the combination of both methods. In this proposal we examine their main individual features and offer an interwoven line of research, applying the quantitative approach of corpus linguistics to the genre of language learning narratives. Thus, we present a new corpus of L2 learners’ perceptions and provide detailed information on its structure, compilation and categorisation. The interdisciplinary status of this proposal will enable the exploration of new research possibilities that can ultimately benefit the teaching-learning process.
11

Petran, Florian, Marcel Bollmann, Stefanie Dipper, and Thomas Klein. "ReM: A reference corpus of Middle High German -- corpus compilation, annotation, and access." Journal for Language Technology and Computational Linguistics 31, no. 2 (July 1, 2016): 1–15. http://dx.doi.org/10.21248/jlcl.31.2016.208.

12

Hong, Shinchul. "The BUFS Learner Corpus of Spoken English : Compilation and Applications." Journal of Language Sciences 29, no. 4 (November 30, 2022): 125–47. http://dx.doi.org/10.14384/kals.2022.29.4.125.

13

Genç-Yöntem, Ece, and Evrim Eveyik-Aydın. "The compilation of a developmental spoken English corpus of Turkish EFL learners." Research in Corpus Linguistics 10, no. 1 (2021): 45–62. http://dx.doi.org/10.32714/ricl.10.01.03.

Abstract:
Although compiling a spoken learner corpus is not a recent enterprise, the number of developmental learner spoken corpora in the field of corpus linguistics is not satisfactory. This report describes the compilation of the Yeditepe Spoken Corpus of Learner English (YESCOLE), a 119,787-word corpus of Turkish students’ spoken English at tertiary level. YESCOLE was compiled to generate a developmental corpus of spoken interlanguage by collecting samples from learners of different English proficiency levels at regular short intervals over seven months. In order to shed light on the laborious methodology of compiling the developmental spoken learner corpus, this paper elucidates the steps taken to build YESCOLE and discusses its potential benefits for research and instructional purposes.
14

Androutsopoulos, Jannis K. "Trends in Teenage Talk: Corpus Compilation, Analysis and Findings." Journal of Pragmatics 37, no. 4 (April 2005): 589–93. http://dx.doi.org/10.1016/j.pragma.2003.10.020.

15

Bogetić, Ksenija, Vuk Batanović, and Nikola Ljubešić. "Corpus compilation for digital humanities in lower–resourced languages: A practical look at compiling thematic digital media corpora in Serbian, Croatian and Slovenian." Suvremena lingvistika 48, no. 94 (December 30, 2022): 129–52. http://dx.doi.org/10.22210/suvlin.2022.094.01.

Abstract:
The digital era has unlocked unprecedented possibilities of compiling corpora of social discourse, which has brought corpus linguistic methods into closer interaction with other methods of discourse analysis and the humanities. Even when not using any specific techniques of corpus linguistics, drawing on some sort of corpus is increasingly resorted to for empirically–grounded social–scientific analysis (sometimes dubbed ‘corpus–assisted discourse analysis’ or ‘corpus–based critical discourse analysis’, cf. Hardt–Mautner 1995; Baker 2016). In the post–Yugoslav space, recent corpus developments have brought table–turning advantages in many areas of discourse research, along with an ongoing proliferation of corpora and tools. Still, for linguists and discourse analysts who embark on collecting specialized corpora for their own research purposes, many questions persist – partly due to the fast–changing background of these issues, but also due to the fact that there is still a gap in the corpus method, and in guidelines for corpus compilation, when applied beyond the anglophone contexts. In this paper we aim to discuss some possible solutions to these difficulties, by presenting one step–by–step account of a corpus building procedure specifically for Croatian, Serbian and Slovenian, through an example of compiling a thematic corpus from digital media sources (news articles and reader comments). Following an overview of corpus types, uses and advantages in social sciences and digital humanities, we present the corpus compilation possibilities in the South Slavic language contexts, including data scraping options, permissions and ethical issues, the factors that facilitate or complicate automated collection, and corpus annotation and processing possibilities. The study shows expanding possibilities for work with the given languages, but also some persistently grey areas where researchers need to make decisions based on research expectations. 
Overall, the paper aims to recapitulate our own corpus compilation experience in the wider context of South–Slavic corpus linguistics and corpus linguistic approaches in the humanities more generally.
16

Paquot, Magali, Tove Larsson, Hilde Hasselgård, Signe O. Ebeling, Damien De Meyere, Larry Valentin, Natalia J. Laso, Isabel Verdaguer, and Sanne van Vuuren. "The Varieties of English for Specific Purposes dAtabase (VESPA): Towards a multi-L1 and multi-register learner corpus of disciplinary writing." Research in Corpus Linguistics 10, no. 2 (2022): 1–15. http://dx.doi.org/10.32714/ricl.10.02.02.

Abstract:
The Varieties of English for Specific Purposes dAtabase (VESPA first release) is the result of an international corpus compilation project that aims to address the lack of large-scale, open access, multi-L1, multi-discipline and multi-register learner corpora. This corpus report provides a detailed description of VESPA and illustrates possible uses of the corpus for register exploration of learner data. Specifically, it first offers an overview of the makeup of the corpus and the online interface that can be used to search and download the corpus. It then gives an illustrative example of a study where multi-dimensional analysis was used to investigate the relative importance of register vis-à-vis other factors in learner academic writing. In the concluding remarks, we identify priorities for future developments in the VESPA project, including the addition of more L1 components, more disciplines and more registers, as well as the compilation of a comparable corpus of native student writing.
17

Säily, Tanja, and Jukka Tyrkkö. "Challenges of combining structured and unstructured data in corpus development." Research in Corpus Linguistics 9, no. 1 (2021): i–viii. http://dx.doi.org/10.32714/ricl.09.01.01.

Abstract:
Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, audio as well as structured metadata poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process, summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help keep the conversation going on the theoretical and methodological challenges of corpus compilation.
18

Ohashi, Yukiko, Noriaki Katagiri, Katsutoshi Oka, and Michiko Hanada. "ESP corpus design: compilation of the Veterinary Nursing Medical Chart Corpus and the Veterinary Nursing Wordlist." Corpora 15, no. 2 (August 2020): 125–40. http://dx.doi.org/10.3366/cor.2020.0191.

Abstract:
This paper reports on two research results: (1) designing an English for Specific Purposes (ESP) corpus architecture complete with annotations structured by regular expressions; and (2) a case study to test the design to cater for creating a specific vocabulary list using the compiled corpus. The first half of this study involved designing a precisely structured ESP corpus from 190 veterinary medical charts with a hierarchy of the data. The data hierarchy in the corpus consists of document types, outline elements and inline elements, such as species and breed. Perl scripts extracted the data attached to veterinary-specific categories, and the extraction led to creating wordlists. The second part of the research tested the corpus mode, creating a list of commonly observed lexical items in veterinary medicine. The coverage rate of the wordlists by General Service List (GSL) and Academic Word List (AWL) was tested, with the result that 66.4 percent of all lexical items appeared in GSL and AWL, whereas 33.7 percent appeared in none of those lists. The corpus compilation procedures as well as the annotation scheme introduced in this study enable the compilation of specific corpora with explicit annotations, allowing teachers to have access to data required for creating ESP classroom materials.
19

Blanco-Suárez, Zeltia, Francisco Gallardo-del-Puerto, and Evelyn Gandón-Chapela. "The Primary Education Learners’ English Corpus (PELEC): Design and compilation." Research in Corpus Linguistics 8 (2020): 147–63. http://dx.doi.org/10.32714/ricl.08.01.09.

Abstract:
This paper describes the process of design and compilation of the Primary Education Learners’ English Corpus (PELEC), a learner corpus which includes written (14,577 words) and spoken materials (47,032 words) from Primary Education learners in the Autonomous Community of Cantabria. It is composed of data from a total of 252 students in the fourth and sixth grade of Primary Education (aged 9–10 and 11–12, respectively) who were studying in five different state schools which followed either a Content and Language Integrated Learning (CLIL) or an English as a Foreign Language (EFL) approach.
20

Nam Kil Im. "Compilation of tentatively named ‘Korean Usage Dictionary’ using Learner corpus." Korean Language Research ll, no. 20 (June 2007): 131–54. http://dx.doi.org/10.16876/klrc.2007..20.131.

21

권혁승. "The SNU Korean Learner Corpus of English: Compilation and Application." English Language and Linguistics ll, no. 28 (December 2009): 203–28. http://dx.doi.org/10.17960/ell.2009..28.010.

22

Carrió-Pastor, María Luisa, and Rut Muñiz-Calderón. "The Compilation of a Corpus of Business English: Syntactic Variation." Procedia - Social and Behavioral Sciences 95 (October 2013): 89–95. http://dx.doi.org/10.1016/j.sbspro.2013.10.626.

23

Romer, Ute. "Trends in Teenage Talk: Corpus Compilation, Analysis and Findings (review)." Language 80, no. 4 (2004): 900–901. http://dx.doi.org/10.1353/lan.2004.0224.

24

남기심 and Hansaem Kim. "A Study on Practice of Korean Dictionary Compilation Using Corpus." Journal of Korealex ll, no. 30 (November 2017): 7–36. http://dx.doi.org/10.33641/kolex.2017..30.7.

25

Kwon, Miboon. "A Compilation of Hotel English Corpus and the Lexical Analysis." Journal of Language Sciences 23, no. 4 (November 30, 2016): 25–44. http://dx.doi.org/10.14384/kals.2016.23.4.025.

26

Hong, Yun, and Liu Lu. "Parallel Corpus in Chinese-English Dictionary Compilation and It’s Problems." International Journal of Humanities and Social Science 6, no. 3 (May 25, 2019): 17–22. http://dx.doi.org/10.14445/23942703/ijhss-v6i3p104.

27

Leiva Rojo, Jorge. "Diseño y compilación de corpus paralelos alineados: dificultades y (algunas) soluciones en el ejemplo de un corpus de textos museísticos traducidos (inglés-español)." Revista de Lingüística y Lenguas Aplicadas 13, no. 1 (July 13, 2018): 59. http://dx.doi.org/10.4995/rlyla.2018.7912.

Abstract:
Text corpora are tools with both a long tradition in research and a variety of applications. Of all existing types, this paper focuses specifically on parallel, aligned corpora. By taking one of these corpora as a starting point (a parallel, aligned corpus of museum texts originally written in English and subsequently translated into Spanish), the aim of this article is to propose a methodology that consists of four basic stages. Through a review of previous literature on the topic, and by using multiple software programs (proprietary and free, some created specifically for corpus compilation and others created for other purposes), it is concluded that, although the compilation of corpora such as the one that was intended is a feasible task, the procedure is full of obstacles. Some obstacles were overcome, while some were not; that is the case, for example, of the repetitions on the aligned corpus, which are not present in the corpus.
28

Wang, Jianxin. "Recent Progress in Corpus Linguistics in China." International Journal of Corpus Linguistics 6, no. 2 (December 31, 2001): 281–304. http://dx.doi.org/10.1075/ijcl.6.2.05wan.

Abstract:
This paper discusses some of the new developments in corpus linguistics in China. In the area of Chinese corpus compilation it presents large-scale text databases, representative corpora, annotated corpora, lexical databases for information processing, phonological, dialectal, spoken and other specialized corpora. In connection with the analysis and annotation of Chinese corpora, the characteristics of the Chinese language, word segmentation, tagging, parsing, and some corpus analytical systems are described. Concerning English corpus studies, some corpora of English as a Foreign Language and corpus-based research are depicted. On this basis tentative conclusions are drawn.
29

Pascual, Daniel, Pilar Mur-Dueñas, and Rosa Lorés. "Looking into international research groups’ digital discursive practices: Criteria and methodological steps taken towards the compilation of the EUROPRO digital corpus." Research in Corpus Linguistics 8, no. 2 (2020): 87–102. http://dx.doi.org/10.32714/ricl.08.02.05.

Abstract:
The EUROPRO digital corpus was designed by the InterGedi research group, based at the University of Zaragoza (Spain). The main focus of InterGedi is the analysis of the textual resources used by international research groups as part of their dissemination and visibility strategies. The corpus comprises a collection of 30 international research project websites funded by the European Horizon2020 Programme (EUROPROwebs corpus). By looking into their websites, 20 projects were observed to maintain a Twitter account and the tweets from these accounts were the basis for the compilation of the EUROPROtweets corpus. This paper delves into the criteria used for the selection of the research project websites and the methodological steps taken to classify, label and tag the verbal component in these websites and tweets. The paper discusses the challenges in the compilation of the corpus because of the dynamic, hypermodal, and hypermedial nature of the digital texts it contains. The paper closes by underlining the potential uses and applications of EUROPRO in order to gain insights into the digital discursive and professional practices used by international research groups to foster their visibility online.
30

Gamper, Johann, and Oliviero Stock. "Corpus-based terminology." Terminology 5, no. 2 (December 31, 1998): 147–59. http://dx.doi.org/10.1075/term.5.2.05gam.

Abstract:
The manual acquisition of terminological material from the domain-specific text material is a very time-consuming task. Recent advances in text-processing research provide a basis for automating this task. Computer-assisted term acquisition improves both the quantity and the quality of terminological work. This paper gives a brief overview of this new approach in terminology acquisition. Three subtasks are distinguished: compilation of an electronic text corpus, extraction of terminological data, and management of terminological data. Each of the subtasks will be discussed in some detail by identifying the core problems as well as proposed solutions. As a concrete initiative in this emerging field, we present an ongoing research project at the European Academy Bolzano, which illustrates the importance of computer-assisted terminology acquisition and of the resulting steps that have been taken in recent times. The paper concludes with a summary of five selected papers which have been presented at a workshop on corpus-based terminology in Bolzano. The full papers are published in this volume and in volume 4(2) of this journal.
31

Xu, Jiajin. "Corpus-based Chinese studies." Chinese Language and Discourse 6, no. 2 (December 30, 2015): 218–44. http://dx.doi.org/10.1075/cld.6.2.06xu.

Abstract:
This article reviews corpus-based Chinese studies, both applied and theoretical, from the 1920s to the present. It will be shown that, while corpus-based Chinese studies have been gaining momentum for only the last couple of decades, the roots of Chinese corpus linguistics go all the way back to the beginning of the 20th century. Today the bulk of corpus-based Chinese studies is oriented toward applied linguistics, with the compilation of frequency character/word lists and interlanguage Chinese studies being the most popular types of research. In addition to applied linguistic studies, this overview also highlights some innovative corpus studies on lexical and grammatical aspects of both classical and modern Chinese, as well as studies of sociolinguistic variation and discourse pragmatics. Overall, important groundwork in Chinese corpus linguistics is acknowledged and future directions are discussed.
32

Verdonik, Darinka, Iztok Kosem, Ana Zwitter Vitez, Simon Krek, and Marko Stabej. "Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS." Language Resources and Evaluation 47, no. 4 (January 29, 2013): 1031–48. http://dx.doi.org/10.1007/s10579-013-9216-5.

33

Volodina, Elena, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, et al. "The SweLL Language Learner Corpus." Northern European Journal of Language Technology 6 (December 20, 2019): 67–104. http://dx.doi.org/10.3384/nejlt.2000-1533.19667.

Abstract:
The article presents a new language learner corpus for Swedish, SweLL, and the methodology behind it, from collection and pseudonymisation (to protect the personal information of learners) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus. In the article we discuss the reasoning behind metadata selection and the principles of gold corpus compilation, and argue for the separation of normalization from correction annotation.
34

Berglund, Ylva. "Exploiting a Large Spoken Corpus." International Journal of Corpus Linguistics 4, no. 1 (August 13, 1999): 29–52. http://dx.doi.org/10.1075/ijcl.4.1.03ber.

Abstract:
The British National Corpus (BNC) contains a spoken component of about 10 million words, consisting of spoken language of various kinds produced by different speakers in a variety of situations. Starting from an end-user's perspective, this paper surveys the potential of this resource and some possible problems one might encounter if not fully versed in the details of the compilation and coding plans. Among the issues touched upon are questions relating to the composition of the component, the transcription principles employed, and points relating to the nature and coverage of the mark-up. By way of illustration, examples are drawn from a case study of the variant forms gonna and going to.
35

Goutsos, Dionysis. "The Corpus of Greek Texts: a reference corpus for Modern Greek." Corpora 5, no. 1 (May 2010): 29–44. http://dx.doi.org/10.3366/cor.2010.0002.

Abstract:
This paper reports on the construction of a reference corpus for Modern Greek, the Corpus of Greek Texts (CGT), that is currently being developed at the University of Athens. In particular, it points out the need for an authoritative corpus of Greek in view of the limitations of existing attempts to compile corpora for the language. It also presents the aims and identity of CGT with particular reference to its structure (composition of data and text classification). Questions of corpus design, which are particularly important with respect to available resources for Greek, are considered in relation to the issue of representativeness in material selection. The phases of implementation of CGT compilation are presented in detail. Finally, the larger implications of the project are detailed and applications, as well as prospects for further development, are outlined. Special mention is made of linguistic research papers on aspects of Greek that have used CGT data.
36

Izquierdo, Marlén, Knut Hofland, and Øystein Reigem. "The ACTRES parallel corpus: an English–Spanish translation corpus." Corpora 3, no. 1 (May 2008): 31–41. http://dx.doi.org/10.3366/e1749503208000051.

Abstract:
This paper describes the compilation of the ACTRES Parallel Corpus, an English–Spanish translation corpus built at the Department of Modern Languages at the University of León (Spain) by the ACTRES research group. The computerisation of the corpus was carried out in collaboration with Knut Hofland and Øystein Reigem, from the Department of Culture, Language and Information Technology, Aksis, at the UNIFOB/University of Bergen (Norway). The corpus is conceived as a powerful tool for cross-linguistic research in the fields of Contrastive Analysis and Descriptive Translation Studies. It was the need to bridge the gap between these disciplines and to extend applications that encouraged the building of a parallel corpus as a suitable tool to achieve these goals. This paper focusses on the practical aspects of building the corpus. A brief account of the research which prompted this endeavour precedes the description of this process. This paper is an account of the building of the ACTRES Parallel Corpus, so no empirical results from research done on the basis of the corpus are reported here. Concerning new insights drawn from the actual use of P-ACTRES in English–Spanish translation and contrastive projects, there is an extended bibliography at: http://actres.unileon.es/
37

Diemer, Stefan, Marie-Louise Brunner, and Selina Schmidt. "Compiling computer-mediated spoken language corpora." International Journal of Corpus Linguistics 21, no. 3 (September 19, 2016): 348–71. http://dx.doi.org/10.1075/ijcl.21.3.03die.

Abstract:
This paper discusses key issues in the compilation of spoken language corpora in a computer-mediated communication (CMC) environment, using data from the Corpus of Academic Spoken English (CASE), a corpus of Skype conversations currently being compiled at Saarland University, Germany, in cooperation with European and US partners. Based on first findings, Skype is presented as a suitable tool for collecting informal spoken data. In addition, new recommendations concerning data compilation and transcription are put forward to supplement existing best practice as presented in Wynne (2005). We recommend the preservation of multimodal features during anonymisation, and the addition of annotation elements already at the transcription stage, particularly CMC-related discourse features, English as a Lingua Franca (ELF) features (e.g. non-standard language and code-switching), as well as the inclusion of prosodic, paralinguistic, and non-verbal annotation. Additionally, we propose a layered corpus design in order to allow researchers to focus on specific annotation features.
38

Huang, Yinxia. "Compilation and Application of Translation-Tagged Corpus for Cross-Linguistic Study." Language and Information 21, no. 2 (July 31, 2017): 137–57. http://dx.doi.org/10.29403/li.21.2.7.

39

Matamala, Anna. "The VIW project." Revista Española de Lingüística Aplicada/Spanish Journal of Applied Linguistics 32, no. 2 (November 5, 2019): 515–42. http://dx.doi.org/10.1075/resla.17001.mat.

Abstract:
Following an overview of corpus linguistics in audiovisual translation, and more specifically in audio description, this article presents the VIW (Visuals Into Words) project and its resulting corpus. It describes the compilation and annotation processes, highlighting the main challenges found. The article also presents the web application that has been developed, explaining in detail various data visualisation and search possibilities.
40

Gusarova, Ekaterina V. "The St. Sisynnios Ethiopian Legend Revisited." Scrinium 15, no. 1 (July 16, 2019): 340–45. http://dx.doi.org/10.1163/18177565-00151p23.

Abstract:
The St. Sisynnios legend is an integral part of both Christian and popular Ethiopian historical traditions. It is known to exist in the Ge’ez language and constitutes a part of the compilation corpus based upon the so-called magic or protective scrolls. There are two versions of the vita of St. Sisynnios. The shorter one is found in the Synaxarion, whereas the longer one is included in a corpus of hagiographical compilations, “The Lives of the Martyrs”. The text of the legend comprises various stories based on real facts from the Saint’s life. However, only some of them have been preserved intact; others have been re-told. Until recently, only three redactions of the vita had been discovered. A new redaction recently discovered by the author of this article is of paramount importance, since it changes our view of how this legend existed in the Ethiopian cultural tradition.
41

Sánchez Ramos, María del Mar. "Corpus paralelos y traducción especializada: ejemplificación de diseño, compilación y alineación de un corpus paralelo bilingüe (inglés-español) para la traducción jurídica." Lebende Sprachen 64, no. 2 (November 5, 2019): 269–85. http://dx.doi.org/10.1515/les-2019-0015.

Abstract:
The article reports on the processing steps followed to build a bilingual parallel corpus (English-Spanish) as a resource for legal translation. The corpus, being in its initial stage of development, is made up of 127 and 145 aligned judgments referring to English and Spanish courts respectively. The corpus was aligned by using InterText alignment software, accounting for a total number of 29983 aligned sentence pairs. The paper describes the different design stages and the technical issues related to the compilation process.
42

Koeva, Svetla. "Bulgarian sense-annotated corpus – between the tradition and novelty." Cognitive Studies | Études cognitives, no. 12 (November 24, 2015): 181–98. http://dx.doi.org/10.11649/cs.2012.012.

Abstract:
The Bulgarian Sense-annotated Corpus (BulSemCor) is compiled according to the general methodology established by the SemCor project. It is a subset of the Brown Corpus of Bulgarian semantically annotated with a corresponding synonym set (synset) in the Bulgarian wordnet. Unlike the bulk of sense-annotated corpora where only (sets of) content words are annotated, in BulSemCor each lexical unit has been assigned a sense. The main contributions achieved in the work on BulSemCor are briefly described in the presented paper: definition of an annotation schema, compilation of an input corpus, development of a sense-annotated corpus, and Bulgarian wordnet enlargement.
43

Boyce, Mary. "Mana Aha? Exploring the Use of Mana in the Legal Māori Corpus." Victoria University of Wellington Law Review 42, no. 2 (August 1, 2011): 221. http://dx.doi.org/10.26686/vuwlr.v42i2.5136.

Abstract:
The Legal Māori Corpus (LMC) is one of several major outputs of the Legal Māori Project, and provides the core evidence for the compilation of the Legal Māori Dictionary, due to be completed in 2012. To our knowledge it is the largest publicly available corpus of te reo Māori. The LMC is comprised of 8 million words of running text, compiled from printed legal texts in te reo Māori spanning from the 1820s to the current day. The pre-1910 text collection (5.2 million words) from the LMC is now publicly available on the Victoria University of Wellington Law Faculty website. Those remaining texts (1.8 million words printed from 1910 onwards) that are able to be cleared of copyright and confidentiality restrictions will be released in 2012. This paper briefly outlines the context of the Legal Māori Project, describes the compilation and structure of the LMC, and then focuses in detail on the use of the word mana in the corpus. It identifies the common collocations and phrases that contain mana, and looks at their distribution over time.
44

Fernández-Cruz, Javier, and Antonio Moreno-Ortiz. "Building the Great Recession News Corpus (GRNC): A contemporary diachronic corpus of economy news in English." Research in Corpus Linguistics 8, no. 2 (2020): 28–45. http://dx.doi.org/10.32714/ricl.08.02.02.

Abstract:
The paper describes the process involved in developing the Great Recession News Corpus (GRNC), a specialized web corpus that contains a wide range of written texts obtained from the Business section of The Guardian and The New York Times between 2007 and 2015. The corpus was compiled as the main resource in a sentiment analysis project on the economic/financial domain. In this paper we describe its design, compilation criteria and methodological approach, as well as the overall creation process. Although the corpus can be used for a variety of purposes, we include a sentiment analysis study on the evolution of the sentiment conveyed by the word credit during the years of the Great Recession, which we think provides validation of the corpus.
45

Collins, Peter, and Xinyue Yao. "AusBrown: A new diachronic corpus of Australian English." ICAME Journal 43, no. 1 (March 1, 2019): 5–21. http://dx.doi.org/10.2478/icame-2019-0001.

Abstract:
This paper presents a newly-compiled diachronic corpus of Australian English (AusBrown). With four sampling time points (1931, 1961, 1991 and 2006), AusBrown is designed to match the current suite of British and American ‘Brown-family’ corpora in both sampling year and design. We provide details of the composition and compilation of AusBrown, and explore the broader context of its ‘Brown-family background’ and of complementary Australian corpora. We also overview research based on the Australian corpora presented, including several AusBrown-based papers.
46

Mello, Heliana, Amina Mettouchi, Marianne Mithun, Alessandro Panunzi, and Tommaso Raso. "Prosody and Corpora." Cadernos de Linguística 2, no. 1 (August 1, 2021): e385. http://dx.doi.org/10.25189/2675-4916.2021.v2.n1.id385.

Abstract:
This paper focuses on the experience of spoken corpora compilation and discusses the relevance of prosody in this type of endeavor, as well as in the study of spoken language in its several possibilities. Through the voices of scholars associated with four different projects (CorpAfroAs, Mohawk Corpus, LABLITA, C-ORAL-BRASIL), the steps considered of utmost relevance in both the compilation and research potential of spoken corpora are presented; additionally, perspectives for the field in the future are pointed out.
47

Wehrmeyer, Ella. "A corpus for signed language interpreting research." Interpreting. International Journal of Research and Practice in Interpreting 21, no. 1 (March 13, 2019): 62–90. http://dx.doi.org/10.1075/intp.00020.weh.

Abstract:
Because of the visual nature of signed language, the compilation of a signed language interpreting corpus along the lines of spoken-language interpreting corpora has been viewed as extremely challenging, if not impossible. This study offers a unique contribution in the construction of a lemmatized, annotated text-based corpus of signed language media interpretations, which allows analysis of interesting features using readily-available concordance software. In this article, characteristics of original (not interpreted) signed language corpora are explored in terms of metadata conventions, transcription and annotation, in order to provide a framework for an interpreting corpus. Within this framework, the decisions and steps taken in the construction of the interpreting corpus are discussed and explained.
48

Herry-Bénit, Nadine, Stéphanie Lopez, Takeki Kamiyama, and Jeff Tennant. "The interphonology of contemporary English corpus (IPCE-IPAC)." International Journal of Learner Corpus Research 7, no. 2 (October 11, 2021): 275–89. http://dx.doi.org/10.1075/ijlcr.20010.her.

Abstract:
This article presents the IPCE-IPAC corpus, an ongoing project, which has been collected in France, Italy, Spain and China since 2014. The data is collected to investigate the acquisition of segmental and suprasegmental phenomena by L2 learners of English, with a focus on phonemes. The article discusses the methods for the compilation of this original spoken learner corpus, designed to study L2 “interphonology” (Detey, Racine, Kawaguchi, & Zay, 2016), or interlanguage phonology.
49

Zoller, Robert. "The Hermetica as Ancient Science." Culture and Cosmos 1, no. 02 (October 1997): 23–34. http://dx.doi.org/10.46472/cc.0201.0211.

Abstract:
The Corpus Hermeticum (CH) is a compilation of philosophical, theosophical, mystical and cosmological texts dating, in their present form, from the third-fourth centuries C.E. and attributed, according to tradition, to Hermes Trismegistus. These texts exerted considerable influence upon western philosophers, scientific thinkers and mystics throughout the Middle Ages, especially influencing the Renaissance Neoplatonists. Scholarly discussions of the Hermetica once focused upon this Corpus, which some have felt contains lofty ideals and noble speculations.
50

Hernández, Nuria. "New media, new challenges: exploring the frontiers of corpus linguistics in the linguistics curriculum." Research in Corpus Linguistics 1 (2013): 17–31. http://dx.doi.org/10.32714/ricl.01.03.

Abstract:
This paper introduces a new corpus of computer-mediated communication which is currently being compiled at the University of Duisburg-Essen. Based on the experience from this project, the paper also discusses the possibility of implementing major issues in corpus construction into the academic curriculum of young linguists in the form of project-based learning. A variety of new challenges and possible solutions regarding the compilation and processing of new media language are presented.