Plenary Speakers IWoDA 2018

Marcus Callies (University of Bremen, Germany)

Title of the paper: Learner Corpus Research and the assessment of L2 proficiency: Current practice and challenges for the future


Proficiency is a complex and multidimensional construct that underlies the teaching, learning, research and assessment of foreign languages. For SLA research, proficiency measures are crucial because a) without them, meaningful interpretation of research results remains elusive, and b) proficiency has been shown to affect the systematicity and variability of learner language. Measures should thus be valid, reliable and practical (Leclercq & Edmonds 2014: 10-11). However, proficiency is sometimes inadequately assessed, thereby limiting the generalizability of results. This is particularly true of global proficiency measures, such as learners’ institutional status, assessment on the basis of holistic rating scales by human raters, or learners’ scores on standardized tests where learner output is constrained by the respective task(s) (Thomas 1994, 2006).

In this talk, I will address recent developments at the interface of Learner Corpus Research (LCR) and Language Testing and Assessment (e.g. Callies & Götz 2015). LCR is a fairly recent computational approach to testing and assessing L2 proficiency but it has great potential to inform, supplement and possibly advance the way proficiency is operationalized and measured. I will first critically review how the construct of proficiency has been dealt with in learner corpus compilation and analysis, and then outline how learner corpora can contribute to current practices of measuring learners’ proficiency level by adopting a text-centred, data-driven approach that is partially independent of human rating (Callies, Diez-Bedmar & Zaytseva 2014).

I will then present a case study of the assessment of writing proficiency in the academic register based on the Corpus of Academic Learner English (CALE; Callies & Zaytseva 2013) that includes various text types produced by learners of English as a Foreign Language (EFL) in a university setting. Writing proficiency in the academic register will be operationalized by means of quantifiable linguistic descriptors based on texts produced by native and learner writers of academic English. A corpus-informed identification and corpus-based implementation of well-known characteristics of academic English are combined with a corpus-driven assessment of proficiency, accounting for inter-learner variability.


Callies, M. & S. Götz, eds. (2015). Learner Corpora in Language Testing and Assessment. Amsterdam: Benjamins.

Callies, M. & Zaytseva, E. (2013), The Corpus of Academic Learner English (CALE) – A new resource for the assessment of writing proficiency in the academic register. Dutch Journal of Applied Linguistics 2(1), 126-132.

Callies, M., Diez-Bedmar, M.B. & Zaytseva, E. (2014), Using learner corpora for testing and assessing L2 proficiency. In Leclercq, P., Hilton, H. & Edmonds, A. (eds.), Measuring L2 Proficiency: Perspectives from SLA. Clevedon: Multilingual Matters, 71-90.

Leclercq, P. & Edmonds, A. (2014). How to assess L2 proficiency? An overview of proficiency assessment research. In Leclercq, P., H. Hilton & A. Edmonds (eds.), Measuring L2 proficiency: Perspectives from SLA (Second Language Acquisition series). Clevedon: Multilingual Matters, 3-23.

Thomas, M. (1994). Assessment of L2 proficiency in second language acquisition research. Language Learning 44:2, 307–336.

Thomas, M. (2006). Research synthesis and historiography: The case of assessment of second language proficiency. In J.M. Norris & L. Ortega (eds.), Synthesizing Research on Language Learning and Teaching. Amsterdam: Benjamins, 279-298.


Marcus Callies studied history and English at the University of Marburg, Germany, where he obtained his first teaching certificate for German secondary schools in 2000 and a PhD in English linguistics in 2006. He worked as lecturer and assistant professor at the universities of Freiburg, Mainz and Bremen in Germany. Since 2014 he has been full professor of English Linguistics at the University of Bremen. One of his main research interests is learner corpus research with a focus on lexico-grammatical variation, discourse-functional and pragmatic aspects of advanced learner varieties and English for Academic Purposes. Marcus is the main compiler of the Corpus of Academic Learner English, a specialised corpus of academic learner writing for a detailed, empirical, quantitative and qualitative description of advanced learner writing in the academic register. He serves as co-editor of the International Journal of Learner Corpus Research and vice-president of the Learner Corpus Association.

Liesbeth Degand (Université Catholique de Louvain)

Title of the paper: Discourse Markers as (dis)fluency markers


Different features can contribute to the fluency (or disfluency) of discourse, including speech rate, (filled and silent) pauses, repetitions, false starts and discourse markers. In our approach, (dis)fluency is defined as i) componential (fluency can be observed as combinations or sequences of fluencemes), ii) situational (the production and perception of fluency are highly influenced by contextual factors) and iii) ambivalent (the same feature can be either fluent or disfluent depending on its local and global context).

In this presentation, I will focus on the interaction between two types of fluencemes (Götz 2013), namely discourse markers and filled pauses (Crible, Degand & Gilquin 2017). Discourse markers are generally viewed as contributing to speakers’ fluency (Hasselgren 2002; Müller 2005; Götz 2013), although some are stigmatised as informal, disfluent elements of speech. Similarly, filled pauses, while said to encode hesitations and difficulties, have also been shown to help speech production and processing (O’Connell & Kowal 2009). Tentative interpretations of their role as either fluency signals or disfluency symptoms will be drawn from the synthesis of our corpus-based observations. The outcome of this research will help us determine whether (certain types of) discourse markers are more prominent as fluency markers than others.


Crible, Ludivine, Liesbeth Degand, and Gaëtanelle Gilquin. 2017. “The Clustering of Discourse Markers and Filled Pauses: A Corpus-Based French-English Study of (Dis)Fluency.” Languages in Contrast 17 (1): 69–95.

Götz, S. 2013. Fluency in Native and Nonnative English Speech. Amsterdam: John Benjamins.

Hasselgren, A. 2002. Learner corpora and language testing: Small words as markers of learner fluency. In S. Granger, J. Hung & S. Petch-Tyson (eds), Computer-Learner Corpora, Second Language Acquisition, and Foreign Language Teaching, Philadelphia, John Benjamins: 143-173.

Müller, S. 2005. Discourse Markers in Native and Non-native English Discourse. Amsterdam: John Benjamins.

O’Connell, Daniel, and Sabine Kowal. 2009. Communicating with One Another: Toward a Psychology of Spontaneous Spoken Discourse. Springer Science & Business Media.

Liesbeth Degand is a professor of General and Dutch linguistics at the University of Louvain (UCLouvain, Belgium). She holds a PhD from the same university (1997). Her research is carried out within the Institute for Language and Communication, which she headed for six years (2009-2015). She has led and participated in several international and national research projects in the areas of spoken and written discourse structure, grammaticalization and intersubjectification, discourse annotation, and fluency and disfluency markers. She was the chair of the European COST network TextLink (2014-2018), which aimed to bring together functional-cognitive and computational work on the annotation of discourse relational devices in more than 20 different languages. Her publications reflect her research interests in discourse annotation, spoken discourse segmentation, the semantics and pragmatics of discourse markers, and contrastive (corpus) linguistics, with a focus on the interface between discourse and grammar.

Pascual Pérez-Paredes (Cambridge University)

Title of the paper: Learner language research beyond contrastive interlanguage analysis: rethinking epistemology


Contrastive interlanguage analysis (CIA) has allowed researchers to tap into how language learners use their L2 or L3 by examining the frequency of different discrete linguistic features. The rationale behind such analysis is that L1 groups of learners show distinctive distributional features that can help researchers understand L1-L2 interfaces, general communication features in an L2 or, among others, language development at different competence levels. Arguably, CIA has attracted limited interest outside the corpus linguistics community as SLA research and most language education theories have generally failed to appreciate the relevance of this type of research in their own debates about language learning (Gablasova, Brezina & McEnery, 2017). I maintain that the over-stress on the learner’s mother tongue as the factor that has been most discussed in learner corpus research (Paquot & Granger, 2012) may have discouraged SLA researchers from using corpora and corpus-driven findings. In this sense, Myles (2015) has suggested that SLA research and SLA theories have “more sophisticated agendas”.

I will discuss two research projects that combine CIA methods with other research methods. The first study (Pérez-Paredes & Díez-Bedmar, 2018) adopts a parallel sequential design in which different methods (POS keyness (Rayson, 2008, 2009) and automatic analysis of syntactic sophistication (Kyle, 2016)) query the data independently. This study sets out to characterize the writing of young Spanish EFL learners in different instructed settings by looking at naturally occurring language use in a set of essays on the same topic. A subset of the International Corpus of Crosslinguistic Interlanguage (ICCI) (Tono & Díez-Bedmar, 2014) was used for the analysis. The second study (O’Keeffe, Pérez-Paredes & Mark, 2018) adopted Ellis, Römer & O’Donnell’s (2016) usage-based approach to language acquisition and examined the development of Verb Argument Constructions (VACs) across EFL performance levels (A2, B2, C2) in the Cambridge Learner Corpus, a 55-million-word corpus of learner exam data drawn from over 200,000 exam scripts, across 200 countries, written by candidates from over 140 first language backgrounds. The use of syntactic pattern analyses allowed the researchers both to examine units of analysis that go beyond isolated lexical items and to track how VACs evolve across language development.

In this talk, I will argue that learner corpus research needs to re-focus its epistemology and strengthen the use of what I call general corpus research methods. Traditional CIA-related findings and, in particular, an over-reliance on analysis of errors or “non-native” speaker underperformance need to be re-examined so as to go beyond the limitations of CIA and contribute to the body of data of interest to SLA researchers outside the corpus linguistics community. 


Gablasova, D., Brezina, V. & McEnery, T. (2017). Exploring learner language through corpora: comparing and interpreting corpus frequency information. Language Learning 67(S1):130-154.

Ellis, N. C., Römer, U. & O’Donnell, M. B. (2016). Usage-based Approaches to Language Acquisition and Processing: Cognitive and Corpus Investigations of Construction Grammar. Language Learning Monograph Series. Wiley-Blackwell.

Myles, F. (2015). Second language acquisition theory and learner corpus research. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge Handbook of Learner Corpus Research (Cambridge Handbooks in Language and Linguistics, pp. 309-332). Cambridge: Cambridge University Press.

O’Keeffe, A., Pérez-Paredes, P. & Mark, G. (2018). The English Grammar Profile: Investigating Patterns of Learner Grammar Development. Presentation at the American Association for Applied Linguistics 2018 Conference, Chicago, 24-27 March.

Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication. PhD Dissertation, Georgia State University. URL: (1 February, 2018)

Paquot, M., & Granger, S. (2012). Formulaic Language in Learner Corpora. Annual Review of Applied Linguistics, 32, 130-149.

Pérez-Paredes, P. & Díez-Bedmar, B. (2018). Researching learner language through POS Keyword analysis and syntactic complexity. In S. Götz & J. Mukherjee (eds.), Learner Corpora and Language Teaching. Studies in Corpus Linguistics Series. Amsterdam: John Benjamins.

Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics 13(4): 519-549.

Rayson, P. (2009). Wmatrix: a web-based corpus-processing environment, Computing Department, Lancaster University. URL: (1 February, 2018)

Tono, Y. & Díez-Bedmar, B. (2014). Focus on learner writing at the beginning and intermediate stages: The ICCI corpus. International Journal of Corpus Linguistics 19(2): 163-177.

Pascual Pérez-Paredes is a Lecturer in Research in Second Language Education at the Faculty of Education, University of Cambridge. His main research interests are learner language variation, the use of corpora in language education and corpus-assisted discourse analysis. He has published research in journals such as the International Journal of Corpus Linguistics, CALL, Language Learning & Technology, System, ReCALL and Discourse & Society. He is a member of the editorial board of, among others, Register Studies (John Benjamins).

Paul Rayson (Lancaster University)

Title of the workshop: Customisable semantic analysis methods for discourse analysis in Wmatrix


This 4-hour lab-based practical workshop will focus on semantic analysis methods from the world of Natural Language Processing and how they can improve on word-based methods from Corpus Linguistics. We will see how this combined approach can be applied to the discourse analysis of a variety of texts and for a variety of purposes, and what advantages the semantic analysis level offers over previous word-level approaches in homing in on linguistically meaningful units such as multiword expressions and constructions. Part of the workshop time will be devoted to presentation-led, computer-based activities that take participants through a series of tutorials to familiarise them with the UCREL Semantic Analysis System (USAS) taxonomy and the methods and techniques available in the Wmatrix web-based corpus annotation and retrieval software. A new version, Wmatrix4, will be introduced which permits semantic analysis in English and other languages supported by the multilingual USAS taggers, which are the result of research led with Scott Piao at Lancaster University and contributions from numerous other scholars around the world. As well as working through case studies of political discourse using UK general election manifesto data, workshop participants will be able to bring and analyse their own corpora and will be guided through the steps required for file conversion and preparation for analysis in Wmatrix and other corpus linguistics software. Although the USAS tagger provides wide coverage of English and other languages, it may miscategorise domain-specific terminology. We will describe and experiment with two ongoing projects that address this. The first aims to systematically update the USAS dictionaries with new terminology and unknown word senses (joint research with Sheryl Prentice at Lancaster University). The second aims to allow more user-customisable dictionaries and semantic taxonomy updates, which will permit other types of profiling, for example of learner language, via the new My Dictionaries feature in Wmatrix (joint research with Hiroko Usami, Tokai University, Japan).


Archer, D., Rayson, P., Piao, S., McEnery, T. (2004). Comparing the UCREL Semantic Annotation Scheme with Lexicographical Taxonomies. In Williams G. and Vessier S. (eds.) Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress (Euralex 2004), Lorient, France, 6-10 July 2004. Université de Bretagne Sud. Volume III, pp. 817-827.

Piao, S., Bianchi, F., Dayrell, C., D’Egidio, A. and Rayson, P. (2015). Development of the multilingual semantic annotation system. In proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), Denver, Colorado, United States, pp. 1268-1274.

Piao, S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R.-M., Knight, D., Kren, M., Löfberg, L., Nawab, R., Shafi, J., Teh, P-L., and Mudraya, O. (2016) Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages. In proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC2016), Portoroz, Slovenia, pp. 2614-2619.

Rayson, P. (2008). From key words to key semantic domains. International Journal of Corpus Linguistics. 13:4 pp. 519-549.

Rayson, P. (2009) Wmatrix: a web-based corpus processing environment, Computing Department, Lancaster University.

Paul Rayson is the director of the UCREL research centre and a Reader in the School of Computing and Communications at Lancaster University, UK. A long-term focus of his work is the application of semantic-based Natural Language Processing and Corpus Linguistics methods in extreme circumstances where language is noisy, e.g. in historical, learner, speech, email, txt and other CMC varieties. His applied research is in the areas of dementia detection, online child protection, cyber security, learner dictionaries, and text mining of historical and biomedical corpora and annual financial reports. He was a co-investigator of the five-year ESRC Centre for Corpus Approaches to Social Science (CASS), which was designed to bring the corpus approach to bear on a range of social sciences. He is also a member of the multidisciplinary centres Security Lancaster, Lancaster Digital Humanities and the Data Science Institute.