Welcome to BangorTalk

The ESRC Centre

This site holds the conversational corpora assembled by the former ESRC Centre for Research on Bilingualism in Theory & Practice at the University of Wales Bangor.

We are seeking to gain a greater understanding of how bilingual individuals in a variety of communities manage both their languages within the same conversation.

The questions we consider include:

  1. Do bilinguals in different types of communities handle their two languages in different ways in conversation?
  2. How do social variables such as class, age and gender affect the way people handle their two languages in informal conversations?

The corpora

To date, we have assembled three corpora: Siarad, Patagonia and Miami.

Summary data for each corpus:
Welsh | English | Spanish | indeterminate | Total (words)

2019 update: tsv and sql files for words in a conversation

The menu page for each conversation now includes two new download links giving access to tab-separated files and compressed PostgreSQL table dumps for the word data.
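
As a rough illustration of how a downloaded tab-separated file might be explored, the sketch below counts words per language tag using only the Python standard library. The column names (`utterance_id`, `surface`, `langid`) and the sample rows are assumptions made for this example, not the documented layout of the files, so check the header row of a real download before adapting it.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_text, column):
    """Count how often each value occurs in one column of a TSV document."""
    # QUOTE_NONE treats quote characters in transcribed words as literal text.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return Counter(row[column] for row in reader)


# Invented sample data; the real column names may differ.
SAMPLE_TSV = (
    "utterance_id\tsurface\tlangid\n"
    "1\tmae\tcym\n"
    "1\thi\tcym\n"
    "2\tyeah\teng\n"
)

counts = count_by_column(SAMPLE_TSV, "langid")
for language, total in sorted(counts.items()):
    print(language, total)
```

For a real file, the text can be read with `open(path, encoding="utf-8", newline="").read()` and passed to the same function.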

2018 update: New book published on the Siarad corpus

Building and Using the Siarad Corpus: Bilingual conversations in Welsh and English (Margaret Deuchar, Peredur Webb-Davies, and Kevin Donnelly) has been published by John Benjamins. The first part of the book describes the methods used to build the first sizeable corpus of informal conversational data collected from bilingual speakers of Welsh and English: Siarad. The second part describes the linguistic analysis of data from this corpus.

2018 update: Linear versions of the Siarad files

The .cha format is tiered, with each line in the file reflecting a different attribute of the text. To help those using simple concordancers, a linear version of the files is now available, in three flavours:

The links above will download a tarball of all the files.
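
To show what "tiered" means in practice, here is a minimal sketch of how a .cha file could be flattened into one line per utterance. It relies only on the general CHAT conventions that header lines begin with `@`, main speaker lines with `*`, dependent tiers with `%`, and continuation lines with a tab. The sample transcript is invented, and the output format is an assumption for illustration; it is not a reproduction of the linear versions offered on this site.

```python
from typing import Dict, List, Sequence


def parse_chat(text: str) -> List[Dict[str, str]]:
    """Group a tiered CHAT transcript into one record per utterance."""
    utterances: List[Dict[str, str]] = []
    last_key = None  # tier that a continuation line should extend
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            last_key = None  # header or blank line: nothing to extend
            continue
        if line.startswith("*"):
            # Main tier, e.g. "*ABC:<tab>words ..."
            speaker, _, content = line[1:].partition(":")
            utterances.append({"speaker": speaker, "main": content.strip()})
            last_key = "main"
        elif line.startswith("%") and utterances:
            # Dependent tier, attached to the most recent utterance
            tier, _, content = line[1:].partition(":")
            utterances[-1][tier] = content.strip()
            last_key = tier
        elif line.startswith("\t") and utterances and last_key:
            # Continuation of the previous tier
            utterances[-1][last_key] += " " + line.strip()
    return utterances


def linearise(utterances: List[Dict[str, str]],
              tiers: Sequence[str] = ("main",)) -> List[str]:
    """Render each utterance as a single line containing the chosen tiers."""
    lines = []
    for utt in utterances:
        parts = [utt[t] for t in tiers if t in utt]
        lines.append(utt["speaker"] + ": " + " | ".join(parts))
    return lines


# Invented sample, not taken from the corpora.
SAMPLE = (
    "@Begin\n"
    "*ABC:\tmae hi 'n braf heddiw .\n"
    "%eng:\tit is nice today .\n"
    "*DEF:\tyeah , lovely .\n"
    "@End\n"
)

utts = parse_chat(SAMPLE)
for row in linearise(utts, ("main", "eng")):
    print(row)
```

Keeping the parsing and rendering steps separate means the same records can be written out with or without the dependent tiers, which is the kind of choice a simple concordancer user is likely to want.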

2013 update: Search for a word across all conversations

A new search page is available, which returns 20 instances of a word from all conversations in the Siarad or Patagonia corpora. The conversations are combined into one file, but some information such as glosses and (optionally) transcription marking is removed.


A number of publications and presentations have resulted from mining the corpora for the linguistic information they contain.


The researchers have received input and assistance from a variety of collaborators around the world. We have also received help in translating the Miami corpus from a number of people, listed on this page.


Our corpus material is transcribed and annotated using the CHAT format and CLAN software developed by Prof Brian MacWhinney and Leonid Spektor at Carnegie Mellon University. Our Siarad data is also available via the Talkbank portal (although the version there differs slightly from the one on this website).

To gloss the Miami and Patagonia corpora we are using autoglossing software we have developed in-house. To mine all three corpora we are using a variety of techniques, including the output from the autoglosser.


The ESRC Centre has collected these materials following the ethical guidelines set out in the Talkbank Code of Ethics.


The material on Talkbank and on this site is available under the Free Software Foundation's General Public License. This means that you can access it freely and use it however you like. We would be grateful, however, if any such use could also acknowledge the ESRC Centre.




The support of the Arts and Humanities Research Council (AHRC), the Economic and Social Research Council (ESRC), the Higher Education Funding Council for Wales (HEFCW) and the Welsh Government is gratefully acknowledged.