Welcome to BangorTalk

The ESRC Centre

This site holds the conversational corpora assembled by the former ESRC Centre for Research on Bilingualism in Theory & Practice at University of Wales Bangor.

We are seeking to gain a greater understanding of how bilingual individuals in a variety of communities manage both their languages within the same conversation.

The questions we consider include:

Do bilinguals in different types of communities handle their two languages in different ways in conversation?
How do social variables such as class, age and gender affect the way people handle their two languages in informal conversations?

The corpora

To date, we have assembled three corpora:

The Siarad Welsh-English corpus (documentation, questionnaire data). The data here is version 1.5 and includes both manual and automatically generated glosses. Version 1.0 of the Siarad corpus was originally distributed on CD in 2009; its texts, which include manual glosses only, are downloadable here. Version 2.0 of the Siarad corpus, which includes corrections, emendations and improved autoglossing, is in preparation. The working files are available in a GitHub repository.
The Patagonia Welsh-Spanish corpus
The Miami Spanish-English corpus (documentation, questionnaire data).

Summary data for each corpus:

Corpus	Welsh	English	Spanish	Indeterminate	Total (words)
Siarad	84%	4%	---	13%	447507
Patagonia	78%	<0.5%	17%	5%	193102
Miami	---	63%	34%	3%	235871
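
Because the percentages above are rounded, only approximate word counts per language can be recovered from them. The Python sketch below shows that arithmetic. The function name approx_words is ours, not part of any corpus tooling. Entries given as "---" or as a bound such as "<0.5%" are treated as unknown rather than guessed.

```python
def approx_words(total_words, share):
    """Estimate a word count from a rounded percentage string such as '84%'.

    Returns None for '---' (language not part of the corpus) and for
    bounds such as '<0.5%', where only an upper limit is known.
    """
    share = share.strip()
    if share == "---" or share.startswith("<"):
        return None
    percent = float(share.rstrip("%"))
    return round(total_words * percent / 100)


# Figures copied from the summary table.
siarad_welsh = approx_words(447507, "84%")
miami_english = approx_words(235871, "63%")
patagonia_english = approx_words(193102, "<0.5%")
print(siarad_welsh, miami_english, patagonia_english)
```

The results are estimates only: a rounded 84% could stand for anything from 83.5% to 84.5% of the total.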

Licensing

The material on this site is all available under the GNU General Public License v3 (or later), published by the Free Software Foundation. This means it can be used freely, and adapted and extended as the user requires, provided that any derived version that is distributed carries the same GPLv3 (or later) licence. We would be grateful, however, if derived versions could acknowledge the ESRC Centre.

Siarad v1.5 (the autoglossed version on this website) is additionally licensed under the CC-BY-SA licence, requiring attribution, with the same licence being used for derived versions.

Patagonia, Miami, and Siarad v1.0 (the original version with manual glosses only, downloadable here) are additionally licensed under the CC-BY licence, requiring attribution.

The equivalent material on Talkbank is all available under the CC-BY-NC-SA licence, requiring attribution, no commercial use, and the same licence being used for derived versions.

The choice of licence for each corpus (GPLv3 or later, Creative Commons) is left to the user.

2019 update: New licensing options for Siarad

The Siarad corpus was originally published on CD in 2009, under the GPLv2 licence. In this version, Siarad v1.0, the transcripts contained only manual glosses. As of August 2019, this manually-glossed version (download) is being made available under a dual licence, GPLv3 (or later) or CC-BY, with the choice of licence left to the user.

The version on this website, Siarad v1.5, contains both manual glosses and autoglosses. As of August 2019, this autoglossed version is being made available under a dual licence, GPLv3 (or later) or CC-BY-SA, with the choice of licence left to the user.

2019 update: New licensing options for Patagonia and Miami

As of August 2019, the Patagonia and Miami corpora are being made available under a dual licence, GPLv3 (or later) or CC-BY, with the choice of licence left to the user.

2019 update: tsv and sql files for words in a conversation

The menu page for each conversation now includes two new download links giving access to tab-separated files and compressed PostgreSQL table dumps for the word data.
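
As a rough illustration of how a downloaded tab-separated file might be explored, the Python sketch below counts the values in one column. It is only a sketch under stated assumptions: we assume the file starts with a row of column names, and the column names "surface" and "langid" in the sample are hypothetical, so check the actual download before relying on them.

```python
import csv
import io
from collections import Counter


def count_by_column(tsv_file, column):
    """Count how often each value occurs in one column of a tab-separated file.

    Assumes the first row holds column names; adjust if the download does not.
    """
    # QUOTE_NONE treats any quote characters in the data as ordinary text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row[column] for row in reader)


# Made-up sample standing in for a downloaded file; the column names
# 'surface' and 'langid' are hypothetical.
sample = io.StringIO("surface\tlangid\nmae\tcym\nhi\tcym\nnice\teng\n")
counts = count_by_column(sample, "langid")
print(counts.most_common())
```

For a real file, the same function can be given a handle opened with open(path, newline="", encoding="utf-8"), assuming the file is UTF-8 encoded.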

2018 update: New book published on the Siarad corpus

Building and Using the Siarad Corpus: Bilingual conversations in Welsh and English (Margaret Deuchar, Peredur Webb-Davies, and Kevin Donnelly) has been published by John Benjamins. The first part of the book describes the methods used to build the first sizeable corpus of informal conversational data collected from bilingual speakers of Welsh and English: Siarad. The second part describes the linguistic analysis of data from this corpus.

2018 update: Linear versions of the Siarad files

The .cha format is tiered, with different lines in the file reflecting different attributes of the text. To help those using simple concordancers, a linear version of the files is now available, in three flavours:

Top tier only - the "surface" tier of the text, without any header information, gloss lines, comments, etc.
Autoglossed - the automatic glosses are appended to the word, separated by a "£" sign; any markings (e.g. hesitations, backtracking) are not glossed.
Human - the manual glosses are appended to the word, separated by a "£" sign; any markings (e.g. hesitations, backtracking) are not glossed.

The links above will download a tarball of all the files.
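
For the two glossed flavours, the word and its gloss can be pulled apart at the "£" sign. The Python sketch below shows one way to do this. It assumes tokens are separated by whitespace, and the sample line and its glosses are made up for illustration rather than taken from the corpus.

```python
def split_gloss(token, separator="£"):
    """Split one token of a linear file into (word, gloss).

    Tokens without a separator (for example transcription markings,
    which are not glossed) are returned with a gloss of None.
    """
    word, found, gloss = token.partition(separator)
    return (word, gloss if found else None)


# Made-up line in the style described above; the glosses are illustrative.
line = "mae£be.V.3S.PRES hi£PRON.F.3S [/] yn£PRT nice£nice.ADJ"
pairs = [split_gloss(token) for token in line.split()]
print(pairs)
```

Tokens that come back with a gloss of None can be kept or filtered out, depending on whether the markings matter for the search at hand.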

2013 update: Search for a word across all conversations

A new search page is available, which returns 20 instances of a word from all conversations in the Siarad or Patagonia corpora. The conversations are combined into one file, but some information such as glosses and (optionally) transcription marking is removed.

Publications

A number of publications and presentations have resulted from mining the corpora for the linguistic information they contain.

Collaborators

The researchers have received input and assistance from a variety of collaborators around the world. We have also received help in translating the Miami corpus from a number of people, listed on this page.

Tools

Our corpus material is transcribed and annotated using the CHAT and CLAN applications developed by Prof Brian MacWhinney and Leonid Spektor at Carnegie Mellon University. Our Siarad data is also available via the Talkbank portal (although the version there differs slightly from the one on this website).
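
For readers who want to process CHAT files outside CLAN, the Python sketch below pulls out the speaker lines only. It relies on general CHAT conventions: header lines start with "@", main (speaker) tiers with "*", dependent tiers with "%", and a line starting with a tab continues the previous line. The function name and the sample transcript, including the "%gls" tier label, are ours and purely illustrative.

```python
def main_tier_lines(chat_lines):
    """Yield (speaker, utterance) for each main-tier line of a CHAT transcript."""
    current = None   # (speaker, text) of the main tier being collected
    in_main = False  # True while the last tier seen was a main tier
    for raw in chat_lines:
        line = raw.rstrip("\n")
        if line.startswith("*"):
            if current is not None:
                yield current
            speaker, _, text = line[1:].partition(":")
            current = (speaker, text.strip())
            in_main = True
        elif line.startswith("\t") and in_main:
            # Continuation of the main tier that is being collected.
            current = (current[0], current[1] + " " + line.strip())
        else:
            # Header, dependent tier, its continuation, or a blank line.
            in_main = False
    if current is not None:
        yield current


# Made-up transcript fragment for illustration only.
sample = [
    "@Begin\n",
    "*ABC:\tmae hi yn nice .\n",
    "%gls:\tbe.3S.PRES she PRT nice\n",
    "*DEF:\tyeah .\n",
    "@End\n",
]
utterances = list(main_tier_lines(sample))
print(utterances)
```

This is a simplification: it ignores everything except the main tier and does no validation, so CLAN remains the tool to use for anything that depends on the full format.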

To gloss the Miami and Patagonia corpora we are using autoglossing software we have developed in-house. To mine all three corpora we are using a variety of techniques, including the output from the autoglosser.

Ethics

The ESRC Centre has collected these materials following the ethical guidelines set out in the Talkbank Code of Ethics.