by Christina Wasson


As Susan Kung described in her CELP blog post, language archives are online repositories that allow people around the world to access data on endangered languages. Language archives are potentially valuable resources for a variety of groups. Most importantly, they provide information that can help indigenous groups engage in language preservation and revitalization. In addition, they can provide data for linguists conducting research, as well as for researchers in other fields, students, and even artists.

Starting in the early 2000s, more and more language archives were created as indigenous groups and linguists awoke to the crisis of language endangerment, while at the same time burgeoning Internet technologies made online repositories easy to build. The creation of new language archives is ongoing.

Language archives have evolved a great deal in their short existence. Many archive managers have put a tremendous amount of effort into developing increasingly sophisticated sites. A large amount of linguistic data on many languages is now online.

One of the ways in which language archives can continue to develop is by conducting research with their intended user groups, to identify ways in which their existing or planned offerings may not be a perfect fit for the needs of these users. Such research and its application are often termed “user-centered design.” Designing or re-designing language archives to more closely match their users’ needs could increase the number of people who make use of archives, and it could make archives more useful to these people.

In collaboration with multiple partners, I have embarked on a long-term research trajectory to bring the fields of user-centered design and language archives into dialogue. This activity appears to address a perceived gap; I have encountered considerable interest and enthusiasm both from managers of language archives and from archive users.


As a first step in this research trajectory, Gary Holton and I organized the Workshop on User-Centered Design of Language Archives, held in February 2016 (funded by NSF grants BCS-1543763 and BCS-1543828). This workshop brought together representatives of the following stakeholder groups:

1. Language communities

2. Linguists

3. Archivists

4. User-centered design practitioners

5. Funding agencies

Workshop discussions produced several valuable outcomes, including documentation of the diverse stakeholder group perspectives, creation of a language archive typology, and identification of current access issues.

Figure 1. Workshop on User-Centered Design of Language Archives

Exploratory User Research for CoRSAL

The second step in this research trajectory was a project carried out by my fall 2016 Design Anthropology class. The students conducted user research for a future language archive planned by Shobhana Chelliah. This archive will collect data on Tibeto-Burman languages. We are currently referring to it as CoRSAL, the Computational Resource for South Asian Languages. (The name may change in the future, so “CoRSAL” should be thought of as a placeholder.)

I will present a few of the key findings from this project, to show how an understanding of user needs can contribute to the design of a language archive and encourage its use. However, I want to emphasize that while these findings are suggestive, they are not definitive. We are still far from the long-term goal of developing guidelines or best practices for the user-centered design of language archives. The class project was limited in scope, and targeted to the particularities of one language archive.

The research conducted by my class was structured to identify the needs of the four main user groups for whom CoRSAL is intended:

  1. Communities that speak Tibeto-Burman languages (represented for this project by the Lamkang in the Indian state of Manipur)
  2. Linguists who will use CoRSAL as a source of data
  3. Computational linguists
  4. Depositors, i.e. linguists or community members who will place their data in the archive

Students conducted in-depth, semi-structured interviews with representatives of these four groups, for a total of 16 interviews that typically lasted 1 to 1½ hours. Interviews with members of the Lamkang community were conducted by phone and audio-recorded. Interviews with other participants were conducted face-to-face or by Skype, and video-recorded. Fieldnotes or transcripts were produced for all interviews, and analyzed with the help of Dedoose, an online qualitative analysis program.

Following are brief highlights of the class’s findings about the needs of each user group.

Needs of Language Communities

  • If a community’s language was historically not written, its orthography may still be under development, and/or more than one writing system may exist. So a preliminary need may be the development of a standardized orthography. Written materials are only useful if community members can read them.
  • By the same token, the community may have a need for pedagogical materials to help teach members how to read and/or speak their language. Data in language archives are often formatted for use by linguists and other researchers, rather than for language learning.
  • For many indigenous groups, Internet access and computer access may be limited. This constraint needs to be addressed in the design of online archives in order to ensure their usefulness to such indigenous groups, for instance by offering printable documents.

Needs of Linguists Who Will Use CoRSAL as a Source of Data

  • Linguist researchers want to be able to quickly examine a language archive and find out if it includes corpora that are useful for their research agenda. One of the most prevalent challenges for researchers is that relevant corpora in language archives are often difficult to find and access. Interface design, browse, and search functions are common sources of difficulty.
  • Efficient searches for data are also hindered by the lack of metadata standardization across language archives, although work is being done in this area.
  • Linguists need to be able to easily interpret the data in a corpus in order for the corpus to be useful in their research. The lack of standardization in annotation practices can make this difficult. Furthermore, many corpora do not include adequate descriptions of the author’s annotation system.
  • One of the most exciting potential uses of language archives for linguistic research would be the ability to compare parallel data across languages. However, at present, language archives are not set up to enable this.

Needs of Computational Linguists

  • Computational linguists who wish to use language archive data for their research face the same issues as other kinds of linguists.
  • In addition, they need data in a machine-readable format; PDF and Word files are particularly difficult to work with.

Needs of Depositors

  • The logic of archiving is at odds with the logic of linguistic analysis. Archives seek the deposit of finalized, permanent documents. But linguistic analysis is never final; it is an ongoing process. The difficulty of updating deposited materials is a significant deterrent to potential depositors.
  • Depositors need to have a depositing process that is as quick and easy as possible, since their time is limited.
  • Linguists are often concerned about releasing data to the public before they have fully published their analyses, because they don’t want other researchers to preempt their findings. It would be helpful to develop ways to protect the intellectual property rights of depositors, and to ensure that they are recognized in publications using their data.

The students in my Design Anthropology class presented a more exhaustive set of findings to the CoRSAL development team than the summary presented here. Furthermore, in spring 2017, a design class at the Illinois Institute of Technology’s Institute of Design, taught by Santosh Basapur, used the research from my class to develop interface design prototypes for CoRSAL. Another set of design students has continued to work over the summer under Santosh’s guidance. So this is an ongoing project.

Figure 2. Santosh Basapur’s Students (on Right) Presenting to UNT Group (on Left) via Videoconference


My hope is that the summary of findings listed above makes a case for the value of conducting research on the needs of user groups of particular language archives.

Here are some further information resources:

  • The client report from my Design Anthropology class can be accessed from this page: It is the Al Smadi et al. 2016 publication. You will also find other relevant publications on that page.
  • A website on the user-centered design of language archives developed by me and various partners is here: It was initially created to accompany the Workshop on User-Centered Design of Language Archives described above, and includes publications that developed from the workshop.

Thank Yous

The research trajectory described here has involved numerous collaborators – I am grateful to all of them! They include: Gary Holton, the participants in the Workshop on User-Centered Design of Language Archives, my amazing research assistants Heather Roth and Emma Nalin, the students in my fall 2016 Design Anthropology class, Santosh Basapur, Santosh’s students in spring and summer 2017, Shobhana Chelliah, and the rest of the CoRSAL development team. A heartfelt thank you to everyone, and I look forward to continued collaborations!