Parallel corpora as tools for investigating and developing minority languages

Trond Trosterud

University of Tromsø

Abstract

The article consists of a principled discussion of how parallel corpora can be used when working with grammatical documentation and lexicographic and terminological language planning for minority languages. An important point is that explicit tools are needed to transfer results achieved in majority language research over to other languages which generally have fewer resources available.

 

1. Introduction

Corpus linguistics is dominated by English, and the work on parallel corpora is no exception. Stig Johansson (Johansson 1998) presents the Nordic work on parallel corpora as a wheel, where English is the hub, and where all relations are expressed along spokes via English to the more peripheral languages, as shown in the figure below (op.cit. p. 9):

 

Figure 1 Nordic work on parallel corpora (cf. Johansson 1998:9)

In work on parallel corpora, techniques have been developed for contrastive syntactic analysis, for investigation of both manual and machine translation techniques, and for terminological and more general lexicographic work. These are resources that are needed for all languages, and in particular for minority languages, which do not have the advantage of having many linguists available. Rather than moralising over the fact that all the resources (in this case even Nordic resources) are literally concentrated on English, I want to focus on methods for connecting the end nodes in the graph above directly to each other, thus creating a true network. In this way, we will be able to utilise the groundwork that has already been done, here in the English-Norwegian Parallel Corpus (ENPC) project.

Section 2 will establish a typology of language pairs, and section 3 will look at the status quo for parallel corpus work within some of the types established in section 2. Then follow sections on grapheme encoding, on lexicography, and on the challenges that arise from a broader typological variation among the languages involved. A summary concludes the article.

2. A typology of language pairs

Two important factors governing languages’ relation to each other are their degree of expandedness and their areal coexistence. An expanded language is a language used for all formal purposes in running a modernised society. Crossing the two factors gives the six types of language pairs shown in Table 1. Each cell is numbered, from 1 to 6.

 

Table 1 A typology of language pairs

                              expanded/expanded                     expanded/unexpanded                           unexpanded/unexpanded
Spoken in the same area       1. Finnish/Swedish, French/Dutch, …   2. Swedish/Northern Sámi, English/Maori, …    5. Lule Sámi/Northern Sámi
Not spoken in the same area   3. Swedish/English, …                 4. English/Northern Sámi                      6. Lule Sámi/Maori

 

Each of these types has its own characteristics, relevant to the work on parallel corpora. These will be treated separately in the text below.

2.1. Type 1: Expanded languages spoken in the same area

Language pairs of this type are characterised by a good stock of parallel texts from all textual genres: fiction, technical, scientific, and administrative texts, which are often required to be available in both languages. Cultural terminology is parallel, often with translation loans from the more dominant language. A typical effect of this is that when a language is spoken in several countries, terminological differences will emerge as a result of the different linguistic settings. Thus, the Finland-Swedish term hemvårdsstöd ‘home care support’ (a monthly sum paid to parents who do not use municipal kindergartens) is a literal translation of the parallel Finnish term kotihoidontuki, and is a term that does not exist in standard Swedish. Its Norwegian counterpart kontantstøtte ‘cash support’ is semantically opaque and does not, strictly speaking, have the same referent, since the conditions for availing of these systems differ. A similar lexical field may be names for educational degrees, where the referents of the terms in coexistent languages are identical. Coexistent expanded languages also have good dictionary resources, as is the case for both the language pairs quoted above. The dictionaries are often symmetrical; in the Swedish/Finnish case, for example, they pay equal attention to the Swedish and the Finnish user.

 

2.2. Type 2: Expanded and unexpanded languages spoken in the same area

In such settings, there will be far fewer texts available for the unexpanded language. In Europe this language will typically be a minority language, but in Sub-Saharan Africa, for example, all majority languages fall into this category, forming pairs with the dominant language of the former colonial powers. If the country has good language legislation, there will be many parallel administrative texts available, and most or all of the translations will be from the expanded language. This is the case in the Sámi setting, where politicians and bureaucrats have had their formal education in the majority language. They write and read their documents in the majority language, and the bilingual goals are taken care of by professional translators, who translate piles of documents into the minority language. As an exception to this, fiction may be available in parallel form with the unexpanded language as the primary language. Language expansion beyond the level of school primers often begins when creative members of the minority community find that they want to express themselves in their mother tongue, and realise that if they want to reach the majority community as well, their work needs to be available in parallel texts. The unexpanded language will typically lack (large parts of) the terminology needed in a modern society, it may have a rather young literary language, and its users may have received little or no education in or via their mother tongue. More often than not, dictionaries will be available from the unexpanded to the expanded language.

Taking Norwegian/Northern Sámi as an example, there are now established routines for electronic storage of newspaper texts, and publishers have routines for saving their publications electronically. The genre for which such routines are still missing is administrative language. Since Northern Sámi functions in bilingual societies, most or all administrative texts and most fiction are available in two languages, often also in three (Northern Sámi, Norwegian/Swedish and Finnish). For Northern Sámi, which has no monolingual domain, the majority of the available text corpus (except newspaper texts) will simply be bilingual, translated either from or into Northern Sámi. Within some genres, almost all translations will be into Northern Sámi. At the same time, the total corpus is very small compared to what we find for the state languages. One consequence of this is that, in order to get large corpora, we must accept both a lack of balance between source texts and translated texts and more heterogeneous corpora than for larger languages.

For lexicographical and terminological work, at least, size will be more important than homogeneity. Again Northern Sámi behaves as expected: we have had large Northern Sámi–Norwegian dictionaries since the thirties, but (at the time of writing) still no large Norwegian–Northern Sámi dictionary. The existing dictionaries are typically asymmetrical, giving grammatical information about the unexpanded language only.

 

2.3. Type 3: Expanded languages not spoken in the same area

For language pairs like English/Swedish there are many parallel texts in both directions, although far more texts are translated from the larger to the smaller language. The dominating genre is fiction, and to a certain extent also technical and scientific texts (popularised texts written for a larger lay audience will probably be translated more often). The genre more or less missing is administrative texts; since the languages do not overlap geographically, there is also little need for parallel texts here (European Union documents are the exception). So, when Johansson reports difficulties in finding enough translated texts from Norwegian into English, this is due to a missing overlapping domain. With increased migration, an increasing number of texts will appear both in the domestic language and in the major immigrant languages (thus both in English and in other expanded and unexpanded languages), but at least until now this has been a small subset of the body of official text. Government white papers, reports from ministries and local administrative bodies, etc., are not usually translated into languages not in use in the areas they cover. Correspondingly, much less has been done on establishing terminology for e.g. Swedish phenomena in English and German. The dictionary situation between expanded languages is very good whenever there is a school market (e.g. between Swedish and English/German/French), but such dictionaries are systematically asymmetrical, favouring the language learners. Overall, dictionaries between expanded languages are rapidly improving due to better lexicographical resources, and language pairs like Swedish/Dutch typically result in symmetrical dictionaries.

 

2.4. Type 4: Expanded and unexpanded languages not spoken in the same area

Language pairs like English/Northern Sámi have very few parallel texts, limited to translations of world literature into the unexpanded language, to important international treaties, and to translations of mythological texts etc. into the expanded language, for linguistic or literary reasons. In cases where the coexisting majority language community has not been able, for one reason or another, to conduct research on its minority language by itself, dictionaries and grammars are probably written in German or in another colonial language. For many languages there exists a large body of anthropological texts (creation myths, charms, riddles, and fairy tales), presented in phonetic transcription and with parallel text translated sentence by sentence into e.g. German or English. Thousands of pages of such texts are available for all circumpolar minority languages; indeed, for most of them these text collections represent the largest written corpus available, and certainly the largest corpus representing the language at a stage before massive bilingualism and assimilation policies had taken their toll.

2.5. Types 5 and 6

Language pairs like Lule Sámi and Northern Sámi (Type 5) may have a small set of parallel texts available whenever official language policy treats them similarly. When the Soviet Union developed a language policy for all its minorities in the 1920s and 1930s, it translated the same primers and texts into all 30 small languages of the Northern Areas. Lexical resources for type 5 language pairs are rare, but they exist. Type 6 is put into the table for completeness only, and will not be discussed further.

3. Parallel corpus work on the different language pair types until now

The prototypical parallel corpus falls within type 3 in the above typology, and it is more often than not made and maintained by a Department of English (or perhaps German) at a university in the country of the language that is paired with English. Thus, universities in each of the Nordic countries have parallel corpora between their own Nordic languages and English. This is a sort of inverted developmental aid, where the study of the single most investigated language of the world is being developed by linguists from other language communities. Despite the massive body of existing parallel texts within the type 1 group, at least within the Nordic countries, parallel corpus work on language pairs of this type has been non-existent, or atypical at best. Type 2 language pairs fare no better, despite the fact that there are texts available, and that we may expect parallel corpus work to shed light upon the language contact phenomena which are to be expected given the massive bilingualism among the minority language speakers. In addition, terminological needs speak in favour of parallel corpus work within this group, as we shall see. As for the type 4 languages, some anthropological texts are electronically available both in the original and in translation, but research on these texts has so far not treated the original text and the translation as parallel texts in the way that is done in parallel corpus research. At least in the Nordic countries, minority languages have hitherto not been investigated with the aid of parallel corpora.

 

4. Access to parallel corpus texts for minority languages

The total body of texts available for minority languages differs from the texts available for majority languages. Whenever the language pair types have overlapping domains, and the minority language enjoys literacy, parallel texts will still be available. For Northern Sámi/Norwegian it should be possible to build a parallel corpus as large as the ENPC corpus, but the criteria for choosing texts must be relaxed in several respects. Even Southern Sámi, a language with approximately 500 speakers, has a corpus of more than half a million words, and most of the texts are available in translation to or from Norwegian or Swedish. Minority language communities may be sceptical towards attempts by research institutions to come and “take the texts from them”. The copyright issue must also be dealt with. For new literary languages, no texts old enough to escape copyright legislation will be available.

 

 

4.1. Administrative texts

As pointed out above, administrative texts are translated from the majority to the minority language, for political and legislative reasons. This body of text is important for several reasons. Recent texts of this type are stored on computers scattered around the respective administrations, and establishing routines for collecting such texts should have high priority, in order to prevent text loss when computers are replaced or data are otherwise erased. Due to the formal nature of the texts, one may expect their language to be less close to idiomatic usage than that of other genres. But when it comes to terminological work, these texts are of the utmost importance, since it is precisely these formal settings that call for the creation of new terminology.

 

4.2. Fiction

Fiction, on the other hand, is translated from the minority to the majority language, in order to gain access to a larger readership. Children's books represent a tendency in the other direction: here, majority language books are translated into the minority language for educational purposes. Recent novels will probably be available electronically. This is important for minority languages, since software for scanning texts has not been developed for such languages. Whenever the minority language in question contains unique or rare graphemes, these will make scanning of texts harder.

For research on syntax, fiction is an important genre, since this is where we may expect that the data come closest to actual usage.

 

 

4.3. Scientific texts

The ENPC has already reported problems finding a sufficient number of Norwegian–English scientific texts. There are very few scientific texts written in minority languages as compared to other genres, and whenever there are such texts, they do not necessarily have parallel texts in the majority language. Thus, creating a parallel corpus as balanced as the ENPC with regard to scientific texts is in most cases unrealistic. Still, existing texts should be collected, in order to support terminological work.

 

4.4. The Bible

The Bible is the most important parallel text available, with the whole Bible translated into approximately 320 languages, the New Testament into 900 languages, and Bible fragments being available for at least another 800 languages (Barbara Grimes, p.c.). This corpus is of course well-suited for parallel corpus research, as the paragraphs are already numbered and aligned. Still, there are problems with this text material. There are two schools of Bible translations, one aiming at delivering a translation as close to some original text as possible, whereas the other aims at a language as natural and idiomatic as possible. At least syntactic projects should hope for the former option, but otherwise stay away from Bible translations whenever possible. From the viewpoint of the typologist and the descriptive grammarian, I would like to see both parallel corpus software and linguistic work geared especially towards Bible texts, thereby making this huge text collection available to researchers in a more systematic way.
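Because the verses are already numbered, alignment of two Bible versions reduces to matching identifiers. The following is a minimal sketch of that idea; the function name, the identifier format ("MAT 1:1") and the placeholder texts are assumptions made for illustration and do not describe any existing Bible corpus or tool.

```python
from typing import Dict, List, Tuple


def align_by_verse(a: Dict[str, str], b: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (verse id, text in a, text in b) for every verse found in both versions.

    Verses missing from one version (e.g. when only fragments are translated)
    are skipped; the order follows version `a`.
    """
    return [(verse_id, a[verse_id], b[verse_id]) for verse_id in a if verse_id in b]


if __name__ == "__main__":
    # Placeholder texts only; the identifier format is hypothetical.
    version_a = {"MAT 1:1": "<verse text, language A>",
                 "MAT 1:2": "<verse text, language A>"}
    version_b = {"MAT 1:2": "<verse text, language B>",
                 "MAT 1:3": "<verse text, language B>"}
    for verse_id, text_a, text_b in align_by_verse(version_a, version_b):
        print(verse_id, text_a, text_b, sep="\t")
```

The sketch only pairs verses that carry the same identifier in both versions; whether the paired texts are close translations of each other still depends on the translation school discussed above.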

 

4.5. Summary

Minority languages differ from expanded languages mainly in that the total body of texts is both much smaller and distributed across fewer genres. This is problematic from a general linguistic point of view, since it will be harder to build a parallel corpus that constitutes a representative sample of the written manifestation of the speakers' language use. From this it does not follow that such corpora should not be assembled, but that the results emerging from work on them should be interpreted with caution. On the other hand, language planning and development need as many parallel texts from as many genres as possible, in order to capture as many aspects of actual language usage as possible.

5. Technical considerations

5.1. Text alignment programs

Basic software for parallel corpora is developed as part of the parallel corpus work, and is thus made for English and other majority languages. Still, existing software can quite easily be modified to cope with other language pairs as well. As an example illustrating this, I would like to report on work I have done to extend the Translation Corpus Aligner (TCA, a text aligner described in Hofland and Johansson 1998) to new language pairs, in this case Norwegian/Finnish. The program is structured in the following way: A window of 15 sentences is led through the two texts to be aligned, with an overlap of five sentences. The texts are then compared according to different parameters. Most of these parameters are language-independent: the program records matching words with an initial capital, characters like colon, question mark and exclamation mark, certain tags (like start and end of division, heading of paragraph, etc.), and the number of characters in the sentence. A separate component also extracts cognates automatically, by matching words that have a certain number of equal characters or digraphs. The program contains one language-dependent parameter, a so-called anchor word list: a list of approximately 1000 translation pairs, excluding the most frequent words of the respective languages. Hofland and Johansson report an error rate of 1.98% for an alignment of 93,000 English/Norwegian sentence pairs.
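The windowing and the cognate check described above can be sketched as follows. This is a minimal illustration and not the TCA implementation: the function names, the use of character bigrams, and the threshold of three shared bigrams are assumptions made for the example; only the window size of 15 and the overlap of five sentences are taken from the description.

```python
from typing import Iterator, List, Set, Tuple


def windows(sentences: List[str], size: int = 15,
            overlap: int = 5) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start index, window) pairs; neighbouring windows share `overlap` sentences."""
    if not sentences:
        return
    step = size - overlap  # with the defaults, each move brings in 10 new sentences
    start = 0
    while True:
        yield start, sentences[start:start + size]
        if start + size >= len(sentences):
            break
        start += step


def bigrams(word: str) -> Set[str]:
    """Return the set of character bigrams (digraphs) of a lower-cased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def looks_cognate(a: str, b: str, min_shared: int = 3) -> bool:
    """Assumed criterion: two words count as cognates if they share enough bigrams."""
    return len(bigrams(a) & bigrams(b)) >= min_shared


if __name__ == "__main__":
    text = [f"sentence {i}" for i in range(30)]
    print([start for start, _ in windows(text)])
    print(looks_cognate("information", "informasjon"))
```

A threshold on shared bigrams is only one possible reading of "a certain number of equal characters or digraphs"; the appropriate setting may well differ between language pairs.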

In addition to the English/Norwegian TCA, similar aligners have been made for the language pairs shown in Figure 1, thus also for English/Finnish. It turned out that the English/Norwegian and English/Finnish anchor word lists were made according to exactly the same English list of key words. In order to obtain a Norwegian/Finnish aligner, I simply put the two lists beside each other, and then removed the two English columns. The resulting list had to be edited, but the work got off to a flying start. The language-independent part of the aligner was simply kept constant. Possible improvements of the language-independent module could be to check whether there are systematic differences in sentence length, and whether the automatic cognate extraction should be set differently for Finnish/Norwegian than for English/Norwegian. Whereas the English/Norwegian TCA was tested on texts encoded according to the TEI standard (including paragraph marks, etc.), the Finnish/Norwegian TCA was tested on raw texts only, and the error percentage was higher than 1.98%. Still, the result of a couple of hours' work was a working TCA. It is thus easy to see how the hub and spokes of Figure 1 may be transformed into a true network, and how, for instance, an additional Northern Sámi anchor list may be linked both to the Norwegian and the Finnish anchor word lists, thus creating aligners for two language pairs at the same time.
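Joining two anchor word lists via their shared English key words can be sketched in a few lines. The function name and the list format (pairs of English key word and translation) are assumptions for illustration, and the entries are toy examples rather than items from the actual TCA anchor lists.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pivot_anchor_lists(en_no: List[Pair], en_fi: List[Pair]) -> List[Pair]:
    """Pair Norwegian and Finnish anchor words that share an English key word."""
    fi_by_en: Dict[str, List[str]] = {}
    for en, fi in en_fi:
        fi_by_en.setdefault(en, []).append(fi)
    # The English column is used for the join only and is dropped from the result.
    return [(no, fi) for en, no in en_no for fi in fi_by_en.get(en, [])]


if __name__ == "__main__":
    # Toy entries for illustration only.
    en_no = [("house", "hus"), ("water", "vann"), ("book", "bok")]
    en_fi = [("house", "talo"), ("water", "vesi"), ("year", "vuosi")]
    for no, fi in pivot_anchor_lists(en_no, en_fi):
        print(no, fi)
```

A list produced in this way still needs manual editing, since an English key word may have been chosen for a sense that the two other languages do not share.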

The conclusion is that the transfer value of work already done on type 3 language pairs is large, even when that work may seem to have a language-specific character.

5.2. Encoding

An often underestimated problem is the encoding of letters that are not found within the repertoire of ISO/IEC 8859-1 (Latin 1), since the tools used to analyse languages are, to date, limited to that repertoire. Most Sámi written languages have letters that are not included in Latin 1. The corpus compiler must make principled choices and establish routines for text transposition. Preferably, the chosen standard should be identical to that of other corpora for the same language. One fallback option is to make ad hoc digraphs for the letters in question (this was done for the only electronically available Northern Sámi corpus, at UHLCS in Helsinki); another is to use arbitrary signs from within the available code table (‰, £, $, ™, etc.), or to use other code tables (e.g., Latin 4 in this case). In the long run the only viable solution to this problem is to encode all text using the Universal Character Set (ISO/IEC 10646 or Unicode), a standard that is intended to contain all characters used to render natural languages. In the short run the best solution will probably be to use a tailored 8-bit code table, and to make sure that automatic conversion to the UCS is possible.
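What automatic conversion between a tailored 8-bit code table and the UCS could look like is sketched below. The byte assignments are invented for this example and do not reproduce any existing Sámi code table; the function names are likewise assumptions. The sketch treats every byte that is not overridden as Latin 1.

```python
from typing import Dict

# Hypothetical byte assignments for the Northern Sámi letters missing from Latin 1.
# A real corpus would document its own table; these slots are invented here.
SAMI_OVERRIDES: Dict[int, str] = {
    0xA1: "Č", 0xA2: "č",
    0xA3: "Đ", 0xA4: "đ",
    0xA5: "Ŋ", 0xA6: "ŋ",
    0xA7: "Š", 0xA8: "š",
    0xA9: "Ŧ", 0xAA: "ŧ",
    0xAB: "Ž", 0xAC: "ž",
}


def to_ucs(data: bytes, overrides: Dict[int, str] = SAMI_OVERRIDES) -> str:
    """Decode tailored 8-bit text: overridden bytes first, Latin 1 for the rest."""
    return "".join(overrides.get(byte, chr(byte)) for byte in data)


def from_ucs(text: str, overrides: Dict[int, str] = SAMI_OVERRIDES) -> bytes:
    """Encode text into the tailored table; refuse characters it cannot represent."""
    reverse = {char: byte for byte, char in overrides.items()}
    out = bytearray()
    for char in text:
        if char in reverse:
            out.append(reverse[char])
        elif ord(char) < 256 and ord(char) not in overrides:
            out.append(ord(char))
        else:
            raise ValueError(f"{char!r} has no slot in the tailored table")
    return bytes(out)


if __name__ == "__main__":
    raw = bytes([0xA2, 0xE1]) + b"hci"
    text = to_ucs(raw)
    print(text, text.encode("utf-8"))
```

Keeping the table as explicit data, rather than scattering replacements through the corpus tools, is what makes the later move to the UCS a mechanical step.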

6. Linguistic considerations

6.1. Grammar

From a purely linguistic point of view, the lack of work on type 1 and type 2 language pairs within parallel corpus linguistics is unfortunate. Parallel corpora are suitable for research on language contact phenomena, and contact situations are found whenever we have bilingual speakers, that is, whenever languages are spoken in the same area. Type 3 language pairs, on the other hand, lend themselves to contrastive research and studies on translation theory; as expected, this kind of work has dominated linguistic work on parallel corpora until now.

Due to their rich body of parallel texts, the type 1 language pairs are the ones that are most likely to offer large and representative parallel corpora, and whenever two expanded languages are spoken by a bilingual population, parallel corpus work should be expected to give valuable input to the study of language contact. For type 2 language pairs we will not be able to make equally good parallel corpora, but the influence from the majority language grammar is probably larger on speakers of unexpanded languages than it is on speakers of expanded languages. To take just one example, since bilingual Finland-Swedish speakers have had their formal education in Swedish, their Swedish will constitute an independent grammatical system to a larger extent than will the language spoken by Finnish Sámi speakers, for whom Sámi has been restricted to the domestic sphere. Parallel corpora including minority languages may thus offer insights into language contact situations.

Parallel corpora may also provide input for machine translation projects, although the usefulness of parallel corpora partly depends upon the choice of MT technology. Machine translation for minority languages may seem utopian at a moment when even majority languages possess only bad MT systems. Still, in the future, bilingual administrations will be dependent upon machine translation, and the foundation for such systems is already being laid today.

 

 

6.2. Lexicography and terminology

The most important practical application of parallel corpora to and from minority languages is terminology development and lexicography. Minority languages in a modernised society need dictionaries with the majority language as primary language, and the challenge facing this work is the development of terminology and the extension of vocabulary. A screening of the actual use of new terms will evidently speed up this work. Dyvik (1998) sketches how parallel corpora may form a starting point for work within lexical semantics. By picking the translations of the translations of lexeme x, removing x from the set so obtained, gathering the translations of the members of this set, and organising the result into a set of more or less overlapping sets of translations, one gets both a proposal for a semantic structure for lexeme x and a systematised list of the terms that are actually in use as different translations of x. Automated and combined with morphological parsing and checking of new lemmas against a lexicon, this procedure identifies new terms and their translations, and it may thus be an effective aid in terminological work.
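The set operations in this procedure can be sketched as follows. This is a simplified reading of the steps described above, not Dyvik's own implementation: the function name and the representation of corpus-derived translation links as two dictionaries of sets are assumptions, and the toy entries are for illustration only.

```python
from typing import Dict, Set, Tuple

Lexicon = Dict[str, Set[str]]


def translation_sets(x: str, src2tgt: Lexicon,
                     tgt2src: Lexicon) -> Tuple[Set[str], Set[str], Dict[str, Set[str]]]:
    """Return the translations of x, the translations of those translations
    (with x removed), and the translations of each member of that set."""
    first = set(src2tgt.get(x, set()))
    back: Set[str] = set()
    for t in first:
        back |= tgt2src.get(t, set())
    back.discard(x)
    # The overlaps between these sets and `first` suggest a sense structure for x.
    related = {y: set(src2tgt.get(y, set())) for y in sorted(back)}
    return first, back, related


if __name__ == "__main__":
    # Toy translation links (Norwegian to English and back), not corpus results.
    no2en: Lexicon = {"tak": {"roof", "ceiling"}, "himling": {"ceiling"}}
    en2no: Lexicon = {"roof": {"tak"}, "ceiling": {"tak", "himling"}}
    first, back, related = translation_sets("tak", no2en, en2no)
    print(sorted(first), sorted(back), related)
```

In the toy data, the second lexeme shares only one of the two translations of x, which is the kind of partial overlap the procedure uses to separate senses and to list the terms actually in use.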

In practice, writing dictionaries that translate from majority into minority languages has proven to be very difficult. To take our example languages again, in the autumn of 1999 there exists no dictionary above the school dictionary level into Northern Sámi from any Scandinavian language, and the only slightly larger dictionary that translates into Northern Sámi (Sammallahti 1993, from Finnish) contains no more than 20,000 words. The typical case is that outsider linguists make dictionaries from the minority into the majority language (or an international language), in order to be able to conduct research on the language in question.

 

7. Conclusion

The general usefulness of parallel corpora is amply illustrated in the literature, both in this volume and elsewhere. Here I would like to summarise just why minority languages should participate in parallel corpus work, despite the shortage of texts as compared to texts for expanded languages.

The most important contribution from parallel corpus work will clearly be delivered within the field of terminology and lexicography. No language can function as an administrative language for a modernised society without both a developed terminology and means of accessing it. Work within terminology and lexicography plays a key role here, and parallel corpora are able to make this work far more efficient than is the case today.

In addition, research on bilingualism and language contact would benefit from parallel corpora involving minority languages, since in these situations we often find a high degree of bilingualism.

As for the technical problems, work done on majority language pairs should be utilised whenever possible. The problems of character encoding will hamper corpus linguistics for minority languages in future as well, but in principle, the solution is clear: text should be encoded according to the Universal Character Set.

8. References

Dyvik, Helge 1998: A translational basis for semantics. Johansson & Oksefjell (eds.) pp. 51-86.

Hofland, Knut 1996: A program for aligning English and Norwegian sentences. S. Hockey, N. Ide & G. Perissinotto (eds.): Research in Humanities Computing. pp. 165-178. Oxford: Oxford University Press.

Hofland, Knut & Stig Johansson 1998: The Translation Corpus Aligner: A program for automatic alignment of parallel texts. Johansson & Oksefjell (eds.) pp. 87-100.

Johansson, Stig & Signe Oksefjell (eds.) 1998: Corpora and Cross-linguistic Research. Theory, Method and Case Studies. Amsterdam-Atlanta: Rodopi.

Johansson, Stig 1998: On the role of corpora in cross-linguistic research. Johansson & Oksefjell (eds.) pp. 3-24.

Sammallahti, Pekka 1993: Sámi-suoma-sámi sátnegirji = Saamelais-suomalais-saamelainen sanakirja. Ohcejohka: Girjegiisá.

The University of Helsinki Language Corpus Server (UHLCS). http://www.ling.helsinki.fi/uhlcs/index.html