Funny characters on the net: How information technology may (or may not) support minority languages

[*]

Trond Trosterud, Barents Secretariat, Kirkenes.

BAR-IT conference, Apatity, 16.-20.9. 1996.

With present computer technology, and with technology that is hopefully not too far off, messages can be sent using 7, 8 or 32 bits. I will briefly go through how each of these possibilities may or may not transmit the letters of the languages of Northern Eurasia. Then I will concentrate on an evaluation of actual and possible solutions for the languages of the Barents Region: Scandinavian, Latin-based Sámi languages, Russian, Kildin Sámi and Nenets[1]. But first I will have a closer look at the cultural aspects of different choices of character set technology.

2. The cultural aspects of character set technology

I see the following points as crucial for an evaluation of the cultural aspects of different solutions to the problem of character set technology:

* Internet will become both cheaper and more available in the near future

* Internet is a very good long-distance communication device

* Internet's role as a local information channel will grow as it becomes more wide-spread

* Internet lowers the threshold for publication

I will comment upon them one by one, in the order mentioned.

2.1. INTERNET WILL BECOME BOTH CHEAPER AND MORE AVAILABLE IN THE NEAR FUTURE

At first thought, neither the Sámi in the Nordic countries nor the national minorities of Russia are prototypical Internet users. This will probably change. In the Nordic countries, Internet is making its way into primary and secondary education, and Internet servers are being established outside the universities as well. In Russia, all the indigenous peoples inhabit areas that are rich in natural resources: oil, gas, salmon rivers, etc. Foreign compensation for possible environmental damage may very well give us a situation with modern computer equipment in every Nenets kolkhoz, in the same way as we have seen in Alaska and Canada. Also, Internet is becoming cheaper and easier to use. The development of combined web browsers and television sets, which today is in its initial phase, will make Internet as widespread as teletext is today, first in cable TV networks, later also in other TV networks. "Web PCs", i.e. personal computers that can be used only as terminals for Internet browsing, will cost approximately $300, and their prices will probably drop even further.

2.2. INTERNET IS A VERY GOOD LONG-DISTANCE COMMUNICATION DEVICE

In the 1980s, we witnessed the birth of global cooperation among indigenous peoples, the so-called fourth world movement. Typically, non-assimilated indigenous peoples live far from communication centres, with slow postal services, etc. Internet already plays a major role in the global communication between different indigenous peoples' organisations, an even more central role than it does for majority population organisations, since in this case the ordinary communication alternatives are so much worse.

2.3. INTERNET'S ROLE AS A LOCAL INFORMATION CHANNEL WILL GROW AS IT BECOMES MORE WIDE-SPREAD

Whereas the usual use of Internet today is world-wide communication between the few, its role in local communication will grow as the number of users grows. Schools, municipalities, local shops, etc. will start using the Internet for local information and advertising, and not only to say "hello, here we are!" to the outer world.

As this happens, the use of languages other than English will grow. Choice of language always depends upon the receiver. Until now, the dominant position of English has been due to the fact that the receiver has always been an inhabitant of the big, anonymous Netland. When the receiver becomes the potential customer, the inhabitant of the municipality, etc., it is neither necessary nor suitable to write in English. Correspondingly, the choice of local language becomes more important.

2.4. INTERNET LOWERS THE THRESHOLD FOR PUBLICATION

In order to publish publicly available messages regularly, one has to control a TV or radio station, or a newspaper (or magazine). All this is very expensive. Access to a web server costs only a fraction of what it costs, e.g., to have a local newspaper printed or a radio programme broadcast. Thus, it will become easier for members of small language communities to create their own Bürgerliche Öffentlichkeit, their public space for discussion and exchange of ideas, and, if the technology permits it, in their own language.

2.5. LANGUAGE DEATH, TV AND INTERNET

TV has been called "cultural nerve gas". It may safely be regarded as one of the main factors (together with the public school system) leading to a situation where 50% of the world's languages are threatened by extinction (as compared to approximately 3-5% of plant species and 5-7% of animal species). To take a relevant example, the introduction of national TV in the Skolt Sámi village Sevettijärvi in the early 1970s coincided approximately with a sharp decline in the use of Skolt Sámi as the main everyday language among children.

A relevant question at this point is whether one should care. Isn't one of the good things about Internet the strengthening of English as a world-wide communication language, with a monolingual world without language barriers and communication problems as the ultimate result? However indifferent Norwegians, Finns or Russians might be to the future of the minority languages in their own countries, I think that we will all agree that a monolingual world is not a desirable state of affairs. Language is the single most important factor distinguishing humankind from animals, and the core of our civilisation is brought to us through our different mother tongues. For every language that is lost, one distinct pair of glasses through which the world can be seen is lost with it.

Most language minorities of the world have never received formal instruction either in or about their mother tongue. As a result, members of these minorities often complain that they speak a "difficult" language, that it is hard to read and next to impossible to write, etc., compared to how "easy" it is to use the majority language. This is of course due to the school system, but not only that. Mastery of orthographic rules often (especially when it comes to bad orthographies, as in the case of English or French) relies heavily upon frequent reading of forms and subsequent learning by heart. Thus, languages seldom seen in print become "difficult languages".

It is thus of the utmost importance for language minorities to conquer the new and important Internet territory. At the moment, things do not look too bright. As an example I can mention Northern Sámi on the Internet. Today, very few Sámi institutions are on the net. Four of them are the Swedish Sámi Parliament, the Sámi software producer Vplan, the Sámi College in Guovdageaidnu, and the leading Sámi publishing house Davvi Girji. All of them publish in Scandinavian languages and in English, none of them in Sámi. This is especially bad in the latter two cases, since knowledge of Sámi is necessary in order to follow the courses or read the books. In addition to reducing the amount of Sámi text available to the public, they also cut off potential Sámi readers in Finland (the reverse situation applies to Sámi sites in Finland written in Finnish).

All these Sámi institutions are devoted to supporting the Sámi language, and they use Northern Sámi as their administrative language. The reason why they do not use Sámi on the net is technical: they (or their Internet consultants) do not know how to put Sámi on the net, or they do not feel confident that potential readers will be able to read the Northern Sámi characters.

Still, work has been done to make it possible to read Sámi and other minority languages on the Internet, and even to make standards for rendering all written languages. The next sections will give further information about this work.

3. Standardisation work

Formal character set standards are defined jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), through their JTC1/SC2 (Joint Technical Committee 1 on Information Technology, Subcommittee 2 on Character set technology). Both industry representatives and linguists participate in this work. The official goal of these bodies is to make it possible to render all written languages of the world. SC2 has two Work Groups: WG2, dealing with the ISO/IEC 10646 standard on 32-bit character sets, and WG3, dealing with the ISO/IEC 8859 standard on 8-bit character sets.

3.1. WORK GROUP 3 AND ISO/IEC 8859

ISO/IEC 8859 is a set of 8-bit character set standards, the best known being 8859-1, or Latin 1, the 8-bit Internet standard[2].

Fig. 1. The left part of ISO/IEC 8859-1

The left part of 8859-1 is identical to the ASCII standard, or the 7-bit standard for English. The table is read as follows: Each symbol is represented by its co-ordinates, written as two hexadecimal digits, so that A = 41, a = 61, ] = 5D, etc. All code tables presented in this article have the code table in Fig. 1 as their left part. No deeper understanding of complicated computer technology is needed to read these tables. The only information they give is the link between what we see on the screen (the symbol) and what the computer "sees" (the number linked to the position of that symbol). The only thing that matters is that a specific letter will survive from one computer to the next (or from one font to the next) only if it is placed in the same position in both code tables[3].

The right part of 8859-1 contains most additional characters needed to write Western European languages:

Fig. 2. The right part of ISO/IEC 8859-1

Values for the symbols can be read off the table in the same way as for the left part, so æ = E6, À = C0, etc.
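The reading of the two tables can be checked mechanically. What follows is only a minimal sketch in Python, assuming that Python's built-in `latin-1` codec implements ISO/IEC 8859-1; the helper name `position` is invented for this illustration.

```python
def position(char: str, code_table: str = "latin-1") -> str:
    """Return the two-digit hexadecimal position of char in an 8-bit code table."""
    encoded = char.encode(code_table)  # raises UnicodeEncodeError if char is absent
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single position in {code_table}")
    return f"{encoded[0]:02X}"


# Three symbols from the left part (identical to ASCII), two from the right part.
for char in "Aa]æÀ":
    print(char, "=", position(char))
```

The numbers printed are the same co-ordinates that are read off the figures by hand.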

Latin 1 covers all major Western European languages, including the Scandinavian languages and Finnish. At the outset, three more Latin character sets were made, one for each part of Europe: Latin 2 (East), Latin 3 (South) and Latin 4 (North) [see appendix for reference]. Latin 4 was later revised as 8859-10, or Latin 6. Latin 6 included all the Latin-based languages of Northern Europe, including Estonian and the Baltic languages, and all the Sámi languages (except Skolt Sámi). The philosophy behind this organisation was thus that each corner of Europe should use its own standard, a standard shared by language majorities and minorities alike. Unfortunately, this did not work. Even though the majority language communities of Northern Europe were able to use Latin 6, they could also use Latin 1. Thus, instead of using a minority-friendly standard for solidarity reasons, they decided to use the Western European standard Latin 1. A similar process took place for Turkish: the Turks replaced the 6 Icelandic characters of Latin 1 with Turkish ones, and the result was accepted as an ISO standard, Latin 5, deviating only minimally from Latin 1 (the 6 deviating cells are hatched in the code table below).

Fig. 3. The right part of ISO/IEC 8859-9

(Latin 5)
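The six deviating cells can also be found by comparing the two code tables position by position. The sketch below assumes that Python's `iso8859_1` and `iso8859_9` codecs correspond to Latin 1 and Latin 5; the function name `deviating_cells` is hypothetical.

```python
LATIN_1 = "iso8859_1"
LATIN_5 = "iso8859_9"


def deviating_cells(table_a: str, table_b: str) -> list[tuple[int, str, str]]:
    """List right-part positions (A0-FF) where two 8-bit code tables differ."""
    cells = []
    for position in range(0xA0, 0x100):
        byte = bytes([position])
        char_a = byte.decode(table_a)
        char_b = byte.decode(table_b)
        if char_a != char_b:
            cells.append((position, char_a, char_b))
    return cells


for position, latin1_char, latin5_char in deviating_cells(LATIN_1, LATIN_5):
    print(f"{position:02X}: {latin1_char} -> {latin5_char}")
```

The comparison is restricted to the right part, since the left part is shared by all the tables discussed here.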

Due to the great number of Turkish immigrants (and the small number of Icelandic ones), the Netherlands has chosen to use Latin 5 instead of Latin 1 in its administration. Today, three more subparts of 8859 are out for voting in the relevant ISO/IEC committees, all of them tailored for one language or a small group of languages: Thai, Vietnamese and the official languages of the Baltic Rim.

My conclusion is that regional standards are not necessarily the right strategy. Rather, the Sámi (as well as any other users) should try to be as compatible with the mainstream as possible. Sámi users want to read French and Spanish, not Latvian and Lithuanian, regardless of the latter's Northern European location. Instead of viewing a given part of the 8859 standard as a container into which one can dump as many characters as possible, one should see each part as a functioning whole. Since 8-bit technology offers only the small 16 x 16 matrix in which to place the characters, the character set for each language community should be constructed to fulfil that community's needs.

Thus, a working committee for Standardisation of Sámi character sets[4] (with the generous help of Michael Everson) made a new standard (here called Latin 9 for reference) based on the following principles:

* The standard shall contain all Latin-based Sámi characters (also Skolt Sámi)

* The standard shall be as close to Latin 1 as possible

* As many technical (non-letter) symbols as possible from Latin 1 shall be kept intact.

The goal was to alter as little as possible, but still cover all Latin-based Sámi languages. Compared to Latin 6, what was sacrificed was the Latvian and Lithuanian characters. This sacrifice was easy to make, since the Balts do not use Latin 6 anyway. They have already rejected Latin 6 and made a unique Baltic standard, following the same principles as the ones outlined for Latin 9.

Fig. 4. The right part of Latin 9[5]

As a result of applying this philosophy in general, we end up with many tailored standards, each Latin 1-compatible and containing more non-letter characters, instead of fewer regional standards with many letter characters, correspondingly few non-letter characters and no Latin 1 compatibility. As long as we are dealing with an 8-bit space, this is also the best solution: the 8-bit character sets have only 96 positions (i.e. 48 letters, each in a capital and a small version) available in addition to the basic A-Z alphabet. This is simply not enough to satisfy more than a couple of languages. The best solution is to give each language community as good a standard as possible.

These solutions will hopefully be short-term solutions anyway: As the information technology community has become aware of the linguistic diversity of the world, it has become clear that 8 bits are not enough. The result of this insight was the 32-bit ISO/IEC 10646 standard, a standard that is meant to contain all the sound-, syllable- and word-signs of the languages of the world.

3.2. WORK GROUP 2 AND ISO/IEC 10646

With 32 bits at our disposal, we can define 4,294,967,296 distinct positions. In addition comes the possibility of combining several characters into one complex symbol. Thus, among standardisers, ISO/IEC 10646 (with its foreseeable regular revisions) is meant as the definitive solution to the character set problem. The character set is divided into 256 Groups, each containing 256 Planes, with 256 Rows that each have 256 Cells. The first Plane of the first Group, the Basic Multilingual Plane (BMP), is already partially defined. The BMP is intended to contain, among other things, all presently used letter symbols of the world. Since the BMP is in Group 00 and Plane 00, each of its characters can be expressed by a four-digit hexadecimal number, i.e., with 16 bits. Thus, to pick just one example, Windows NT is already capable of handling the characters of the BMP.
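The division into Groups, Planes, Rows and Cells amounts to reading a 32-bit number as four 8-bit parts. The following sketch illustrates that arithmetic under the layout described above; the function names are invented for the example and are not taken from the standard.

```python
TOTAL_POSITIONS = 2 ** 32  # number of positions addressable with 32 bits


def split_code(code: int) -> tuple[int, int, int, int]:
    """Split a 32-bit code position into (group, plane, row, cell)."""
    if not 0 <= code < TOTAL_POSITIONS:
        raise ValueError("code position must fit in 32 bits")
    group = (code >> 24) & 0xFF
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF
    return group, plane, row, cell


def in_bmp(code: int) -> bool:
    """True if the position lies in Group 00, Plane 00, i.e. in the BMP."""
    group, plane, _, _ = split_code(code)
    return group == 0 and plane == 0


print(TOTAL_POSITIONS)
# The Sámi letter eng has the four-digit value 014B: Row 01, Cell 4B of the BMP.
print(split_code(0x014B), in_bmp(0x014B))
```

A position in the BMP has zero in both Group and Plane, which is why Row and Cell, i.e. 16 bits, are enough to name it.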

All Latin-based Sámi characters are defined in the BMP, as are most Cyrillic-based characters. Still, there are blank spots on the map: Several of the Kildin Sámi characters, as well as the characters of some additional minority languages of Russia, are still missing from the BMP.

The problem with the Cyrillic-based characters is that there are so many of them. The alphabetisation of the minority languages of the Soviet Union in the 1920s and 1930s was one of the great intellectual achievements in the history of mankind. Still, one cannot but wish that the eager communists had not been quite so creative in inventing new characters. When more than one language has the same sound not found in Russian, the literary languages as a rule use different symbols for that sound. The situation is comparable to the Scandinavian æ-ø / ä-ö schism, only involving both more sounds and more symbols. As a result, the larger symbol set needs more space in the character set.

When the computer industry will move to this standard is an open question. Not surprisingly, both the left (7-bit American) and the right (8-bit Western European) part of Latin 1 are true subsets of the BMP. Thus, in Latin 1 "a" is 61 and "ä" ("a-umlaut") is E4, and in 10646 the values are 0061 and 00E4, respectively. Here again, as with the introduction of Latin 4 and 6, the question is: Why should the big ones bother? Why waste 16 bits when it is possible to write American with 7, Western European with 8, etc.? The answer must be both cultural and commercial: It must be a cultural demand to get computer technology that is an aid and not an obstacle to human activity, and it must be a commercial demand to get more advanced products. Luckily, ISO/IEC 10646 can offer more than just characters for small languages: It contains all the symbols of the rather big languages Chinese and Japanese, and it offers standardised characters like the Roman numerals i, ii, iii (defined as single number symbols, not as multiple instances of the letters i, v and x), etc.
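The subset relation can be verified for all 256 positions of Latin 1 at once. This is a sketch only: it assumes that Python's strings, which use Unicode code points, agree with ISO/IEC 10646 for the characters concerned, and that the `latin-1` codec implements 8859-1.

```python
import unicodedata

# The two examples from the text: the same number, padded to four hex digits.
for char in "aä":
    latin1_value = char.encode("latin-1")[0]
    print(f"{char}: Latin 1 = {latin1_value:02X}, 10646 = {ord(char):04X}")

# Every Latin 1 position keeps its number in the larger character set.
latin1_is_subset = all(
    ord(bytes([position]).decode("latin-1")) == position
    for position in range(0x100)
)
print(latin1_is_subset)

# A Roman numeral is one single character, not a sequence of the letter i.
roman_two = "\u2171"
print(unicodedata.name(roman_two), len(roman_two))
```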

4. Different implementation solutions

4.1. THE 7-BIT SOLUTION

Some non-English characters may be rendered in HTML by named codes: &auml; for ä, &oslash; for ø, etc. In this way, many of the languages with few extra characters get along. This method survives 7-bit channels, but it is cumbersome and restricted: very few characters are defined this way, it is not exactly easy to use, and it is hard to write search strings.
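How cumbersome the 7-bit method is can be seen by converting a short text. The sketch below uses Python's `html` module; note that its entity table reflects later HTML versions than the one discussed here, and that the fallback to numeric references for characters without a name is an assumption made for the example. The function name `to_seven_bit` is hypothetical.

```python
import html
from html.entities import codepoint2name


def to_seven_bit(text: str) -> str:
    """Replace each non-ASCII character by a named HTML code if one exists,
    otherwise by a numeric character reference (assumed fallback)."""
    pieces = []
    for char in text:
        code = ord(char)
        if code < 0x80:
            pieces.append(char)
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")
        else:
            pieces.append(f"&#{code};")
    return "".join(pieces)


encoded = to_seven_bit("blåbær og fløte")
print(encoded)                 # what travels through a 7-bit channel
print(html.unescape(encoded))  # what the reader's browser shows
```

A search for the word "fløte" in the encoded text has to be written as "fl&oslash;te", which illustrates the problem with search strings.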

4.2. THE 32-BIT SOLUTION, OR ISO/IEC STANDARD 10646-1

Here, all Latin-based European and most Cyrillic-based European characters are already defined. The problem here is one of use: small languages cannot afford to wait for the technology to be available on a broad basis. We simply need provisional solutions while waiting for the perfect one.

4.3. THE 8-BIT SOLUTION

ISO/IEC 8859-1, or Latin 1, makes it possible to write most Western European letter symbols directly on the net (including all Scandinavian languages, Finnish and Icelandic). ISO/IEC 8859-1 is becoming a de facto Internet standard. The strength of the 8-bit code table is that it is big enough to cover even the richest sound-symbol alphabets by adding only one extra bit to the 7 initial ones, and its weakness is that this one extra bit is not enough to cover all languages. Thus, the result must inevitably be quite a number of distinct 8-bit code tables, in itself a source of new confusion. As long as all potential readers have agreed to use a specific code table, there will be no problem, but for uninitiated readers it may be hard to know which code table to use.

5. The state of the art for the languages of the Barents Region

5.1. THE SCANDINAVIAN LANGUAGES

Since Latin 1 is the web standard, Finnish and Scandinavian PC users can write their texts directly. So can Mac users, provided they have a web font following Latin 1. PC users read web text directly, whereas the Mac versions of the browsers contain translation tables that automatically translate the Latin 1 web texts to the Apple standard. Although Latin 1 has a solid position on the web, Scandinavian (and to some extent also Finnish) web text surprisingly often reverts to the 7-bit &oslash; for ø, etc. This is probably due to 7-bit based text-to-HTML translation programs, and to lack of knowledge about the possibilities of the 8-bit standard.

5.2. LATIN-BASED SÁMI LANGUAGES

The Sámi languages are seldom seen on the net. Northern, Skolt and Inari Sámi contain 11 characters not included in Latin 1. One may think of two ways of making them Internet-readable: either 7-bit HTML could be extended to these characters as well (along the lines of &auml; for ä, etc.), or Sámi could get an 8-bit standard of its own, in the same way as Russian. The first solution is clearly not to be recommended. It is doubtful whether HTML will be extended to cover the Sámi letters, and this cumbersome solution becomes even more so as the number of non-English characters grows. The second solution, and the one that was actually chosen, is the Russian one: to make a Sámi code page (provisionally called Latin 9), make fonts available on the net, and agree upon using this and only this code page on the net.

Latin 9 will, if it succeeds, replace Latin 4 in Finland and Winsam (a private standard based upon, but crucially altering, Latin 1) in Norway. Whether it will succeed or not is an open question, depending upon whether users are actually willing to change the fonts on their own computers in order to get rid of the chaotic situation. User conservatism and the producers' eagerness to please their customers are thus the major obstacles. Factors supporting Latin 9 are the following:
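Which letters fall outside a given code table can be tested directly. The sketch below does so for Northern Sámi only; the alphabet string is supplied here for illustration and is not taken from the standards, and the function name `missing_from` is hypothetical. The provisional Latin 9 discussed in this article has no Python codec (Python's `iso8859_15` is a different, later table), so Latin 6 (`iso8859_10`) is used for comparison.

```python
# Assumed for illustration: the small letters of the Northern Sámi alphabet.
NORTHERN_SAMI = "aábcčdđefghijklmnŋoprsštŧuvzž"


def missing_from(alphabet: str, code_table: str) -> list[str]:
    """Return the letters of alphabet that have no position in code_table."""
    missing = []
    for letter in alphabet:
        try:
            letter.encode(code_table)
        except UnicodeEncodeError:
            missing.append(letter)
    return missing


print("Latin 1:", missing_from(NORTHERN_SAMI, "iso8859_1"))
print("Latin 6:", missing_from(NORTHERN_SAMI, "iso8859_10"))
```

Skolt and Inari Sámi add further letters to the ones reported here for Northern Sámi.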

* There is in the Sámi community a strong will to get rid of the chaotic situation, even if it means sacrificing one's own standard.

* The major Sámi software contributors have promised to upgrade the old fonts to Latin 9 for all registered users.

* Today's solutions are very expensive, ranging between NOK 1000 and NOK 5000 for each computer. The Working Committee for Standardisation of Sámi Characters has announced a competition for Sámi fonts, where the goal is to raise capital to buy the licence rights to a font, keyboard driver and translation program solution (following Latin 9), and then distribute high-quality freeware Sámi fonts to the whole Sámi community (through both the Internet and disc-distribution centres).

* During the next year, many important Sámi institutions (the most important probably being Sámi Radio) will publish Internet pages in Sámi.

All taken together, this might be enough to outweigh the conservatism of the users.

5.3. RUSSIAN

Russian teaches us that ISO/IEC status for a standard is seldom crucial. Instead of the ISO/IEC Cyrillic standard 8859-5, a private, non-official standard (KOI8) is the de facto Russian standard on the net. Its position is not that dominant, though: for example, the web site of Petrozavodsk University offers 3 parallel Russian standards (none of them being ISO/IEC 8859-5). A further problem is that, as far as I know, KOI8 is not in use outside the web. Russian texts must thus always be converted into web format.

Fig. 5. The right part of ISO/IEC 8859-5

Fig. 6. The right part of KOI8

From a language diversity point of view, KOI8's strong and 8859-5's weak position is regrettable. KOI8 contains only the standard Russian alphabet, all additional characters being graphical symbols, whereas 8859-5 also contains 28 non-Russian Cyrillic symbols, which make it possible to write Bulgarian, Byelorussian, Macedonian, Serbian and Ukrainian in addition to Russian. The fact that the latest version of Netscape (3.0) supports KOI8, 8859-5 and Microsoft's CP 1251 is a welcome development: at least it will be possible to exchange information that also contains the additional characters of 8859-5.
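The difference in coverage between the three tables can be illustrated with a few words. This is a sketch under the assumption that Python's `koi8_r`, `iso8859_5` and `cp1251` codecs stand for KOI8, ISO/IEC 8859-5 and CP 1251; the sample words and the function name `can_write` are chosen for the example only.

```python
RUSSIAN_TABLES = ["koi8_r", "iso8859_5", "cp1251"]


def can_write(text: str, code_table: str) -> bool:
    """True if every character of text has a position in code_table."""
    try:
        text.encode(code_table)
    except UnicodeEncodeError:
        return False
    return True


samples = {
    "Russian": "Баренцев регион",
    "Ukrainian": "Україна",
    "Serbian": "Србија",
}

for language, text in samples.items():
    support = {table: can_write(text, table) for table in RUSSIAN_TABLES}
    print(language, support)

# Moving a Russian text between tables means decoding it and encoding it again.
koi8_bytes = samples["Russian"].encode("koi8_r")
iso_bytes = koi8_bytes.decode("koi8_r").encode("iso8859_5")
```

The last two lines show the kind of conversion that is needed whenever a text moves between the web format and another Russian code table.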

5.4. NON-SLAVIC CYRILLIC-BASED LANGUAGES

The situation for the Cyrillic-based minority languages of the Barents Region is not as bright as it is for the Latin-based Sámi languages. For them, KOI8 and ISO/IEC 8859-5 are equally useless. An ideal Cyrillic Internet code page should also contain the Latin A-Z symbols, so the only positions left are the 96 positions in the 8-bit area of the code page. Of these, 64 are occupied by the basic Cyrillic alphabet. Of the remaining 32, at least some should go to technical symbols like §, °, quotation marks and the like, and we are left with approximately 20-30 free positions, allowing for 10-15 non-Russian letters. But whereas the construction of written languages for all the minority languages of the Soviet Union in the 1930s was one of the most significant contributions the Soviet empire made to human history, the result has some serious flaws from a standardiser's point of view. The most annoying one is the creativity in the construction of new letter symbols. The current Kildin Sámi orthography contains 19 symbols not included in the basic Cyrillic alphabet. Adding length marks on the vowels gives another 24 symbols. The Nenets orthography is not that problematic. It contains only 4 non-Russian symbols, 2 of which are the single and double apostrophe, symbols likely to be included anyway. The remaining two symbols transcribe the ng-sound, and consist of the Cyrillic symbol for N (which is H) with a downward left-turning tail attached to the right leg of the H.

A "Latin 9-like" solution for the minority languages in Russia would be to make a standard that takes KOI8 as its starting point, and then adds minority language symbols in all other positions. A Barents KOI8 code table, containing Kildin Sámi and Nenets in addition to Russian, could be arranged as follows:

Fig. 7. A possible Barents KOI8

A Barents KOI8 constructed following the principles sketched here will cover the Russian, Nenets and Kildin Sámi languages, but no other languages.

The proposed solution has approximately the same technical symbols as Latin 9. The Kildin Sámi long vowels (marked with a horizontal line over each vowel symbol) are ignored in this proposal. They are not used in ordinary texts, but since they are used both in dictionaries and in primers, they obviously should be included in an ideal Kildin Sámi code table. Due to Cyrillic practice, the 5 Kildin Sámi vowels are written with two symbols each, and since each grapheme must have two versions (small and capital), that gives us 20 more positions for 5 long vowels. The inadequacy of 8-bit code tables is thus clearly demonstrated.

An alternative solution would be to use floating length marks and place them in the 4 remaining positions: one short mark (for the narrow i symbol) and one longer mark (for the rest), each of them in a high position (for capital letters) and in a low one (for small letters). This would require the use of combined characters (one grapheme composed of two codes). The proposal for a new subpart of 8859 for Vietnamese uses exactly this technique to cope with the marking of the Vietnamese tones. There are problems with this solution, though, and it clearly should be replaced with a one-to-one correspondence between symbol and number.
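The difference between a combined character and a one-to-one correspondence can be illustrated with the combining marks of present-day Unicode. This is only an analogy to the 8-bit proposal above, not an implementation of it: the combining macron and the `unicodedata` normalisation used here are assumptions of the example, and the helper `long_vowel` is invented.

```python
import unicodedata

COMBINING_MACRON = "\u0304"  # a floating length mark placed over the preceding letter
CYRILLIC_A = "\u0430"
CYRILLIC_I = "\u0438"


def long_vowel(vowel: str) -> str:
    """Write a long vowel as base letter + floating length mark (two codes)."""
    return vowel + COMBINING_MACRON


for vowel in (CYRILLIC_A, CYRILLIC_I):
    combined = long_vowel(vowel)
    # NFC normalisation replaces the pair by a single code where one is defined.
    composed = unicodedata.normalize("NFC", combined)
    print(combined, len(combined), len(composed))
```

Where a single code exists for the long vowel, the pair can be replaced by it; where none exists, the grapheme remains composed of two codes, with the problems for sorting and searching that this implies.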

With the introduction of ISO/IEC 10646, most of these problems are automatically solved, since almost all non-Slavic Cyrillic characters of the languages in Russia are included. Still, one of the languages not represented in 10646 is precisely Kildin Sámi. Thus, work aimed at revising 10646 on this particular point should be initiated as soon as possible.

6. Conclusion

Due to increased accessibility, Internet's role as a local net is likely to grow. This makes the character set technology of the net even more important. Rather than being written for the average American web surfer, the future web page will be written for well known customers, fellow citizens, or for neighbours. Thus, future web pages will reflect human life on this planet in all its colourful varieties. Let us make it possible to reflect the linguistic variation as well.

Appendix

Fig. 8. The right part of ISO/IEC 8859-2

Fig. 9. The right part of ISO/IEC 8859-3

Fig. 10. The right part of ISO/IEC 8859-4

Fig. 11. The right part of ISO/IEC 8859-10

(Latin 6)

Fig. 12. Apple Sámi standard

Fig. 13. Apple Latin standard

References

ISO/IEC 8859. Information processing - 8-bit single-byte coded graphic character sets. Geneva, Switzerland 1987-1992.

ISO/IEC 10646-1. Information technology - Universal Multiple-Octet Coded Character Set (UCS) - Part 1: Architecture and Basic Multilingual Plane. Geneva, Switzerland 1993.
