In 1880, Karel Holle began a project to have a list of 1000 common words translated into the many languages found across the Indonesian archipelago. Over the next 50 years the list was published, revised, and expanded repeatedly, and the results eventually comprised a Rosetta Stone of 300 languages as they existed in the early 20th century. This remarkable resource would likely have been lost entirely if not for the efforts of Wim Stokhof and Lia Saleh-Bronkhorst, who collated the various lists and editions, found decaying in a half-forgotten storage box, into a set of eleven volumes published between 1980 and 1987 by the Australian National University's Research School of Pacific Studies (now the College of Asia and the Pacific).

It is worth noting that although this collection is known as the Holle lists, the data itself was in most cases provided, collected, and transcribed not by Karel Holle but by the people of Indonesia. Holle was certainly responsible for the inception of the project and its first collation, but in practice his list of words was sent out across the country to be translated by local officials, teachers, priests, village heads, and other community leaders: really, anyone in the area who had a good knowledge of Dutch and could either translate the local language themselves or find someone who could. As such, these lists are a record of local knowledge and the product of local effort, with those responsible in most cases uncredited and forgotten.


This project aims to be the modern continuation of this effort. Holle wished to record the languages of Indonesia; Stokhof and Saleh-Bronkhorst (S&SB) aimed to rescue that resource from obscurity and decay; here the aim is to record, store, and present that same wealth of information in a way that, thanks to computing and the internet, is simply better. This is by no means a snide dig at the previous efforts; indeed, they had the much more labour-intensive job. But with the widespread adoption of computers, handling this kind of data has become orders of magnitude more practical. It can be ordered, searched, expanded, modified, corrected, and distributed to the world in a way that a typewriter-printed stack of eleven volumes, each a good two or three hundred pages long, never could have been.

The first phase of this project, then, is simply a digitisation of the lists published by S&SB. The second is to augment that data with a transliteration of each entry into the native script of its language (where such a script exists), an IPA transcription of each entry, and perhaps an updating of the orthography to match that of modern Indonesian (oe → u, for example).
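
As a rough illustration of the orthography and IPA steps in the second phase, here is a minimal Python sketch. It is not the project's actual tooling: the function name convert and both tables are hypothetical. TO_MODERN assumes the usual correspondences between the old Dutch-based spelling and modern Indonesian spelling, and TO_IPA copies a handful of rows from the Holle/IPA table reproduced further down, taking the first value where that table lists alternatives. It handles lowercase input only.

```python
# Illustrative tables only (assumptions, not the project's real rule set).
# TO_MODERN: old Dutch-based spelling -> modern Indonesian spelling.
TO_MODERN = {"oe": "u", "tj": "c", "dj": "j", "nj": "ny",
             "sj": "sy", "ch": "kh", "j": "y"}

# TO_IPA: a few rows taken from the Holle/IPA table in the extract below.
# Where the table gives alternatives (oe is [u] or [ʋ]) the first is used.
TO_IPA = {"oe": "u", "tj": "c", "dj": "j", "nj": "ñ",
          "sj": "š", "ch": "x", "j": "y", "ng": "ŋ"}


def convert(word, table):
    """Rewrite word in one left-to-right pass, trying the longest key first."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(word):
        for size in range(longest, 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            # No rule applies at this position: keep the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


print(convert("tjoetjoe", TO_MODERN))  # cucu
print(convert("kajoe", TO_IPA))        # kayu
```

The single pass with longest-match-first lookup matters: it stops the j produced from dj being converted a second time into y. As S&SB point out, oe and ng are ambiguous in many of the lists, so output like this would still need checking by someone who knows the language.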

The third and final phase is to use these augmented Holle lists as a starting point: a skeleton from which to construct a full, modern dictionary of every language in Indonesia through community contribution and moderation, essentially following a Wikipedia-esque philosophy. The internet makes it possible to push Holle's original plan to its logical endpoint. His method meant slowly carrying a small sample vocabulary on paper to various local authorities, who in turn distributed it to those who knew the language, then waiting for a return message that captured only a momentary snapshot of the state of that language. It is now possible to reach the people using these languages day-to-day directly and near-instantaneously for a continuously updated vocabulary, with no loss of fidelity as translations travel through a chain of people and, through community verification, with far less risk of anomalous translations, mistakes, and mis-transcriptions. Whole languages can be suggested and added without reprinting entire volumes, typos can be fixed immediately, and errors can be caught by experts in these languages. All of this can be done with an ease that would have been unimaginable when Holle first dreamt up his project.

Also worth mentioning is the incredible work of the Unicode Consortium over the past ten years or so, which in most cases makes it possible to work in the native scripts of these languages. Even for the Latin transliterations, text editors like vim now support UTF-8 encoding for all of these diacritics and modifiers. Continuing advances like these make a project of this type feasible and have a huge, mostly underappreciated, impact on the preservation of language.

A note on the digitisation

Command-line tools like sed and awk have been the workhorses. pdftk was used to trim the word lists from each volume, pdftopng to convert them to image files, and tesseract to do the OCR. A significant amount of cleaning of the results was done with sed and awk, followed by the extremely long task of manually verifying each entry against the original PDFs. The OCR did a fantastic job, but there are understandable and expected errors here and there: smudges and other marks on the scanned pages, for example, or, because of the typewriter-style typeface used, confusion between 1, l, and i. By far the biggest issue, though, was the heavy use of varied and uncommon accents and diacritical marks. I could find no way around these, so they were inserted entirely by hand. This is fine for languages like Mantang, which are almost entirely the basic alphabet with a couple of grave accents mixed in here and there. But Bà'à and those like it barely have an unmodified vowel amongst them, and half of those vowels are combined with underscore diacritics.
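
Some of that manual checking can be narrowed down mechanically. The Python sketch below is an assumption about how one might do it, not a record of what was actually run (the real cleaning used sed and awk). The names normalise and needs_review are hypothetical, and the rule that a digit 1 touching a letter is probably a misread l or i is only a heuristic.

```python
import re
import unicodedata

# Heuristic (an assumption): a digit 1 directly next to a letter is likely
# an OCR misreading of l or i. [^\W\d_] matches any Unicode letter.
SUSPECT = re.compile(r"(?<=[^\W\d_])1|1(?=[^\W\d_])")


def normalise(entry):
    """Return entry in Unicode NFC, so that a base letter followed by a
    combining accent compares equal to the precomposed character."""
    return unicodedata.normalize("NFC", entry)


def needs_review(entry):
    """True if the entry contains a digit 1 touching a letter."""
    return SUSPECT.search(normalise(entry)) is not None


for entry in ["malam", "ma1am", "11"]:
    print(entry, needs_review(entry))
```

Normalising first means hand-typed diacritics entered as separate combining marks are treated the same as precomposed letters, which keeps later searching and sorting consistent. A check like this only points at candidates; it does not replace verifying each entry against the original pages.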



Much of what is worth saying in terms of the history and importance of this effort has already been said by Stokhof and Saleh-Bronkhorst far more eloquently than I could put it, and from a position of far greater authority (and actual knowledge of linguistics, unlike myself), so here I reproduce some key parts of their introduction to Volume 1:

Extracts from Stokhof & Saleh-Bronkhorst, Vol. 1:


Box 109 in the Manuscript room of the Museum Nasional, Jalan Merdeka Barat 12, Jakarta, contains a set of papers known as the Holle lists. Partly eaten by bookworms, partly fallen apart because of old age, it seemed time to do something about it.

As early as 1880 the eminent authority and lover of the Netherlands Indies and their peoples, K.F. Holle (1829-1896) inspired by Max Müller's The Outline Dictionary for the use of Missionaries, Explorers and Students of Language, proposed to prepare a short list of approximately 1000 lexical items for dispersion throughout the archipelago in order to obtain a more detailed knowledge of the linguistic situation of the Dutch colony. As a product of his time he aptly combined scientific interests with utilitarian arguments: insight into language and dialect distribution obtained by means of interviews and wordlists would also facilitate the contact between the government and inhabitants of areas hardly ever visited and "waar toch onze bemoeienis, willens of onwillens, zal doordringen" (Introduction to the first edition). Supported by Het Bataviaasch Genootschap van Kunsten en Wetenschappen this plan was approved after some palaver by the colonial authorities in 1882, but it took twelve years to get a first booklet ready for distribution: Blanco woordenlijst uitgegeven op last der regeering van Ned.-Indië ten behoeve van taalvorschers in den Ned.-Indischen archipel, Batavia, Landsdrukkerij, 1894. In 1904 the first edition was slightly modified (based on suggestions of N. Adriani and others), expanded with a few short sentences and reprinted. A third impression appeared in 1911 which did not differ from the second printing. Finally, in 1931 the list was published again, but now significantly altered, expanded and provided with a different and better formulated introduction and a short questionnaire by S.J. Esser.

It could be inferred from the repeated printings that Holle's idea has been quite successful. Indeed, from 1895 until 1939 we find short acknowledgements of receipt of completed or partly completed lists scattered all over the Notulen (NBG) and the Jaarboek (JB) of the Bataviaasch Genootschap. However, on pages 189-192 of Indonesische Handschriften only 194 lists are registered, numbered from 4 up to and including 244 (which leaves us with the problem of what has become of the missing 50 numbers). Since then not much seems to have happened: a few lists reappeared, some others disappeared, some were borrowed and never returned and some just disintegrated and were eventually dusted away. A glance at this inventory suffices to establish that the remaining 234 lists contain materials on quite a number of languages and dialects which are not well known or have even never been studied.

There are three variants of the Holle lists (1894, 1904/1911 and 1931), all slightly different in content as well as in the ordering of the items. To facilitate comparative work a new list was set up. In this new basic list (NBL) all items from all lists were introduced with the exception of a few which never or hardly ever appeared to be filled in by the researchers.

It was planned that the lists would be completed by Dutch civil servants, officers, missionaries and 'intelligente Inlanders' such as village heads, merchants and teachers, in practice all people with a reasonable knowledge of Dutch. The items were therefore printed in that language.

Only the 1931 edition adds a Malay translation and, occasionally, even Malay dialectal forms in order to enable 'Inheemsche medewerkers' to fill in the lists. The items were not alphabetically ordered but loosely arranged in various domains, e.g. parts of the body, terms of relationship, natural phenomena, utensils, etc. In the NBL this arrangement has been maintained and the items are given as encountered in the original editions. They are, however, followed by English and Indonesian translations. The Indonesian glosses are often taken from the 1931 edition but where these seemed less satisfactory, new ones are given and/or information is added (between parentheses). English, Indonesian and Dutch indexes for the NBL are provided along with an alphabetical list of Swadesh items.

Entries are not given in our lists when in the original vocabularies the following instances were attested:
  1. Blank space (i.e. when no equivalent in the local vernacular was given),
  2. When it was indicated by the researcher that the informant was not familiar with the item in question and/or that it did not occur in the region,
  3. When it was indicated that the informant did not know the equivalent in the local vernacular,
  4. When a question mark was written without any further commentary.


Since the Museum does not allow photocopying of documents from its Manuscript room nor permit lending extra muros it seemed worthwhile to publish at least the most interesting parts of the collection, so that everybody in Indonesia or elsewhere interested in comparative and descriptive work could have free access to these materials.
Publication is considered all the more justified, for one thing, because in several cases data on the languages and dialects provided by the lists do not seem to have increased considerably since Holle's questionnaires were distributed (e.g. parts of the Moluccas and New Guinea, the Alor Archipelago, Halmahera, etc.), and for another, because our general knowledge about the language and dialect distribution in Indonesia is still very superficial. Any contribution which offers material may bring about an improvement in this deplorable situation or may function as a stimulus for further investigation, which is very much needed. The impact of Indonesian as the official language of the republic on minority language groups should not be underestimated. Due to modern means of communication, Indonesian is rapidly penetrating everyday life. The language is officially used as the vehicle of instruction at all levels of education. In the near future even those who live in the most remote areas of the archipelago will be acquainted with the Indonesian language in one of its many variants in some way or another. Since it enjoys a high status as the symbol of the unity of the nation and is intuitively linked with progress, development and such similar notions, this will inevitably introduce a different way of life with a different set of priorities (see Stokhof 1976:16).

For obvious reasons the government stresses the importance of the national language and it does not seem to be in a position to support the maintenance and use of the local languages, or to develop orthographies or educational materials. Consequently, the growth and further dispersion of Indonesian will continue at the expense of local languages and will eventually result in the disappearance of many of the smaller undescribed vernaculars. More than fragmentary information about language and dialect distribution is not available, let alone an estimate of numbers of speakers. At this very moment there may be languages which are on the verge of extinction - we are not even aware of it. Without overdramatising the situation, it seems reasonable to concentrate our attention on those language areas about which relatively little is known, to do as much survey work as possible along with full-scale grammatical descriptions of those languages which are in danger of complete extinction.

The publication of the vocabularies is meant to be a somewhat belated homage to Holle as the foresighted initiator of large-scale linguistic research and, killing two birds with one stone, it is hoped that it may contribute in some way to the so greatly needed inventory and description of the lesser-known languages of Indonesia.

We are well aware of the fact that quite a number of colleagues do not consider this kind of work worthwhile. Recently there has been a tendency demonstrable in certain linguistic quarters towards an undue separation of the field into what is called theoretical linguistics and descriptive linguistics (in the narrow sense of the word). This view has eventually resulted in a preference for the former and an underestimation of the latter. Collecting, comparing and publishing (old) wordlists is considered part of descriptive or comparative linguistics.

This seems to us a most distressing misinterpretation. General theories based on a limited set of relatively well studied languages of a certain type (predominantly Indo-European) are necessarily inadequate to explain the wealth of phenomena attested in those languages which are often ethnocentrically called exotic languages.

Descriptive linguistics, as we see it, has two basic tasks: (a) to supply data on individual languages and language groups in a meaningful way and (b) to create a linguistic metatheory in which structural properties characteristic for all human languages are related and explained. Description/analysis of individual languages presupposes a theoretical framework. On the other hand, the study of language universals without intensive and continuously increasing knowledge about all sorts of languages, is hardly imaginable: (a) and (b) are mutually dependent.

Transcription, orthography, etc.

The directions concerning the transcription of the data are the same for the first three printings. They are not always clear, especially in sequences such as [ai] as opposed to [ay] or [aiV] and have caused some misunderstanding. In most lists it is difficult to establish whether oe stands for a [u]-like sound or for [ɔɛ], [ɔe], [oɛ] or [oe], or whether ng symbolises [ŋ] or [ng].

The latter example also holds true for the 1931 impression, which, however, is less hybrid, generally speaking. In the introduction written by Holle to the first three editions, the researcher is advised to follow the Dutch orthography in his transcriptions. The fourth edition takes the Malay spelling as its starting point. Separation by / indicates a change in notation in the 1931 version.

Holle                             IPA
a / á                             [a]
å                                 [å] or [ω]
ǎ                                 [ɑ̌ˆ] or [əˇ]
a                                 [ɯ]
i / ï                             [ι]
ī / i                             [i]
e / é                             [e]
ē / é                             [e:ˆ]
ě / ə                             [ə]
o / ò                             [ɔ]
ō / ó                             [o]
oe                                [u], [ʋ]
eu                                [ɣ]
oej, aw, aj, etc.                 [uj, aw, aj]
au, ai, oi, oei, ei               [αu̯, αi̯, ɔi̯, ʋi̯]
a’oe, a’i, a’a, a’oe, etc.        [αʔu, αʔi]
αι etc.                           [αˆi̯] (?) [αʔu, αʔi]
j                                 [y]
tj                                [c]
dj                                [j]
nj                                [ñ]
ng                                [ŋ]
sj                                [š]
ʼ, ٴ / q                          [ʔ]
g̀, ɣ                              [ɣ]
r̀, ṛ                              [R, ʁ]
k̲, ḳ (k), ṯ, ṭ, (t), etc.         [k ̚, t ̚], etc.
nn, kk, etc.                      [n:, k:], etc.
ā, aa, ī, ii, etc.                [a:, i:], etc.
ã, ĩ, etc.                        [ã, ĩ], etc.
k̲̲, b̲̲, etc.                        [ᵑk, ᵐb], etc.
(k), (t), etc.                    []
-aᵏ, -kᵃ, etc.                    []
ch                                [x]
V’                                [Vʔ]
‘V                                [ħV]
CV̈, VV̈                            [C"V], [V"(ʔ)V]