Methodology
The four strata
Every line of an inscription is held in four parallel strata, each preserved as a distinct field rather than collapsed into a single transcription:
- Runic. The original runiform text, character by character, in Unicode. Characters are not silently normalised; allographs are preserved.
- Transliteration. A one-to-one rendering of the runiform into Latin script, retaining script-level information (vowel-harmony class, allographic distinctions) that a phonological transcription would lose.
- Transcription. A phonological reading reflecting the reconstructed pronunciation. Where editors disagree on the vocalism, both readings are recorded as alternative transcriptions, attributed to their analyst.
- Translation. A modern-language rendering (currently English and Turkish, with French planned). The translation is conservative: it does not paper over editorial uncertainty in the source.
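As a rough illustration of what "four parallel strata" could mean at the data level, the sketch below keeps each stratum in its own field. The class and field names (`InscriptionLine`, `Reading`, `transcriptions`, and so on) are assumptions made for this example, not COLT's actual schema, and the angle-bracketed strings are placeholders rather than real readings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reading:
    """One phonological reading, attributed to the analyst who proposed it."""
    text: str
    analyst: str


@dataclass
class InscriptionLine:
    """One line of an inscription; each stratum is a separate field."""
    line_id: str
    runic: str            # runiform text in Unicode, not normalised
    transliteration: str  # one-to-one Latin rendering
    # Competing vocalisations are kept side by side, never merged.
    transcriptions: list[Reading] = field(default_factory=list)
    # Language code -> translation text.
    translations: dict[str, str] = field(default_factory=dict)

    def analysts(self) -> list[str]:
        """Names of the analysts whose readings are recorded for this line."""
        return [reading.analyst for reading in self.transcriptions]


# Placeholder content only: the strings below stand in for real data.
line = InscriptionLine(
    line_id="KT-S 1",
    runic="<runiform text>",
    transliteration="<transliteration>",
)
line.transcriptions.append(Reading("<reading A>", "Analyst A"))
line.transcriptions.append(Reading("<reading B>", "Analyst B"))
line.translations["en"] = "<English translation>"
line.translations["tr"] = "<Turkish translation>"
```

Keeping the transcription stratum as a list of attributed readings, rather than a single string, is one way to record editorial disagreement without choosing a winner.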
Transliteration conventions
The transliteration follows an extended version of the conventions in Talat Tekin’s A Grammar of Orkhon Turkic (1968). Each consonantal grapheme is annotated with a superscript marking its vowel-harmony class (¹ for back, ² for front), and a small set of additional symbols disambiguates graphemes that would otherwise collapse in a Latin rendering. The full key will accompany the first public release.
Where the conventions of Mehmet Ölmez or Annemarie von Gabain diverge from Tekin’s, the divergence is recorded at the analysis level rather than imposed on the transliteration. The transliteration stratum is meant to be theory-light; the analytical apparatus is where editorial positions live.
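The superscript convention described above can be read mechanically. The sketch below is a minimal illustration that handles only the two superscripts named here (¹ for back, ² for front); the function name `harmony_classes` and the sample string are invented for the example, and the additional disambiguating symbols of the full key are not covered.

```python
from typing import Optional

# Only the two superscripts named in the conventions are handled.
HARMONY = {"¹": "back", "²": "front"}


def harmony_classes(translit: str) -> list[tuple[str, Optional[str]]]:
    """Pair each grapheme with the harmony class of the superscript after it.

    Graphemes without a superscript get None as their class.
    """
    pairs: list[tuple[str, Optional[str]]] = []
    for char in translit:
        if char in HARMONY:
            # A superscript must follow a grapheme that has no class yet.
            if not pairs or pairs[-1][1] is not None:
                raise ValueError(f"stray superscript {char!r} in {translit!r}")
            grapheme, _ = pairs[-1]
            pairs[-1] = (grapheme, HARMONY[char])
        else:
            pairs.append((char, None))
    return pairs


# Illustrative string, not taken from the corpus.
sample = harmony_classes("b¹k²m")
```

Here `sample` pairs `b` with the back class and `k` with the front class, and leaves `m` unmarked.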
Tokens, lemmas, morphemes
Each line is segmented into tokens, with their position in the line preserved. Each token is linked to one or more analyses; an analysis decomposes the token into segments, each pointing either to a lemma (a dictionary head) or to a morpheme (a grammatical element). Morphemes carry standard glossing labels (PST, 1SG, LOC, CAUS, …) so that the corpus can be queried at the level of grammatical category, not only of surface form.
Where a token admits more than one reading (a synchronic interpretation as a frozen lexeme alongside a diachronic decomposition into root and suffixes, or two competing vocalisations from different editors), both analyses are recorded in parallel, each attributed to its analyst with a confidence rating and editorial note.
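The token, analysis, and segment relationships described above might be modelled along the following lines. This is a sketch under stated assumptions: the class names, the 0 to 1 confidence scale, the analyst label, and the token position are all hypothetical. Only the segmentation of olurtum and the glossing labels come from this page.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass(frozen=True)
class Segment:
    """One piece of a token: a lemma (dictionary head) or a morpheme."""
    form: str
    kind: Literal["lemma", "morpheme"]
    gloss: str  # meaning for lemmas, glossing label for morphemes


@dataclass(frozen=True)
class Analysis:
    """One attributed reading of a token."""
    analyst: str
    segments: tuple[Segment, ...]
    confidence: float  # assumed scale: 0.0 (doubtful) to 1.0 (secure)
    note: str = ""


@dataclass
class Token:
    surface: str
    position: int  # position within the line
    # Competing analyses are stored in parallel, never merged.
    analyses: list[Analysis] = field(default_factory=list)


def tokens_with_gloss(tokens: list[Token], gloss: str) -> list[Token]:
    """Return tokens with at least one analysis containing the given morpheme gloss."""
    return [
        token
        for token in tokens
        if any(
            segment.kind == "morpheme" and segment.gloss == gloss
            for analysis in token.analyses
            for segment in analysis.segments
        )
    ]


olurtum = Token(surface="olurtum", position=10)  # position is illustrative
olurtum.analyses.append(
    Analysis(
        analyst="Analyst A",  # hypothetical attribution
        segments=(
            Segment("olur-", "lemma", "to sit; to ascend the throne"),
            Segment("-D", "morpheme", "PST"),
            Segment("-(X)m", "morpheme", "1SG"),
        ),
        confidence=0.9,  # hypothetical value
    )
)

past_tense_tokens = tokens_with_gloss([olurtum], "PST")
```

Because the query runs over glossing labels rather than surface strings, it would find a past-tense form regardless of which allomorph of the suffix appears in the text.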
A line in four strata
The first line of the south face of the Kül Tigin inscription (KT-S 1), shown here in all four strata with two translations:
Token analysis
When a token is selected in the corpus (here olurtum from KT-S 1), a full entry is shown: the three formal layers (runic, transliteration, transcription), the morphemic segmentation, alternative editorial readings, and an etymological note.
- olur-: "to sit; to ascend the throne", verb stem
- -D: past tense suffix
- -(X)m: 1sg personal ending, possessive-derived
Lemmatised search
Because every token is linked to its lemma(s) and morpheme(s), the corpus supports search at the lemma level as well as the surface level. A click on the lemma olur- or on the morpheme -(X)m returns the list of every place that lemma or morpheme appears across the corpus, with a direct link from each occurrence to the sentence in which it stands.
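One common way to support this kind of lookup is an inverted index from each lemma or morpheme to its occurrences. The sketch below assumes a flat list of occurrence records; the record layout, the function name `build_index`, the token positions, and the `DEMO` rows are invented for illustration and do not describe COLT's storage.

```python
from collections import defaultdict

# (line id, token position, surface form, segment forms)
OCCURRENCES = [
    ("KT-S 1", 10, "olurtum", ("olur-", "-D", "-(X)m")),
    # The rows below are invented so that the index has more than one hit.
    ("DEMO 1", 3, "olurtum", ("olur-", "-D", "-(X)m")),
    ("DEMO 2", 1, "<surface>", ("<lemma>", "-(X)m")),
]


def build_index(occurrences):
    """Map each lemma or morpheme form to the places where it occurs."""
    index = defaultdict(list)
    for line_id, position, surface, segments in occurrences:
        # dict.fromkeys removes repeated segments while keeping their order,
        # so a token is listed once per form even if the form repeats.
        for form in dict.fromkeys(segments):
            index[form].append((line_id, position, surface))
    return dict(index)


index = build_index(OCCURRENCES)
lemma_hits = index["olur-"]      # every occurrence of the lemma
morpheme_hits = index["-(X)m"]   # every occurrence of the morpheme
```

Each hit carries the line identifier and token position, which is the information an interface would need to link back to the sentence in which the occurrence stands.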
Editorial attribution
No analysis is anonymous. Every reading is attributed to the editor who proposed it (Tekin, Ölmez, von Gabain, and others as the corpus grows), and editorial notes are preserved alongside the segmentation. Where COLT departs from a published reading, the departure is recorded as an explicit decision, not silently absorbed.
Citation
Each inscription, face, line, sentence, token, and analysis carries a stable identifier. Citation conventions follow the pattern COLT/KT-S 1, token 10, with persistent links from each entity in the interface. A versioned, citable release of the underlying data will be deposited on Zenodo with a DOI once the encoding of the Kül Tigin inscription is complete.
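For readers who want to handle such citations programmatically, the sketch below parses and regenerates strings of the form COLT/KT-S 1, token 10. The grammar is inferred from that single example: it assumes an uppercase inscription siglum, a hyphen, an uppercase face code, a line number, and an optional token number. The real identifier scheme may well be richer, and the names `Citation`, `parse_citation`, and `format_citation` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    inscription: str
    face: str
    line: int
    token: Optional[int]  # None when the citation points at a whole line


# Pattern inferred from the one example given; an assumption, not a spec.
_PATTERN = re.compile(
    r"^COLT/(?P<inscription>[A-Z]+)-(?P<face>[A-Z]+) (?P<line>\d+)"
    r"(?:, token (?P<token>\d+))?$"
)


def parse_citation(text: str) -> Citation:
    """Split a citation string into its components."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a recognised citation: {text!r}")
    token = match.group("token")
    return Citation(
        inscription=match.group("inscription"),
        face=match.group("face"),
        line=int(match.group("line")),
        token=int(token) if token is not None else None,
    )


def format_citation(citation: Citation) -> str:
    """Render a Citation back into its string form."""
    base = f"COLT/{citation.inscription}-{citation.face} {citation.line}"
    if citation.token is None:
        return base
    return f"{base}, token {citation.token}"


parsed = parse_citation("COLT/KT-S 1, token 10")
round_trip = format_citation(parsed)
```

Parsing and formatting are kept as inverse operations, so a citation that parses successfully is rendered back to the same string.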