This document describes requirements for the layout and presentation of text in the Tibetan script for use with Web standards and technologies, such as HTML, CSS, Mobile Web, Digital Publications, and Unicode. In addition to Tibet and China, the script is widely used in Bhutan, Nepal, India and throughout the Tibetan diaspora, and requirements for these regions are also included in the scope of the document.

This document describes the basic requirements for layout and text support on the Web and in eBooks for the Tibetan language, using the Tibetan script, as used in China, Bhutan, India, and other countries. These requirements provide information for Web technologies such as CSS, HTML and digital publications about how to support users of the Tibetan script. The information here is developed in conjunction with a document that summarises gaps in support on the Web for Tibetan.

The document contains English and partial Chinese versions of the text, which can be filtered using the buttons at the top right of the window. The English version is the authoritative version.

The editor's draft of this document is being developed as part of the Tibetan Language Enablement program of the W3C Internationalization Interest Group. It will be published by the Internationalization Working Group. The end target for this document is a Working Group Draft Note.

Introduction

About this document

This document provides information about the Tibetan script as used for the Tibetan and Dzongkha languages in China, Bhutan, India, and the Tibetan diaspora.

The document should contain no reference to a particular technology. For example, it should not say "CSS does/doesn't do such and such", and it should not describe how a technology, such as CSS, should implement the requirements. It is technology agnostic, so that it will be evergreen, and it simply describes how the script works. The gap analysis document is the appropriate place for all kinds of technology-specific information.

Gap analysis

This document is pointed to by a separate document, Tibetan Gap Analysis, which describes gaps in support for Tibetan orthographies on the Web, and prioritises and describes the impact of those gaps on the user.

Wherever an unsupported feature is identified through the gap analysis process, the requirements for that feature need to be documented. Those requirements can be described here.

As gaps in support for Tibetan are captured, the gap is brought to the attention of the relevant spec developer or browser implementation community. The progress of such work is tracked in the Gap Analysis Pipeline.

Other related resources

The initial material for this document was an edited extract from the page Tibetan orthography notes. That page contains additional details about character usage and how the writing system works, plus interactive information, which are not available here.

Some additional information was based on a talk by Jianxin Yin.

The document Language enablement index points to this document and others, and provides a central location for developers and implementers to find information related to various scripts.

The W3C also maintains a tracking system that has links to GitHub issues in W3C repositories. There are separate links for (a) requests from developers to the user community for information about how scripts/languages work, (b) issues raised against a spec, and (c) browser bugs. For example, you can find out what information developers are currently seeking, and the resulting list can also be filtered by script.

Tibetan script overview

Tibetan can be written using two different styles: དབུ་ཅན dbu can with a head, the block style of the Tibetan script used in print, pronounced u.cen; and དབུ་མེད dbu med headless, the cursive style of the Tibetan script used in shorthand and calligraphy, pronounced u.me. This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.

Historically, Tibetan text was written on loose-leaf sheets called pechas ( དཔེ་ཆ pé.t͡ɕʰá book, scripture ). Some of the characters used and some of the formatting approaches differ between books and pechas.

Tibetan text runs left to right in horizontal lines.

Word boundaries are not indicated. However, Tibetan words are made up of one or more units called tsheg-bar, which are basically equivalent to phonological syllables. The tsheg-bar units are separated using U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG.

These tsheg-bar units are composed of structural elements that include vowel signs and consonants used as prefixes, root characters, subscripts, superscripts, suffixes, and secondary suffixes. A common realisation includes a stack and additional consonants to either side of the root consonant. These may indicate syllable-final consonant sounds, but more often than not they qualify or modify the root value, and are not associated with their nominal sound value. The actual pronunciation of Tibetan is usually much simpler than a typical romanisation would suggest. For example, the word བཀོད kǿː to create is transcribed as bkod.

རྒྱུད་
The single-syllable word cy᷈ː string, with an initial stack of three consonants plus a vowel sign, followed by a suffix consonant (to the right).

To write the sounds of the standard Lhasa dialect, Tibetan uses 30 consonant letters (plus their subjoined forms). 6 more letters are used to write Sanskrit.

A distinguishing feature of Tibetan is the set of separate code points for subjoined consonants, used to create consonant stacks. Of the 77 combining characters in the Tibetan block, 48 represent subjoined consonant forms. Unlike many other Indic scripts, the modern Tibetan orthography doesn't use a virama to create stacks.

Tibetan is an abugida with one inherent vowel. When writing the Lhasa dialect, other post-consonant vowels are represented using 4 vowel signs, all combining marks.

There are no pre-base, circumgraph, or multipart vowels in the Tibetan used to write the Lhasa dialect (though there are when writing in Sanskrit).

Standalone vowels are written by adding vowel signs to either U+0F60 TIBETAN LETTER -A or U+0F68 TIBETAN LETTER A, depending on the tone.

Sanskrit vowels written in Tibetan use additional vowel signs and combining marks, some of which represent diphthongs, and some of which form circumgraphs or multipart characters, depending on the encoding.

Tone is not explicitly indicated in the orthography.

Modern Tibetan writing uses few punctuation marks or symbols, but the Tibetan script block in Unicode contains many of these.

Tibetan has its own set of numbers.

Tibetan Syllables

The following diagram shows characters in all of the syllabic positions, and lists the characters that can appear in each of the non-root locations. The two-syllable word in the example is འགྲེམས་སྟོན 'grems-ston ɖɹemton exhibition.

Picture of syllable composition.

Syllable composition in Tibetan

Structural boundaries & markers

Head marks

In traditional, loose-leaf Tibetan pechas a head mark or yig-mgo (yig go) is used at the beginning of the front of the folio so that you can tell which is the front.

Head marks are also used in both pechas and books to indicate the start of a headline or the start of the first paragraph in a longer text.

[[[#fig_head_marks]]] shows a common head mark, U+0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA, and the extension character U+0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA. A head mark can be written alone, or can be followed by as many as three closing marks; head marks are also followed by two shads.

༄༅༎ ཡོངས་ཁྱབ་གསལ་བསྒྲགས་འགྲོ་བ་མིའི་ཐོབ་ཐང༌།

Example of use of head marks at the start of the Universal Declaration of Human Rights.

Head marks differ from text to text. The Unicode Standard provides a number of characters to give some basic coverage, but may not meet all needs.

Three less common head marks, used in Nyingmapa and Bonpo literature, are also represented in the Tibetan block, namely:

U+0F01 TIBETAN MARK GTER YIG MGO TRUNCATED A
U+0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA
U+0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA

Word & syllable boundaries

Word boundaries are not indicated by the Tibetan orthography. However, phonetic syllables, represented by a sequence of letters known as a tsheg-bar tsek bar, are separated by U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG.

གློག་བརྙན་ཁང
The tsek in use to separate the component tsheg-bar units within a single word.

གློག་བརྙན་ཁང lô.ȵɛ̃́.kʰáŋ cinema
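As a rough illustration of the segmentation described above, the following Python sketch splits a run of text into tsheg-bar units at each tsheg. The function name `split_tsheg_bar` is hypothetical, and treating the non-breaking U+0F0C as an equivalent separator is an assumption made for this sketch. It does not find word boundaries, and it leaves any shad or other punctuation attached to the neighbouring unit.

```python
TSHEG = "\u0F0B"        # TIBETAN MARK INTERSYLLABIC TSHEG
TSHEG_BSTAR = "\u0F0C"  # TIBETAN MARK DELIMITER TSHEG BSTAR (non-breaking tsheg)


def split_tsheg_bar(text: str) -> list[str]:
    """Split text into tsheg-bar units at every tsheg.

    Assumption for this sketch: the non-breaking tsheg separates
    units in the same way as the ordinary tsheg.
    """
    units = text.replace(TSHEG_BSTAR, TSHEG).split(TSHEG)
    # A trailing tsheg produces an empty final item; drop empty items.
    return [unit for unit in units if unit]


# The three tsheg-bar units of the word for "cinema".
print(split_tsheg_bar("གློག་བརྙན་ཁང"))
```

The result is a list of syllable-like units only; grouping those units into words would need a dictionary or other linguistic knowledge that the orthography itself does not supply.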

ཡོངས་ཁྱབ་གསལ་བསྒྲགས་
འགྲོ་བ་མིའི་ཐོབ་ཐང༌།
This figure shows the use of the tsheg-bar across a whole sentence. There is no indication of word boundaries.

ཡོངས་ཁྱབ་གསལ་བསྒྲགས་ འགྲོ་བ་མིའི་ཐོབ་ཐང༌།

See also .

Phrase & section boundaries

Sections & topics

Key divisions of the text include expressions (brjod-pa) and topics (don-tshan). They do not necessarily equate to English phrases, sentences and paragraphs.

Sections normally end with U+0F0D TIBETAN MARK SHAD (called shad but pronounced ʃe) followed by a space. Topics (eg. headlines, verses, and longer paragraphs) are often terminated with a double shad or separated with shad+space+shad.

Examples of double shad.

དུང་དང་འོ་མར་འགྲན་པའི་ལྷག་བསམ་མཐུ། །དམན་ཡང་དཀར་པོའི་བྱས་འབྲས་ཅུང་ཟད་ཅིག །བློ་དང་འདུན་པ་བཟང་བའི་རང་རིགས་ཀུན། །རྒྱལ་ཁའི་འཕྲིན་བཟང་ལས་དོན་འགྲུབ་ཕྱིར་འབད།།

A phrase that ends with the root consonant U+0F40 TIBETAN LETTER KA or U+0F42 TIBETAN LETTER GA will normally swallow up the shad that immediately follows it, even if there is a vowel sign. For example, where you might expect to see a double shad, you might see ཀུ ། and སྐུ །. However, the shad is not omitted if these characters have a subscript, eg. གྲུ། །.

GA swallowing up a shad at the end of a topic.

དུང་དང་འོ་མར་འགྲན་པའི་ལྷག་བསམ་མཐུ། །དམན་ཡང་དཀར་པོའི་བྱས་འབྲས་ཅུང་ཟད་ཅིག །བློ་དང་འདུན་པ་བཟང་བའི་རང་རིགས་ཀུན། །རྒྱལ་ཁའི་འཕྲིན་བཟང་ལས་དོན་འགྲུབ་ཕྱིར་འབད།།

GA swallowing up a shad at the end of a section.

རྩོམ་པ་པོ། ལྡོང་ཕྲུག པར་སྐྲུན་ཁང་། གངས་ཅན་པ་ཚོང་འཕྲིན་ཇུས་འགོད་ཞབས་ཞུ་ལྟེ་གནས་ནས་དཔར།

When a phrase ends with shad+space+shad the space between the shad marks is normally reduced in Tibetan pechas, down to 1/4 or 1/3 of the normal width, or made to fit the space available. Some space is retained to avoid the appearance of a double-shad.

Boundaries between chapters or significant sections may also be represented by a double-shad followed by 5-6 spaces and another double-shad.

Double shads separated by several spaces between chapters or significant sections.

དང་འཇམ་པའི་དབྱངས་དང་པ་བྱང་ཆུབ་སེམས་དཔའ་ཡིན༎     ༎སང་རྒྱས་བཅོམ་ལྔན་འདས་དགྲ་བཅོམ་པ་་་་་་

U+0F0E TIBETAN MARK NYIS SHAD can be used for the double-shad.

In Chinese magazine publications, articles may contain no double shay as a delimiter (the text is formatted in paragraphs). The double shay may still be found at the very end of some articles, or at the end of each line on a page containing verse-formatted folk literature. The same applies to large parts of Bhutanese newspapers; however, other pages contain plenty of double shays, some at the end of paragraphs and some inside paragraphs.

Unicode provides U+0F0E TIBETAN MARK NYIS SHAD as a means of regularising the spacing between the two shad marks, which tends to be slightly bigger than a normal space. The space between the shad marks can be stretched during justification, however, and it's not clear how that would work when using NYIS SHAD.

U+0F08 TIBETAN MARK SBRUL SHAD is used to separate texts that are equivalent to topics and subtopics, such as the start of a smaller text, the start of a prayer, a chapter boundary, or to mark the beginning and end of insertions into text in pechas.

This drul-shad is usually surrounded on both sides by the equivalent of about three non-breaking spaces (though no rule is specified). The drul-shad should not appear at the beginning of a new line and the whole structure of spacing-plus-shad needs to be kept together.

For U+0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD see the section Line breaks and rin chen spungs shad.

U+0FBE TIBETAN KU RU KHA (often repeated three times) indicates a refrain.

Tsek and section boundaries

The tsheg is not used before a shad, except after U+0F44 TIBETAN LETTER NGA. For example, note the end of the three sections in [[[#fig_tsek_shad]]]:

Tsheg not being used before shad, and of U+0F0C being used between NGA and shad.

དོན་ཚན་དང་པོ། འགྲོ་བ་མིའི་རིགས་རྒྱུད་ཡོངས་ལ་སྐྱེས་ཙམ་ཉིད་ནས་ཆེ་མཐོངས་དང༌། ཐོབ་ཐངགི་རང་དབང་འདྲ་མཉམ་དུ་ཡོད་ལ།

So that line-breaking keeps the NGA + tsheg + shad together, U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR should be used between NGA and a shad. This is a non-breaking version of the tsheg (the word 'delimiter' in the name is a misnomer).

TSHEG BSTAR being used between NGA and shad.

སུ་གང་གིས་གང་ཞིག་གི་རྒྱལ་ཁབ་ཐོབ་ཐང་བཙན་ཤེད་ཀྱིས་འཕྲོག་པ་དང༌། ཡང་ན་དེའི་རྒྱལ་ཁབ་བརྗེ་བསྒྱུར་གྱི་ཐོབ་ཐང་བཀག་འགོག་བྱེདཔའི་རིགས་མི་ཆོག །
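The substitution recommended above can be sketched in Python as a simple text clean-up step. This is only an illustration: the function name `use_tsheg_bstar` is hypothetical, and the set of vowel signs allowed between NGA and the tsheg (U+0F71 to U+0F7D, plus U+0F80) is an assumption chosen for the sketch rather than a rule stated in this document.

```python
import re

NGA = "\u0F44"          # TIBETAN LETTER NGA
TSHEG = "\u0F0B"        # TIBETAN MARK INTERSYLLABIC TSHEG
TSHEG_BSTAR = "\u0F0C"  # TIBETAN MARK DELIMITER TSHEG BSTAR
SHAD = "\u0F0D"         # TIBETAN MARK SHAD

# Assumption: vowel signs that may sit on NGA before the tsheg.
VOWEL_SIGNS = "\u0F71-\u0F7D\u0F80"

# NGA, optional vowel signs, an ordinary tsheg, then (lookahead) a shad.
_NGA_TSHEG_SHAD = re.compile(f"({NGA}[{VOWEL_SIGNS}]*){TSHEG}(?={SHAD})")


def use_tsheg_bstar(text: str) -> str:
    """Replace the ordinary tsheg between NGA and a shad with U+0F0C."""
    return _NGA_TSHEG_SHAD.sub(lambda m: m.group(1) + TSHEG_BSTAR, text)
```

A tsheg in any other position is left untouched, so the function can be applied repeatedly to the same text without further change.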

White space

Space is used as a punctuation mark in Tibetan, to separate meaning in sections. It should not appear at the start of a line.

Spaces in Tibetan text are usually wider than spaces in English text, and typically only occur after one of the following:

  1. U+0F0D TIBETAN MARK SHAD
  2. U+0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
  3. U+0F14 TIBETAN MARK GTER TSHEG
  4. ཿ U+0F7F TIBETAN SIGN RNAM BCAD

However, numbers and embedded Western text are surrounded by smaller spaces, eg.

ལོ་ ༢༠༠༡ ཤིང་བྱ་ཟླ་ ༩ ཚེས་ ༥ ཉིན་

So that line-breaks work correctly, NBSP U+00A0 NO-BREAK SPACE should be used for spaces when they appear after U+0F40 TIBETAN LETTER KA or U+0F42 TIBETAN LETTER GA, or between 2 shad or double-shad characters. It should also be used for spacing around U+0F08 TIBETAN MARK SBRUL SHAD.
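A minimal Python sketch of the no-break space recommendation follows. The function name `protect_spaces` is hypothetical. The sketch only handles a space that directly follows KA or GA (not one that follows a vowel sign on those letters), a single space between two shad or double-shad characters, and runs of spaces on either side of the sbrul shad; anything beyond that is outside what it attempts.

```python
import re

NBSP = "\u00A0"        # NO-BREAK SPACE
KA = "\u0F40"          # TIBETAN LETTER KA
GA = "\u0F42"          # TIBETAN LETTER GA
SHAD = "\u0F0D"        # TIBETAN MARK SHAD
NYIS_SHAD = "\u0F0E"   # TIBETAN MARK NYIS SHAD
SBRUL_SHAD = "\u0F08"  # TIBETAN MARK SBRUL SHAD

_RULES = [
    # A space directly after KA or GA.
    re.compile(f"(?<=[{KA}{GA}]) "),
    # A single space between two shad or double-shad characters.
    re.compile(f"(?<=[{SHAD}{NYIS_SHAD}]) (?=[{SHAD}{NYIS_SHAD}])"),
    # Runs of spaces immediately before or after a sbrul shad.
    re.compile(f" +(?={SBRUL_SHAD})|(?<={SBRUL_SHAD}) +"),
]


def protect_spaces(text: str) -> str:
    """Swap ordinary spaces for NBSP in the positions listed above."""
    for rule in _RULES:
        text = rule.sub(lambda m: NBSP * len(m.group(0)), text)
    return text
```

Each matched space is replaced one for one, so the number of spaces around a sbrul shad is preserved.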

Except for special situations, such as the use of sbrul shad, it is recommended to use a single space where gaps appear, and to stretch that space where necessary.

Emphasis

Modern texts tend to bold text for emphasis.

However, U+0F35 TIBETAN MARK NGAS BZUNG NYI ZLA may also be used to create a similar effect to underlining or to mark emphasis/honorifics.

When entered as a combining character, the mark can be added after the vowel sign in a stack. Normally, however, this mark is attached to the syllable as a whole, halfway between its rendered start and end. If there is an even number of spacing characters in the syllable, the mark falls between the two middle characters, rather than beneath one character.

Application software has to ignore this character for text processing operations such as search and collation.
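One way an application might ignore the mark is to remove it from both the text and the query before comparing them. The Python sketch below illustrates that idea only; the names `search_key` and `contains` are hypothetical, and a real collation implementation would normally treat the character as ignorable in its collation rules rather than deleting it from a copy of the text.

```python
NGAS_BZUNG_NYI_ZLA = "\u0F35"  # TIBETAN MARK NGAS BZUNG NYI ZLA


def search_key(text: str) -> str:
    """Return a copy of the text with the emphasis mark removed."""
    return text.replace(NGAS_BZUNG_NYI_ZLA, "")


def contains(haystack: str, needle: str) -> bool:
    """Substring search that ignores the emphasis mark on both sides."""
    return search_key(needle) in search_key(haystack)
```

With this approach a search for an unmarked word finds the same word when it has been emphasised with U+0F35, and vice versa.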

Alternative methods of emphasis include use of a different colour, or the use of the prefix .

བསོ༵ད་ནམ༵ས་
Use of colour and diacritics to emphasise text.

Typography for Tibetan paragraphs

Tibetan Writing Mode

Tibetan is written horizontally and read from left to right.

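Tibetan text is written from left to right, with lines progressing from top to bottom; this is also referred to simply as horizontal, left-to-right typesetting.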

Is Tibetan written vertically with upright glyphs at all (eg. in table headings, in pictures, etc.)? If so, does it require that all elements composing a syllable be kept together in horizontal fashion, placing just syllables one above the other? Or does each non-subjoined/combining character move to the next line? etc.

Line breaking

Normally, Tibetan only breaks after a tsek (U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG), and doesn't break after spaces.

Line breaks do not occur after a tsek when it follows U+0F44 TIBETAN LETTER NGA (with or without a vowel sign) and precedes a shay (U+0F0D TIBETAN MARK SHAD).

The Unicode Standard also talks of other instances where Tibetan grammatical rules do not permit a break, but it isn't clear what those are.

If the character after NGA is an ordinary tsek, then lines should not break between the tsek and the shay. Text is likely to be more portable if content authors use the TSHEG BSTAR in these locations, instead of the normal tsek.

Line breaks are also possible after:

Tibetan never breaks inside a syllable, and has no hyphenation. If a word is composed of multiple syllables, it is also preferable to avoid breaking a line in the middle of the word.

A line must never start with a shad.
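The basic rules in this section can be sketched as a function that lists the positions where a line may break. This is a simplified Python illustration, not a complete line-breaking algorithm: the name `break_opportunities` is hypothetical, treating U+0F0E in the same way as U+0F0D is an assumption, and the preference for not breaking inside a multi-syllable word is not modelled because word boundaries are not marked.

```python
TSHEG = "\u0F0B"      # TIBETAN MARK INTERSYLLABIC TSHEG
SHAD = "\u0F0D"       # TIBETAN MARK SHAD
NYIS_SHAD = "\u0F0E"  # TIBETAN MARK NYIS SHAD (assumed to behave like SHAD)


def break_opportunities(text: str) -> list[int]:
    """Return the indexes at which a new line may start.

    A break is allowed only after an ordinary tsheg, never after a space,
    and never where the new line would begin with a shad. The last rule
    also keeps NGA + tsheg + shad together.
    """
    positions = []
    for i, ch in enumerate(text):
        if ch != TSHEG:
            continue
        following = text[i + 1 : i + 2]
        if not following:
            continue  # end of text: nothing left to wrap
        if following in (SHAD, NYIS_SHAD):
            continue  # a line must never start with a shad
        positions.append(i + 1)
    return positions
```

Because U+0F0C TIBETAN MARK DELIMITER TSHEG BSTAR is a different character from the ordinary tsheg, it never produces a break opportunity here, which matches its role as a non-breaking tsheg.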

Line breaks and rin chen spungs shad

In Tibetan, especially in pechas, it is considered a special case if the last syllable of an expression that is terminated by a shay breaks onto a new line. In that case the shay or double shay is replaced by rin chen spungs shad, U+0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD.

At the end of a topic the rules say that only one shay should be converted, ie. ༑ །, however it is moderately popular to convert both, ie. ༑ ༑. This change serves as an optical indication that there is a left-over syllable at the beginning of the line that actually belongs to the preceding line.

This varies in the following cases:

  • when a line starts with ལེའུ། །, no rin chen spungs shad would be used, since le'u is pronounced as two syllables.

  • sometimes only the first of two shays is replaced, ie. ༑ །, but this style is considered less attractive.

  • some printed books do not use rin chen spungs shad replacements, however the majority of books apply the same rules as are used with pechas.

In an environment where the width or content of the page can change, this feature poses a problem for the content author. The application needs to be able to automatically switch between the two styles of shad as a syllable moves on or off a new line when the page is resized or when preceding content is modified.
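The automatic switching described above could, in outline, be applied after lines have been wrapped. The Python sketch below is an illustration under several assumptions: the function name `mark_leftover_syllables` is hypothetical, the input is a list of already-wrapped lines of one continuous text, only the first shad is converted, and the exception for ལེའུ is not handled. An application would need to recompute this whenever the text is reflowed, starting again from the unconverted text.

```python
TSHEG = "\u0F0B"                # TIBETAN MARK INTERSYLLABIC TSHEG
SHAD = "\u0F0D"                 # TIBETAN MARK SHAD
RIN_CHEN_SPUNGS_SHAD = "\u0F11"  # TIBETAN MARK RIN CHEN SPUNGS SHAD


def mark_leftover_syllables(lines: list[str]) -> list[str]:
    """Convert the shad after a lone syllable that wrapped onto a new line.

    A line qualifies when the text before its first shad is a single
    syllable, that is, it contains no tsheg and no space. The first line
    is never changed, since nothing precedes it.
    """
    result = lines[:1]
    for line in lines[1:]:
        idx = line.find(SHAD)
        head = line[:idx] if idx > 0 else ""
        if head and TSHEG not in head and " " not in head:
            line = line[:idx] + RIN_CHEN_SPUNGS_SHAD + line[idx + 1 :]
        result.append(line)
    return result
```

Converting only the first shad follows the rule mentioned in this section; the popular alternative of converting both shads would need a small extension.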

The Unicode Standard adds: "Not only is rin-chen-spungs-shad used as the replacement for the shay but a whole class of “ornamental shays” are used for the same purpose. All are scribal variants on a rin-chen-spungs-shad, which is correctly written with three dots above it."

What are the rules for line-wrap for other symbols? Are there rules about certain characters not starting/ending a line?

Justification

There are two alternative methods of justification.

Inter-character spacing

Spacing between all characters should be adapted equally. Note that the width of the white-space character should not be changed significantly, so Tibetan texts use the non-breaking space mentioned above, which doesn't change width on justification.

Tsek padding

When writing by hand, authors add small spaces across the text to get the line end as near as possible to the right margin. Where space remains at the margin, it may be left as is, if it is short. Otherwise, the remaining space will be filled with tseks to make the line as flush as possible with the right margin (there will usually still be a slight raggedness to the right edge of the text).

A page of a booklet showing tsek padding.

There are a couple of detailed rules about the use of tsek padding.

Justifying tseks are almost always used when the line ends in a tsek. If, however, the line ends in a shay, there are a number of alternatives.

If the line ends with a single shay the shay is followed by spaces. Tsek padding is never applied after spaces. (See examples in the figure above.)

If the line ends in a double shay (with space between), it is unusual (though possible) to add tsek padding. Instead, the space between the shays is stretched or narrowed. (See examples in the figure below.) The same applies if the second shay was removed because it was preceded by a KA or GA.

Booklet pages showing double shay usage at the end of a line.
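The tsek padding rules can be illustrated with a small Python sketch. It makes strong simplifying assumptions: line width is approximated by counting characters that are not nonspacing marks, every such character is treated as equally wide, and the names `cells`, `pad_with_tsheg` and the `slack` parameter are hypothetical. A real implementation would measure rendered glyph widths instead.

```python
import unicodedata

TSHEG = "\u0F0B"  # TIBETAN MARK INTERSYLLABIC TSHEG


def cells(line: str) -> int:
    """Crude width estimate: count characters that are not nonspacing marks."""
    return sum(1 for ch in line if unicodedata.category(ch) != "Mn")


def pad_with_tsheg(line: str, target: int, slack: int = 1) -> str:
    """Fill the gap at the end of a line with tseks.

    Padding is only applied when the line already ends in a tsek, so lines
    ending in a shay or in spaces are returned unchanged. A gap no wider
    than `slack` cells is left as it is.
    """
    if not line.endswith(TSHEG):
        return line
    gap = target - cells(line)
    if gap <= slack:
        return line
    return line + TSHEG * gap
```

Lines that end in a double shay are deliberately left alone here, since the text above says the space between the shays is usually adjusted instead.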

Baseline alignment

When text in smaller annotations or larger heading text is mixed with normal text, the letter-heads of all characters should align to the same height.

Lists and counters

This section needs attention. Questions include: Is a space expected after the counter? If so, is it a non-break space? Is any other punctuation needed after a Tibetan counter? How common is the Tibetan vs European counter?

Tibetan numerals can be used for list counters. The Tibetan numbers are used in a simple decimal notation, ie. in the same way as European numerals. They differ only in shape.

༡ འ་ཞ་མི་རིགས་ཀྱིས་བསྐྲུན་པའི་ཤིང་གི་ཟམ་པ།

༢ ལོ་ངོ་800ཡི་ལོ་རྒྱུས་ལྡན་པའི་དགོན་རྙིང་ཆོས་པོ་དགོ།

༣ ཆི་ཅ་ཞེས་པའི་ཁྱིམ་རྒྱུད་ཀྱི་བང་སོའི་ཚོགས།

Examples of Tibetan counters in a list.
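Because Tibetan digits are used in plain decimal notation, producing a Tibetan counter from a number is a one-to-one digit substitution. The Python sketch below illustrates this; the function name `to_tibetan_digits` is hypothetical, and the sketch says nothing about what punctuation or spacing should follow the counter.

```python
# Map European digits 0-9 to TIBETAN DIGIT ZERO..NINE (U+0F20..U+0F29).
_TO_TIBETAN = str.maketrans("0123456789", "༠༡༢༣༤༥༦༧༨༩")


def to_tibetan_digits(number: int) -> str:
    """Render a non-negative integer using Tibetan digits."""
    if number < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(number).translate(_TO_TIBETAN)


# Counters for a three-item list.
counters = [to_tibetan_digits(n) for n in range(1, 4)]
```

In the other direction, Python's built-in `int()` accepts Unicode decimal digits, so a string of Tibetan digits can be parsed without a separate mapping.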

European numerals can be used for list counters. The European numeral is followed by a period.

1. འ་ཞ་མི་རིགས་ཀྱིས་བསྐྲུན་པའི་ཤིང་གི་ཟམ་པ།

2. ལོ་ངོ་800ཡི་ལོ་རྒྱུས་ལྡན་པའི་དགོན་རྙིང་ཆོས་པོ་དགོ།

3. ཆི་ཅ་ཞེས་པའི་ཁྱིམ་རྒྱུད་ཀྱི་བང་སོའི་ཚོགས།

Examples of European numeral counters in a list.

References

Chinese national standards, or other standards documents, cited in this report.

Revision Log

Acknowledgements

胡春明 (Chunming Hu) prepared an early translation of parts of this document (now removed).

This document has been developed with contributions from participants of the Chinese Layout Requirement Task Force, with kind help from experts from 信标委中文信息处理分技术委员会及藏文信息处理工作组.