In computer programming, digraphs and trigraphs are sequences of two and three characters, respectively, that appear in source code and, according to a programming language's specification, should be treated as if they were single characters. Trigraphs have been removed from the C++ language and, as of C23, from C, so they are unlikely to see much use in C in practice, nor in any other mainstream language (their use in the language J is an exception). With Unicode/UTF-8, and even with plain ASCII, language design no longer needs trigraphs, which were considered a burden. Digraphs are hardly needed either and likely have very few users, at least in those languages.
Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use, and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.
The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set.[1]
Trigraphs are not commonly encountered outside compiler test suites. Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor (TRIGRAPH.EXE), to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).
Different systems define different sets of digraphs and trigraphs, as described below.
Early versions of ALGOL predated the standardized ASCII and EBCDIC character sets, and were typically implemented using a manufacturer-specific six-bit character code. A number of ALGOL operations either lacked codepoints in the available character set or were not supported by peripherals, leading to a number of substitutions including := for ← (assignment) and >= for ≥ (greater than or equal).
The Pascal programming language supports the digraphs (., .), (* and *) for [, ], { and } respectively. Unlike all other cases mentioned here, (* and *) were and still are in wide use. However, many compilers treat them as a different type of commenting block rather than as actual digraphs; that is, a comment started with (* cannot be closed with } and vice versa.
The J programming language is a descendant of APL but uses the ASCII character set rather than APL symbols. Because the printable range of ASCII is smaller than APL's specialized set of symbols, the . (dot) and : (colon) characters are used to inflect ASCII symbols, effectively interpreting unigraphs, digraphs or, rarely, trigraphs as standalone "symbols".
Unlike the use of digraphs and trigraphs in C and C++, there are no single-character equivalents to these in J.
See also: C alternative tokens.
Trigraph | Equivalent
---|---
??= | #
??/ | \
??' | ^
??( | [
??) | ]
??! | \|
??< | {
??> | }
??- | ~

The C preprocessor (used for C and, with slight differences, in C++; see below) replaces all occurrences of the nine trigraph sequences in this table by their single-character equivalents before any other processing (until C23[2]). A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the sequence can arise only inside character constants, string literals, and comments.
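The replacement step described above can be sketched as a small C function. This is only an illustration of the nine-entry mapping in the table, not how any particular compiler implements it; the names `replace_trigraphs`, `TRIGRAPH_THIRD` and `TRIGRAPH_EQUIV` are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Third character of each trigraph and, at the same index, its equivalent. */
static const char TRIGRAPH_THIRD[] = "=/'()!<>-";
static const char TRIGRAPH_EQUIV[] = "#\\^[]|{}~";

/*
 * Copy src to dst, replacing every trigraph sequence with its
 * single-character equivalent. dst must have room for strlen(src) + 1
 * bytes, because the output is never longer than the input.
 * Returns the length of the output string.
 */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t out = 0;

    while (*src != '\0') {
        /* A trigraph is two question marks followed by a listed character. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0') {
            const char *hit = strchr(TRIGRAPH_THIRD, src[2]);
            if (hit != NULL) {
                dst[out++] = TRIGRAPH_EQUIV[hit - TRIGRAPH_THIRD];
                src += 3;
                continue;
            }
        }
        /* Anything else, including a lone question mark, is copied as is. */
        dst[out++] = *src++;
    }
    dst[out] = '\0';
    return out;
}
```

When calling this function from C code that may itself be compiled with trigraphs enabled, the input literal can be written with the `\?` escape sequence (for example `"?\?="`), so that the source file contains no trigraph of its own.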
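The ??/ trigraph, which stands for a backslash, can cause surprises when it ends a line, because the resulting backslash-newline pair is then removed by line splicing. A // comment that ends in ??/ therefore absorbs the following line, giving a single logical comment line (// comments are used in C++ and C99), and a block comment whose delimiter characters are separated by ??/ and a newline is still a correctly formed block comment. The same effect can be used to check for trigraphs: in a C99 function where a // comment ending in ??/ is followed by two return statements, only one return statement will be executed.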
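A minimal sketch of such a check follows, assuming a C99 or later compiler. The function name `trigraphs_available` is chosen here for illustration. Which value it returns depends on whether the compiler is run with trigraph replacement enabled (for example, through GCC's `-trigraphs` option), and many compilers warn about the comment line.

```c
/*
 * Returns 1 if the compiler performed trigraph replacement on this file,
 * and 0 otherwise. The line comment below needs C99 or later.
 */
int trigraphs_available(void)
{
    // are trigraphs available??/
    return 0;
    return 1;
}
```

With trigraphs enabled, the end of the comment becomes a backslash, the next line is spliced onto the comment, and only `return 1;` remains. With trigraphs disabled, the comment ends at the end of its own line and `return 0;` is executed first.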
In 1994, a normative amendment to the C standard (C95),[4][5] later included in C99, supplied digraphs as more readable alternatives to five of the trigraphs. Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: (the digraph spelling of ##).
C++
See also: C alternative tokens.
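As an illustration of the C95 digraphs, the following sketch writes a small function using digraph spellings in place of brackets, braces and the number sign. The spellings `<:`, `:>`, `<%`, `%>` and `%:` are not listed in the text above; they are taken from the C standard, and the same spellings exist in C++. The names `digraph_sum` and `VALUE_COUNT` are hypothetical.

```c
#include <stddef.h>

%:define VALUE_COUNT 4   /* same as: #define VALUE_COUNT 4 */

/* Sum a small fixed array, using digraphs for [ ] { } throughout. */
int digraph_sum(void)
<%
    int values<:VALUE_COUNT:> = <% 1, 2, 3, 4 %>;
    int total = 0;

    for (size_t i = 0; i < VALUE_COUNT; i++) <%
        total += values<:i:>;
    %>
    return total;
%>
```

Because each digraph is a complete token, the compiler treats this exactly like the same function written with the usual punctuation.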