UTF-EBCDIC explained

UTF-EBCDIC
Encodes: Unicode
Based on: UTF-8
By: IBM
Definitions: Unicode Technical Report #16

UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using 1 to 5 bytes (in contrast to a maximum of 4 for UTF-8).[1] It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

To produce the UTF-EBCDIC encoding of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first, creating what the specification calls an I8 sequence. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to the corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. Because such a byte can hold only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+03FF can be longer than its UTF-8 encoding.
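The first stage can be sketched in a few lines of Python. This is a minimal illustration, not the reference algorithm from the specification: the function name utf8_mod_encode is hypothetical, trailing bytes are assumed to have the form 101xxxxx (5 payload bits each), and the lead-byte markers are assumed to follow the same pattern as UTF-8 (110, 1110, 11110, 111110). The exact bit layouts should be checked against Unicode Technical Report #16.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode scalar value as a UTF-8-Mod (I8) byte sequence.

    Sketch under stated assumptions: trailing bytes are 101xxxxx and
    lead bytes use UTF-8-style markers; see UTR #16 for the definition.
    """
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")

    # ASCII and the C1 controls (U+0000..U+009F) stay single bytes.
    if cp <= 0x9F:
        return bytes([cp])

    # (largest code point, lead-byte marker, number of trailing bytes)
    layouts = (
        (0x3FF, 0xC0, 1),      # 110yyyyy + 1 trailing byte  (10 bits)
        (0x3FFF, 0xE0, 2),     # 1110zzzz + 2 trailing bytes (14 bits)
        (0x3FFFF, 0xF0, 3),    # 11110www + 3 trailing bytes (18 bits)
        (0x10FFFF, 0xF8, 4),   # 111110vv + 4 trailing bytes (22 bits)
    )
    for limit, lead, ntrail in layouts:
        if cp <= limit:
            break

    out = [lead | (cp >> (5 * ntrail))]
    for i in range(ntrail - 1, -1, -1):
        # Each trailing byte carries 5 bits behind the 101 prefix.
        out.append(0xA0 | ((cp >> (5 * i)) & 0x1F))
    return bytes(out)


if __name__ == "__main__":
    for cp in (0x41, 0x9F, 0xA0, 0x400, 0x4000, 0x10FFFF):
        i8 = utf8_mod_encode(cp)
        utf8 = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}: I8 {i8.hex(' ')} ({len(i8)} bytes), "
              f"UTF-8 {utf8.hex(' ')} ({len(utf8)} bytes)")
```

Running the loop shows where the lengths diverge: U+0400 takes three I8 bytes but only two UTF-8 bytes, and U+10FFFF takes five instead of four.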

The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, "A" is still encoded as 0x41), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 0x41 in this table maps to 0xC1; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A").
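The second stage is a plain byte-for-byte substitution. The sketch below uses a deliberately partial table covering only the space, the digits, the basic Latin letters and the square brackets, at their usual EBCDIC positions; the real table in Unicode Technical Report #16 covers all 256 byte values. The names I8_TO_EBCDIC, to_utf_ebcdic and from_utf_ebcdic are hypothetical.

```python
def build_partial_table() -> dict:
    """Build a PARTIAL I8 -> EBCDIC byte table (illustration only)."""
    # Space and square brackets at their IBM-1047-style positions.
    table = {0x20: 0x40, 0x5B: 0xAD, 0x5D: 0xBD}
    # (first ASCII character, first EBCDIC byte, run length)
    runs = (
        ("0", 0xF0, 10),
        ("A", 0xC1, 9), ("J", 0xD1, 9), ("S", 0xE2, 8),
        ("a", 0x81, 9), ("j", 0x91, 9), ("s", 0xA2, 8),
    )
    for first, ebcdic_start, count in runs:
        for k in range(count):
            table[ord(first) + k] = ebcdic_start + k
    return table


I8_TO_EBCDIC = build_partial_table()
# Because the mapping is one-to-one, inverting it loses nothing.
EBCDIC_TO_I8 = {e: i for i, e in I8_TO_EBCDIC.items()}
assert len(EBCDIC_TO_I8) == len(I8_TO_EBCDIC)


def to_utf_ebcdic(i8: bytes) -> bytes:
    """Map every I8 byte through the table (KeyError if not covered)."""
    return bytes(I8_TO_EBCDIC[b] for b in i8)


def from_utf_ebcdic(data: bytes) -> bytes:
    """Undo the substitution with the inverted table."""
    return bytes(EBCDIC_TO_I8[b] for b in data)


if __name__ == "__main__":
    encoded = to_utf_ebcdic(b"Hello [1]")
    print(encoded.hex(" "))
    print(from_utf_ebcdic(encoded))
```

Because the substitution is applied to every byte independently, lead and trailing bytes of multi-byte I8 sequences pass through the same table as single-byte characters; a complete implementation would only need the full 256-entry table.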

UTF-EBCDIC is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, IBM Db2, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.

Codepage layout

There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). The single-byte portion resembles IBM-1047 rather than IBM-37, as the location of the square brackets shows: UTF-EBCDIC places [ and ] at hex AD and BD respectively, whereas CCSID 37 has them at hex BA and BB.
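The CCSID 37 side of this comparison can be checked with Python's built-in cp037 codec. This is only a small sanity check under two assumptions: that cp037 is an adequate stand-in for CCSID 37, and that the 160 single-byte characters are exactly the code points U+0000 through U+009F. The helper name bracket_positions is hypothetical.

```python
def bracket_positions(codec: str) -> tuple:
    """Return the hex byte values a codec assigns to '[' and ']'."""
    return tuple(f"{b:02X}" for b in "[]".encode(codec))


# Bracket positions in code page 37, via the standard-library codec.
cp037_brackets = bracket_positions("cp037")

# Assumed single-byte range of UTF-EBCDIC: U+0000 through U+009F.
single_byte_count = len(range(0x00, 0xA0))

if __name__ == "__main__":
    print("cp037 brackets:", cp037_brackets)
    print("single-byte code points:", single_byte_count)
```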

Oracle UTFE

Oracle UTFE is an Oracle database variation of Unicode 3.0 UTF-8, similar to the CESU-8 variant of UTF-8, in which supplementary characters are encoded as two 4-byte characters rather than as a single 4- or 5-byte character. It is used only on EBCDIC platforms.[2]

Notes and References

  1. "UTR #16: UTF-EBCDIC". www.unicode.org. Retrieved 2021-02-23. Quote: "You need to search at most five bytes (seven bytes, if the full range of 31 bits of ISO/IEC 10646 is considered) backwards."
  2. Baird, Cathy; Chiba, Dan; Chu, Winson; Fan, Jessica; Ho, Claire; Law, Simon; Lee, Geoff; Linsley, Peter; Matsuda, Keni; Oscroft, Tamzin; Takeda, Shige; Tanaka, Linus; Tozawa, Makoto; Trute, Barry; Tsujimoto, Mayumi; Wu, Ying; Yau, Michael; Yu, Tim; Wang, Chao; Wong, Simon; Zhang, Weiran; Zheng, Lei; Zhu, Yan; Moore, Valarie (2002) [1996]. "Appendix A: Locale Data". Oracle9i Database Globalization Support Guide. Release 2 (9.2). Oracle. A96529-01. Retrieved 2017-02-14. Archived at https://web.archive.org/web/20170214190952/https://docs.oracle.com/cd/B10501_01/server.920/a96529.pdf on 2017-02-14.