Private Use Areas Explained

In Unicode, a Private Use Area (PUA) is a range of code points that, by definition, will not be assigned characters by the Unicode Consortium.[1] Three private use areas are defined: one in the Basic Multilingual Plane, and one each in, and nearly covering, planes 15 and 16. The code points in these areas cannot be considered as standardized characters in Unicode itself. They are intentionally left undefined so that third parties may define their own characters without conflicting with Unicode Consortium assignments. Under the Unicode Stability Policy,[2] the Private Use Areas will remain allocated for that purpose in all future Unicode versions.

Assignments to Private Use Area characters need not be private in the sense of strictly internal to an organisation; a number of assignment schemes have been published by several organisations. Such publication may include a font that supports the definition (showing the glyphs), and software making use of the private-use characters (e.g. a graphics character for a "print document" function). By definition, multiple private parties may assign different characters to the same code point, with the consequence that a user may see one private character from an installed font where a different one was intended.

Definition

Under the Unicode definition, code points in the Private Use Areas are assigned characters—they are not noncharacters, reserved, or unassigned. Their category is "Other, private use (Co)", and no character names are specified. No representative glyphs are provided, and character semantics are left to private agreement.

The Unicode Standard describes them as follows: "Private-use characters are assigned Unicode code points whose interpretation is not specified by this standard and whose use may be determined by private agreement among cooperating users. These characters are designated for private use and do not have defined, interpretable semantics except by private agreement. ... No charts are provided for private-use characters, as any such characters are, by their very nature, defined only outside the context of this standard."[3]
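The properties described above can be inspected with a short Python sketch. It assumes the standard-library unicodedata module, whose data reflects whichever Unicode version ships with the interpreter; the helper name is_private_use is chosen here for illustration only.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """Return True if the single character ch has General_Category Co."""
    return unicodedata.category(ch) == "Co"


# One code point from each Private Use Area, plus two non-PUA comparisons
# (an ordinary letter and the noncharacter at the end of plane 16).
samples = ["\uE000", "\U000F0000", "\U0010FFFD", "A", "\U0010FFFF"]
for ch in samples:
    # name() returns the supplied default when the database has no name.
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
```

The private-use samples report category Co and fall back to the default name, while the final sample, a noncharacter, reports Cn rather than Co.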

Assignment

In the Basic Multilingual Plane (plane 0), the block titled Private Use Area has 6400 code points.

Planes 15 and 16 are almost[4] entirely assigned to two further Private Use Areas, Supplementary Private Use Area-A and Supplementary Private Use Area-B respectively. In UTF-16, a subset of the high surrogates (U+DB80..U+DBFF) is used for these and only these planes; these code units are called High Private Use Surrogates.
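The relationship between planes 15 and 16 and the High Private Use Surrogates can be checked with the standard UTF-16 surrogate arithmetic. The following is a minimal sketch; the function names are illustrative and not part of any library.

```python
def utf16_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def is_high_private_use_surrogate(unit: int) -> bool:
    """True for code units in the range U+DB80..U+DBFF."""
    return 0xDB80 <= unit <= 0xDBFF


# First and last code points of planes 15 and 16, plus the last of plane 14.
for cp in (0xEFFFF, 0xF0000, 0xFFFFF, 0x100000, 0x10FFFF):
    high, low = utf16_surrogates(cp)
    print(f"U+{cp:06X} -> {high:04X} {low:04X} "
          f"high private use: {is_high_private_use_surrogate(high)}")
```

Plane 15 begins exactly at high surrogate DB80 and plane 16 ends at DBFF, while the last code point of plane 14 maps to DB7F, just outside the range.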

Unicode PUA blocks

There are three PUA blocks in Unicode.

Block name: Private Use Area
Range: U+E000..U+F8FF
Script: Unknown
Assigned in Unicode 1.0.0: 5,632 code points
Added in Unicode 1.0.1: 768 code points
Note: Version 1.0.1 moved and expanded the Private Use Area block (previously located at U+E800-U+FDFF in version 1.0.0).[5] [6] [7]

Block name: Supplementary Private Use Area-A
Range: U+F0000..U+FFFFF
Script: Unknown
Noncharacters: 2
Assigned in Unicode 2.0: 65,534 code points
Code chart: https://unicode.org/charts/PDF/UF0000.pdf

Block name: Supplementary Private Use Area-B
Range: U+100000..U+10FFFF
Script: Unknown
Noncharacters: 2
Assigned in Unicode 2.0: 65,534 code points
Code chart: https://unicode.org/charts/PDF/U100000.pdf
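The code point counts for the three blocks follow from the range boundaries once noncharacters are excluded. The sketch below recomputes them; the noncharacter test assumes the standard definition (the last two code points of every plane, plus U+FDD0..U+FDEF), and the names BLOCKS, is_noncharacter and private_use_count are illustrative.

```python
# Block boundaries as listed above.
BLOCKS = {
    "Private Use Area": (0xE000, 0xF8FF),
    "Supplementary Private Use Area-A": (0xF0000, 0xFFFFF),
    "Supplementary Private Use Area-B": (0x100000, 0x10FFFF),
}


def is_noncharacter(cp: int) -> bool:
    """Last two code points of each plane, or U+FDD0..U+FDEF."""
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF


def private_use_count(start: int, end: int) -> int:
    """Count the code points in start..end that are not noncharacters."""
    return sum(1 for cp in range(start, end + 1) if not is_noncharacter(cp))


for name, (start, end) in BLOCKS.items():
    print(f"{name}: {private_use_count(start, end)}")
```

The BMP block contains no noncharacters, so its count is simply the size of the range; each supplementary block loses the two noncharacters at the end of its plane.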

History

In Unicode 1.0.0, the private use area extended from U+E800 to U+FDFF (i.e. did not include U+E000..E7FF, but additionally included the U+F900..FDFF range now occupied by CJK Compatibility Ideographs, Alphabetic Presentation Forms and Arabic Presentation Forms-A).[8] This was changed to U+E000..F8FF in Unicode 1.0.1, and remained so in Unicode 1.1.[9] Contrary to misconception, the range U+D800..DFFF (reserved for UTF-16 surrogates since Unicode 2.0) was not included in the private use range of any Unicode 1.x version.

Historically, planes E0 (224) through FF (255), and groups 60 (96) through 7F (127) of the Universal Coded Character Set (i.e. U+E00000 through U+FFFFFF and U+60000000 through U+7FFFFFFF) were also designated as private use. These ranges were removed from the specified private-use ranges when the UCS was restricted to the seventeen planes reachable in UTF-16.[10]

Usage

Standardization initiative uses

Many people and institutions have created character collections for the PUA. Some of these private use agreements are published, so other PUA implementers can aim for unused or less-used code points to prevent overlaps. Several characters and scripts previously encoded in private use agreements have actually been fully encoded in Unicode, necessitating mappings from the PUA to other Unicode code points.
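When a privately encoded character later receives an official code point, existing text has to be remapped. A minimal sketch of such a migration follows. The mapping table is HYPOTHETICAL and is not taken from any published private-use agreement; a real migration would use the table published by the agreement in question.

```python
import unicodedata

# HYPOTHETICAL mapping, for illustration only: these pairs are NOT taken
# from any published private-use agreement.
PUA_TO_STANDARD = {
    0xE000: 0x10450,
    0xE001: 0x10451,
}


def migrate(text: str) -> str:
    """Replace mapped private-use code points with their official ones."""
    return text.translate(PUA_TO_STANDARD)


def unmapped_private_use(text: str) -> list:
    """List private-use characters in text that the table does not cover."""
    return sorted({ch for ch in text
                   if unicodedata.category(ch) == "Co"
                   and ord(ch) not in PUA_TO_STANDARD})


sample = "a\uE000\uE001\uE002"
print([f"U+{ord(c):04X}" for c in migrate(sample)])
print([f"U+{ord(c):04X}" for c in unmapped_private_use(sample)])
```

Reporting unmapped private-use characters separately is a design choice in this sketch: because different parties may assign different characters to the same code point, silently passing them through could hide text that belongs to a different agreement.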

One of the more well-known and broadly implemented PUA agreements is maintained by the ConScript Unicode Registry (CSUR). The CSUR, which is not officially endorsed or associated with the Unicode Consortium, provides a mapping for constructed scripts, such as Klingon pIqaD and Ferengi script (Star Trek), Tengwar and Cirth (J.R.R. Tolkien's cursive and runic scripts), Alexander Melville Bell's Visible Speech, and Dr. Seuss' alphabet from On Beyond Zebra. The CSUR previously encoded the undeciphered Phaistos characters, as well as the Shavian and Deseret alphabets, which have all been accepted for official encoding in Unicode.

Another common PUA agreement is maintained by the Medieval Unicode Font Initiative (MUFI). This project is attempting to support all of the scribal abbreviations, ligatures, precomposed characters, symbols, and alternate letterforms found in medieval texts written in the Latin alphabet. The express purpose of MUFI is to experimentally determine which characters are necessary to represent these texts, and to have those characters officially encoded in Unicode. As of Unicode version 5.1, 152 MUFI characters have been incorporated into the official Unicode encoding.

Some agreed-upon PUA character collections exist in part or whole because the Unicode Consortium is in no hurry to encode them. Some, such as unrepresented languages, are likely to end up encoded in the future. Some unusual cases such as fictional languages are outside the usual scope of Unicode but not explicitly ruled out by the principles of Unicode, and may show up eventually (such as the Star Trek and Tolkien writing systems). In other cases, the proposed encoding violates one or more Unicode principles and hence is unlikely to ever be officially recognized by Unicode—mostly where users want to directly encode alternate forms, ligatures, or base-character-plus-diacritic combinations (such as the TUNE scheme).

Publishing organization | Topic | PUA area used | Font
 | Artificial and some ancient/medieval scripts | PUA (BMP) and Plane 15 | Code2000
 | Medieval scripts | PUA (BMP) | several
 | Phonetics and languages | PUA (BMP) |
 | Ancient and medieval scripts | PUA (BMP) | TITUS Cyberbit Basic

Vendor use

Informally, the range U+F000 through U+F8FF is known as the Corporate Use Area. This originates from early versions of Unicode, which defined an "End User Zone" extending from U+E000 upward and a "Corporate Use Zone" extending from U+F8FF downward, with the boundary between the two left undefined.[9]
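The informal split of the BMP Private Use Area can be expressed as a small classifier. This is only a sketch: the U+F000 boundary is the informal convention mentioned above, not a normative one, and the function name and return labels are chosen here for illustration.

```python
from typing import Optional

BMP_PUA_START = 0xE000
BMP_PUA_END = 0xF8FF
# Informal boundary only; early Unicode versions left it undefined.
CORPORATE_USE_START = 0xF000


def bmp_pua_zone(cp: int) -> Optional[str]:
    """Classify a BMP private-use code point by the informal convention."""
    if not BMP_PUA_START <= cp <= BMP_PUA_END:
        return None  # not in the BMP Private Use Area at all
    return "corporate" if cp >= CORPORATE_USE_START else "end-user"


for cp in (0xE000, 0xEFFF, 0xF000, 0xF8FF, 0xF900):
    print(f"U+{cp:04X}: {bmp_pua_zone(cp)}")
```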

Private-use characters in other character sets

The concept of reserving specific code points for Private Use is based on similar earlier usage in other character sets. In particular, many otherwise obsolete characters in East Asian scripts continue to be used in specific names or other situations, and so some character sets for those scripts made allowance for private-use characters (such as the user-defined planes of CNS 11643, or gaiji in certain Japanese encodings). The Unicode standard references these uses under the name "End User Character Definition" (EUCD).[3]

Additionally, the C1 control block contains two codes that ECMA-48 intends for private-use "control functions": 0x91 (PU1) and 0x92 (PU2).[35] [36] Unicode includes these at U+0091 and U+0092 but defines them as control characters (category Cc), not private-use characters (category Co).[6] [37]
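The distinction between the two C1 private-use control codes and true private-use characters is visible in the General_Category property. A brief check, assuming Python's standard unicodedata module:

```python
import unicodedata

# PU1 and PU2 from the C1 control block, compared with a BMP PUA code point.
for label, cp in (("PU1", 0x91), ("PU2", 0x92), ("PUA", 0xE000)):
    print(f"{label} U+{cp:04X}: {unicodedata.category(chr(cp))}")
```

Code that filters on category Co alone will therefore not catch PU1 and PU2; they have to be handled as control characters.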

Encodings which do not have private use areas but have more or less unused areas, such as ISO/IEC 8859 and Shift JIS, have seen uncontrolled variants of these encodings evolve.[38] For Unicode, software companies can use the Private Use Areas for their desired additions.

Notes and References

  1. Web site: Unicode Consortium. Glossary of Unicode Terms: "Private Use Area (PUA)".
  2. Web site: Unicode Character Encoding Stability Policy . 2022-03-03 . 2021-11-10.
  3. Web site: Chapter 23 Special Areas and Format Characters. The Unicode Standard Version 14.0 - Core Specification. Private Use characters.
  4. The last two characters of every plane are defined to be noncharacters. The remaining 65,534 characters of each of planes 15 and 16 are assigned as private-use characters.
  5. Web site: Unicode 1.0.1. The Unicode Standard. 1992-11-03. 2016-07-09. 2016-07-02. https://web.archive.org/web/20160702004420/http://www.unicode.org/versions/Unicode1.0.0/Notice.pdf. live.
  6. Web site: Unicode character database. The Unicode Standard. 2023-07-26.
  7. Web site: Enumerated Versions of The Unicode Standard. The Unicode Standard. 2023-07-26.
  8. Book: https://www.unicode.org/versions/Unicode1.0.0/ch03_5.pdf . 3.5: Private Use Area . 0-201-56788-1 . The Unicode Standard, Version 1.0, Volume 1 . 1991 . . 118–119 . 2021-10-11 . 2021-10-21 . https://web.archive.org/web/20211021205258/https://www.unicode.org/versions/Unicode1.0.0/ch03_5.pdf . live .
  9. Book: https://www.unicode.org/versions/Unicode1.1.0/ch02.pdf . 2.0: Changes in Unicode 1.0 . The Unicode Standard, Version 1.1 . UTR #4 . . 3–4 . 2021-10-11 . 2021-11-20 . https://web.archive.org/web/20211120194908/https://www.unicode.org/versions/Unicode1.1.0/ch02.pdf . live .
  10. Web site: Necessary changes for ISO/IEC 10646 regarding the PUA . Whistler . Ken . 2000 . UTC/00-015 . 2021-01-30 . 2021-06-23 . https://web.archive.org/web/20210623065232/https://www.unicode.org/L2/L2000-UTC/u2000-015.txt . live .
  11. Web site: Letter Database . Eki.ee . 2013-04-11 . 2018-05-21 . https://web.archive.org/web/20180521182103/http://www.eki.ee/letter/chardata.cgi?ucode=e000-f8ff . live .
  12. Web site: Character Sets: East Asian Characters: Alternative Unicode Mappings for MARC 21 Characters Assigned to the Private Use Area (PUA): MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media . Library of Congress . 2004-09-02 . 2013-04-11 . 2013-08-19 . https://web.archive.org/web/20130819180025/http://www.loc.gov/marc/specifications/specchar.pua.html . live .
  13. Web site: tunerfc.tn.nic.in . tunerfc.tn.nic.in . 2013-04-11 . https://web.archive.org/web/20100729194712/http://www.tunerfc.tn.nic.in/ . 2010-07-29 . dead .
  14. Web site: Unicode Corporate Use Subarea as used by Adobe Systems . October 22, 1998 . https://web.archive.org/web/20021009225850/http://partners.adobe.com/asn/developer/type/corporateuse.txt . October 9, 2002 . May 12, 2021 . dead.
  15. Web site: NSOpenStepUnicodeReservedBase - Apple Developer Documentation . Apple Inc. . 2020-10-16 . 2020-11-06 . https://web.archive.org/web/20201106115702/https://developer.apple.com/documentation/foundation/nsopenstepunicodereservedbase . live .
  16. Web site: CORPCHAR.TXT - Registry (external version) of Apple use of Unicode corporate-zone characters . Apple Computer, Inc. . 2005 . 1994 . . c03 . 2020-10-16 . 2020-10-30 . https://web.archive.org/web/20201030195128/https://unicode.org/Public/MAPPINGS/VENDORS/APPLE/CORPCHAR.TXT . live .
  17. Web site: WGL4 Unicode Range U+2013 through U+FB02. . https://web.archive.org/web/20140717022830/http://www.microsoft.com/typography/otspec/wgl4e.htm. 2014-07-17. dead.
  18. Web site: SFM Converts Macintosh HFS Filenames to NTFS Unicode . February 24, 2014 . Microsoft Support . https://web.archive.org/web/20160527200113/https://support.microsoft.com/en-us/kb/117258 . May 27, 2016 . dead.
  19. Web site: ntfs.util.c . 2008 . Invalid NTFS filename characters are encodeded using the SFM (Services for Macintosh) private use Unicode characters. . 2018-08-07 . 2018-08-07 . https://web.archive.org/web/20180807190401/https://opensource.apple.com/source/ntfs/ntfs-91.50.2/util/ntfs.util.c.auto.html . live .
  20. Web site: Microsoft Knowledge Base. The range of characters between U+F020 and U+F0FF in the Private Use Area of Unicode is mapped to symbol fonts in Richedit 4.1. https://web.archive.org/web/20121022095705/http://support.microsoft.com/kb/897872. 2012-10-22. dead.
  21. Web site: Handling of PUA Characters in Microsoft Software . 2003-04-25 . SIL International . https://web.archive.org/web/20150511005915/http://scripts.sil.org/cms/scripts/page.php?site%5Fid=nrsi&item%5Fid=PUACharsInMSSotware . 2015-05-11 . dead . 2014-03-04 .
  22. Web site: Comment #8 : Bug #651606 (circle-of-friends) : Bugs : Ubuntu Font Family. 2020-10-17. Launchpad. 5 October 2010 . en. 2020-10-17. https://web.archive.org/web/20201017022723/https://bugs.launchpad.net/ubuntu-font-family/+bug/651606/comments/8/+index. live.
  23. Web site: Comment #2 : Bug #853855 : Bugs : Ubuntu Font Family. 2020-10-17. Launchpad. 26 September 2011 . en. 2020-10-17. https://web.archive.org/web/20201017213250/https://bugs.launchpad.net/ubuntu-font-family/+bug/853855/comments/2/+index. live.
  24. Web site: Powerline status line plugin question on Stack Exchange mentioning private use area characters . 2015-03-22 . 2015-03-12 . https://web.archive.org/web/20150312201725/http://superuser.com/questions/762345/powerline-patched-fonts-on-osx-10-9-3-iterm2-chrome . live .
  25. Web site: Pictures showing private use area characters in Powerline patched fonts . 2015-03-22 . 2015-05-11 . https://web.archive.org/web/20150511211946/https://gist.github.com/agnoster/3712874#file-characters-png . live .
  26. Web site: Li . Renzhi . Proposal to add additional characters into the Graphics for Legacy Computing block of the UCS . 2023-07-31 . 2019-08-23.
  27. Web site: lmb-excp.ucm . . 2000-02-10 . 2020-04-23 . 2022-01-25 . https://web.archive.org/web/20220125034059/https://github.com/unicode-org/icu/blob/main/icu4c/source/data/mappings/lmb-excp.ucm . live .
  28. Book: Lotus 1-2-3 Version 3.1 Referenzhandbuch . de . Lotus 1-2-3 Version 3.1 Reference Manual . 1 . Anhang 2. Der Lotus Multibyte Zeichensatz (LMBCS) . Appendix 2. The Lotus Multibyte Character Set (LMBCS) . A2–1 – A2–13 . 1989 . . Cambridge, Massachusetts, US . 302168.
  29. Web site: CPGID 01445 (chart) . ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP01445.pdf . REGISTRY: Graphic Character Sets and Code Pages . C-H 3-3220-050 . 2012 . 2011 . The area shown in the chart above represents only 254 bytes of row FF in plane 0F.
  30. Web site: CPGID 01445: IBM AFP PUA No. 1 . ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP01445.txt . REGISTRY: Graphic Character Sets and Code Pages . C-H 3-3220-050 . 2012 . 2011 . The area shown in the chart above represents only 254 bytes of row FF in plane 0F.
  31. Web site: CPGID 01449: IBM default PUA . IBM Globalization: Code page identifiers . https://web.archive.org/web/20150916190822/http://www-01.ibm.com/software/globalization/cp/cp01449.html . 2015-09-16 . dead . IBM has designated 195 positions from U+F83D to U+F8FF for use as IBM Corporate-zone and intends to use them consistently within IBM whenever there is a need to maintain the round-trip integrity of IBM characters.
  32. (Included with)
  33. Web site: Configure character mapping for SMB file name translation on volumes. 9 December 2021 . 2022-10-14.
  34. Web site: Twitter Chirp Font. Copy Paste Dump. 2022-02-08.
  35. Web site: Standard ECMA-48, Fifth Edition - June 1991. §8.2.14 Miscellaneous control functions, §8.3.100, §8.3.101.
  36. 77 . C1 Control Character Set of ISO 6429 . ISO/TC97/SC2 . ISO/IEC JTC 1/SC 2#History . 1983-10-01 . 2022-03-03.
  37. Web site: Chapter 4 Character Properties. The Unicode Standard Version 14.0 - Core Specification. Table 4-4.
  38. Web site: Map (external version) from Mac OS Japanese encoding to Unicode 2.1 and later. . 2021-10-08 . 2021-08-31 . https://web.archive.org/web/20210831135118/http://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT . live .