UTF-8 explained

UTF-8
Standard:Unicode Standard
Classification:Unicode Transformation Format, extended ASCII, variable-length encoding
Encodes:ISO/IEC 10646 (Unicode)
Extends:ASCII
Prev:UTF-1

UTF-8 is a variable-length character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format 8-bit.[1]

UTF-8 is capable of encoding all 1,112,064 valid Unicode code points using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

UTF-8 was designed as a superior alternative to UTF-1, a proposed variable-length encoding with partial ASCII compatibility which lacked some features including self-synchronization and fully ASCII-compatible handling of characters such as slashes. Ken Thompson and Rob Pike produced the first implementation for the Plan 9 operating system in September 1992.[2] [3] This led to its adoption by X/Open as its specification for FSS-UTF,[4] which was first officially presented at USENIX in January 1993[5] and subsequently adopted by the Internet Engineering Task Force (IETF) in RFC 2277 (BCP 18)[6] for future internet standards work, replacing Single Byte Character Sets such as Latin-1 in older RFCs.

UTF-8 results in fewer internationalization issues than any alternative text encoding, and it has been implemented in all modern operating systems, including Microsoft Windows, and standards such as JSON, where, as is increasingly the case, it is the only allowed form of Unicode.

UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98.2% of all web pages, 99.1% of the top 100,000 pages, and up to 100% for many languages. Virtually all countries and languages have 95% or more use of UTF-8 encodings on the web.

Naming

The official name for the encoding is UTF-8, the spelling used in all Unicode Consortium documents. Most standards officially list it in upper case as well, but all that do are also case-insensitive and utf-8 is often used in code.

Some other spellings may also be accepted by standards, e.g. web standards (which include CSS, HTML, XML, and HTTP headers) explicitly allow utf8 (and disallow "unicode") and many aliases for encodings.[7] Spellings with a space e.g. "UTF 8" should not be used. The official Internet Assigned Numbers Authority also lists csUTF8 as the only alias,[8] which is rarely used.

In Windows, UTF-8 is codepage 65001[9] (i.e. CP_UTF8 in source code).

In MySQL, UTF-8 is called utf8mb4[10] (with utf8mb3, and its alias utf8, being a subset encoding for characters in the Basic Multilingual Plane[11]).

In HP PCL, the Symbol-ID for UTF-8 is 18N.[12]

In Oracle Database (since version 9.0), AL32UTF8[13] means UTF-8. See also CESU-8 for a near-synonym of UTF-8 that should rarely be used.

UTF-8-BOM and UTF-8-NOBOM are sometimes used for text files which contain or do not contain a byte-order mark (BOM), respectively. In Japan especially, UTF-8 encoding without a BOM is sometimes called UTF-8N.[14] [15]

Encoding

UTF-8 encodes code points in one to four bytes, depending on the value of the code point. In the following table, the x characters are replaced by the bits of the code point:

Code point ↔ UTF-8 conversion
First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

The first 128 code points (ASCII) need 1 byte. The next 1,920 code points need two bytes to encode, which covers the remainder of almost all Latin-script alphabets, and also IPA extensions, Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for the remaining 61,440 codepoints of the Basic Multilingual Plane (BMP), including most Chinese, Japanese and Korean characters. Four bytes are needed for the 1,048,576 codepoints in the other planes of Unicode, which include emoji (pictographic symbols), less common CJK characters, various historic scripts, and mathematical symbols.
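The byte counts and range sizes quoted above can be checked with a short sketch. The helper name utf8_length and the RANGE_SIZES table are illustrative, not part of any standard API; Python's built-in encoder is only used as a cross-check.

```python
# Sketch: number of UTF-8 bytes needed for one code point, following the
# four ranges described above. `utf8_length` is a hypothetical helper name.
def utf8_length(code_point: int) -> int:
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Number of encodable code points per encoded length. The 2,048 surrogate
# code points (U+D800..U+DFFF) are excluded from the three-byte range.
RANGE_SIZES = {
    1: 0x80,                      # U+0000..U+007F
    2: 0x800 - 0x80,              # U+0080..U+07FF
    3: 0x10000 - 0x800 - 0x800,   # U+0800..U+FFFF minus surrogates
    4: 0x110000 - 0x10000,        # U+10000..U+10FFFF
}

# Cross-check the helper against Python's own encoder for one sample
# character from each range (A, e-acute, euro sign, an emoji).
for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_length(ord(sample)) == len(sample.encode("utf-8"))
```

Summing the four range sizes gives the 1,112,064 valid code points mentioned in the introduction.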

A whole graphic character can take more than 4 bytes, because it is made of more than one code point. For instance, a national flag character takes 8 bytes since it is "constructed from a pair of Unicode scalar values" both from outside the BMP.[16]
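As a small illustration of the flag case, the sketch below builds a flag from two regional indicator symbols (the letters F and R are an arbitrary choice) and counts its code points and UTF-8 bytes.

```python
# Regional indicator symbols F and R. Many renderers display the pair as
# a single flag glyph, but it is still two separate code points.
flag = "\U0001F1EB\U0001F1F7"

code_points = len(flag)                  # Python's len() counts code points
utf8_bytes = len(flag.encode("utf-8"))   # each code point needs four bytes
```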

Examples

The following example shows how bits from the code point are distributed among the UTF-8 bytes.

  1. The Unicode code point for the euro sign € is U+20AC.
  2. As this code point lies between U+0800 and U+FFFF, this will take three bytes to encode.
  3. Hexadecimal 20AC is binary 0010 0000 1010 1100. The two leading zeros are added because a three-byte encoding needs exactly sixteen bits from the code point.
  4. Because the encoding will be three bytes long, its leading byte starts with three 1s, then a 0 (1110...).
  5. The four most significant bits of the code point are stored in the remaining low order four bits of this byte (11100010), leaving 12 bits of the code point yet to be encoded (...0000 1010 1100).
  6. All continuation bytes contain exactly six bits from the code point. So the next six bits of the code point are stored in the low order six bits of the next byte, and 10 is stored in the high order two bits to mark it as a continuation byte (so 10000010).
  7. Finally the last six bits of the code point are stored in the low order six bits of the final byte, and again 10 is stored in the high order two bits (10101100).

The three bytes 11100010 10000010 10101100 can be more concisely written in hexadecimal, as E2 82 AC.
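The steps above can be sketched as explicit bit operations. This is a minimal illustration rather than a reference implementation; the function name encode_utf8 is hypothetical, and real programs would normally call the language's built-in encoder.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 with explicit bit operations."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        return bytes([code_point])          # ASCII: one byte, unchanged
    if code_point <= 0x7FF:
        length, lead = 2, 0b11000000
    elif code_point <= 0xFFFF:
        length, lead = 3, 0b11100000
    else:
        length, lead = 4, 0b11110000
    out = []
    # Peel off six bits at a time, lowest first, for the continuation bytes.
    for _ in range(length - 1):
        out.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The remaining high bits fit in the free low-order bits of the lead byte.
    out.append(lead | code_point)
    return bytes(reversed(out))

euro = encode_utf8(0x20AC)   # the euro sign from the worked example
```

The same function handles the other lengths, since only the leading-byte prefix and the number of six-bit groups change.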

The following table summarizes this conversion, as well as others with different lengths in UTF-8.

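UTF-8 encoding process
Character                 Binary code point    Binary UTF-8                Hex UTF-8
£ (U+00A3)                000 1010 0011        11000010 10100011           C2 A3
И (U+0418)                100 0001 1000        11010000 10011000           D0 98
€ (U+20AC)                0010 0000 1010 1100  11100010 10000010 10101100  E2 82 AC
Suppl Private Use Area B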

The Vietnamese phrase Mình nói tiếng Việt ("I speak Vietnamese") is encoded as follows (␣ marks a space):

Character   M   ì      n   h   ␣   n   ó      i   ␣   t   i   ế         n   g   ␣   V   i   ệ         t
Code point  4D  EC     6E  68  20  6E  F3     69  20  74  69  1EBF      6E  67  20  56  69  1EC7      74
Hex UTF-8   4D  C3 AC  6E  68  20  6E  C3 B3  69  20  74  69  E1 BA BF  6E  67  20  56  69  E1 BB 87  74

Codepage layout

The following table summarizes usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half is for bytes used only in single-byte codes, so it looks like a normal code page; the lower half is for continuation bytes and leading bytes and is explained further in the legend below.

Overlong encodings

In principle, it would be possible to inflate the number of bytes in an encoding by padding the code point with leading 0s. To encode the euro sign € from the above example in four bytes instead of three, it could be padded with leading 0s until it was 21 bits long, and encoded as 11110000 10000010 10000010 10101100 (or F0 82 82 AC in hexadecimal). This is called an overlong encoding.

The standard specifies that the correct encoding of a code point uses only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point. This ensures that string comparisons and searches are well-defined.
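A quick way to see this rule in practice is to hand an overlong sequence to a strict decoder. The sketch below uses Python's built-in decoder; the helper name is_valid_utf8 is illustrative.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")      # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

shortest = b"\xe2\x82\xac"        # euro sign in its three-byte form
overlong = b"\xf0\x82\x82\xac"    # the same bits padded out to four bytes
overlong_slash = b"\xc0\xaf"      # '/' (U+002F) padded out to two bytes
```

Only the shortest form is accepted; both padded forms are rejected as invalid input.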

Invalid sequences and error handling

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for invalid input, such as overlong encodings and sequences that decode to an invalid code point.

Many of the first UTF-8 decoders would decode these, ignoring incorrect bits and accepting overlong results. Carefully crafted invalid UTF-8 could make them either skip or create ASCII characters such as slash or quotes. Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server[17] and Apache's Tomcat servlet container.[18] RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to:

"... treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence. Not decoding unpaired surrogate halves makes it impossible to store invalid UTF-16 (such as Windows filenames or UTF-16 that has been split between the surrogates) as UTF-8, while it is possible with WTF-8.
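The following sketch shows how a strict implementation treats these cases, using Python's built-in codec as the example. The helper names can_encode and can_decode are illustrative only.

```python
def can_encode(text: str) -> bool:
    """True if the strict UTF-8 encoder accepts the string."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

def can_decode(data: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

lone_surrogate = "\ud800"            # a high surrogate on its own
surrogate_bytes = b"\xed\xa0\x80"    # three-byte pattern for U+D800
last_valid = b"\xf4\x8f\xbf\xbf"     # U+10FFFF, the highest code point
past_the_end = b"\xf4\x90\x80\x80"   # bit pattern for U+110000
```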

Some implementations of decoders throw exceptions on errors.[19] This has the disadvantage that it can turn what would otherwise be harmless errors (such as a "no such file" error) into a denial of service. For instance early versions of Python 3.0 would exit immediately if the command line or environment variables contained invalid UTF-8.[20]

Since Unicode 6 (October 2010),[21] the standard (chapter 3) has recommended a "best practice" where the error is either one byte long, or ends before the first byte that is disallowed. In these decoders, a sequence such as E1 A0 C0 is two errors (2 bytes in the first one). This means an error is no more than three bytes long and never contains the start of a valid character, and there are 21,952 different possible errors.[22] The standard also recommends replacing each error with the replacement character "�" (U+FFFD).

These recommendations are not often followed. It is common to consider each byte to be an error, in which case a sequence such as E1 A0 C0 is three errors (each 1 byte long). This means there are only 128 different errors, and it is also common to replace them with 128 different characters, to make the decoding "lossless".
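The difference between the two error-counting conventions can be observed with a short experiment. The sketch below assumes CPython 3, whose "replace" handler appears to report one error per maximal ill-formed subsequence, while its "surrogateescape" handler maps every undecodable byte to its own code point. Other decoders may count differently, and the three-byte input is just an illustrative choice.

```python
# E1 starts a three-byte sequence, A0 is a valid continuation byte,
# and C0 can never appear in valid UTF-8.
data = b"\xe1\xa0\xc0"

# One U+FFFD per error reported by the decoder.
replaced = data.decode("utf-8", errors="replace")
errors_reported = replaced.count("\ufffd")

# One escape code point per undecodable byte, so nothing is lost.
per_byte = data.decode("utf-8", errors="surrogateescape")
bytes_escaped = len(per_byte)
```

Under these assumptions the first approach yields two errors and the second yields three, one per byte.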

Byte-order mark

If the Unicode byte-order mark (BOM, U+FEFF, technically the ZERO WIDTH NO-BREAK SPACE character) is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
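A brief sketch of how this looks in practice, using Python's standard codecs module. The "utf-8-sig" codec name is Python-specific; other environments expose BOM handling differently.

```python
import codecs

# Encoding U+FEFF as UTF-8 produces the three BOM bytes.
bom_bytes = "\ufeff".encode("utf-8")

# A file that starts with a BOM, followed by ordinary ASCII text.
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = with_bom.decode("utf-8")           # BOM survives as a leading U+FEFF
stripped = with_bom.decode("utf-8-sig")   # BOM is removed if present
```

Decoding with the plain "utf-8" codec leaves the BOM in the text, which is one way software that is not prepared for it ends up confused.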

The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file trans-coded from another encoding. While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).[23]

Adoption

See also: Popularity of text encodings.

UTF-8 has been the most common encoding for the World Wide Web since 2008.[24] UTF-8 is used by 98.2% of surveyed web sites.[25] Although many pages only use ASCII characters to display content, very few websites now declare their encoding to only be ASCII instead of UTF-8.[26] Over 50% of the languages tracked have 100% UTF-8 use.

Many standards only support UTF-8, e.g. JSON exchange requires it (without a byte-order mark (BOM)).[27] UTF-8 is also the recommendation from the WHATWG for HTML and DOM specifications, which states "UTF-8 encoding is the most appropriate encoding for interchange of Unicode",[28] and the Internet Mail Consortium recommends that all e‑mail programs be able to display and create mail using UTF-8.[29] [30] The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML (and not just using UTF-8, also declaring it in metadata), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results".[31]

Much software can read and write UTF-8. It may, however, require the user to change options from the normal settings, or may require a BOM (byte-order mark) as the first character to read the file. Examples of software supporting UTF-8 include Microsoft Word,[32] [33] [34] Microsoft Excel (2016 and later),[35] [36] Google Drive, LibreOffice and most databases.

For local text files, however, UTF-8 usage is less prevalent, and legacy single-byte (and a few CJK multi-byte) encodings remain in use. The primary cause of this is outdated text editors that refuse to read UTF-8 unless the first bytes of the file encode a byte-order mark (BOM).[37]

Some software can only read and write UTF-8 (or at least does not require a BOM).[38] Windows Notepad, in all currently supported versions of Windows, defaults to writing UTF-8 without a BOM (a change from older, unsupported versions of Notepad), bringing it into line with most other text editors.[39] Some system files on Windows 11 require UTF-8[40] with no requirement for a BOM, and almost all files on macOS and Linux are required to be UTF-8 without a BOM. Java 18 defaults to reading and writing files as UTF-8,[41] and in older versions (e.g. LTS versions) only the NIO API was changed to do so. Many other programming languages default to UTF-8 for I/O, including Ruby 3.0[42] [43] and R 4.2.2.[44] All current versions of Python support UTF-8 for I/O, even on Windows (where it is opt-in for the open function[45]), and plans exist to make UTF-8 I/O the default in Python 3.15 on all platforms.[46] [47] C++23 adopts UTF-8 as the only portable source code file format (there was none before).[48]

Usage of UTF-8 in memory is much lower than in other areas; UTF-16 is often used instead. This occurs particularly in Windows, but also in JavaScript, Python, Qt, and many other cross-platform software libraries. Compatibility with the Windows API is the primary reason for this; that choice was initially made due to the belief that direct indexing of the BMP would improve speed. Translating from/to external text which is in UTF-8 slows software down, and more importantly introduces bugs when different pieces of code do not do the exact same translation.

Back-compatibility is a serious impediment to changing code to use UTF-8 instead of a 16-bit encoding, but this is happening. The default string primitives in Go,[49] Julia, Rust, Swift 5, and PyPy[50] use UTF-8 internally in all cases. Python 3.3 uses UTF-8 internally for Python C API extensions[51] and sometimes for strings,[52] and a future version of Python is planned to store strings as UTF-8 by default.[53] Modern versions of Microsoft Visual Studio use UTF-8 internally.[54] Microsoft's SQL Server 2019 added support for UTF-8, and using it results in a 35% speed increase, and "nearly 50% reduction in storage requirements."

All currently supported Windows versions support UTF-8 in some way (including Xbox); partial support has existed since at least Windows XP. Microsoft has reversed its previous position of only recommending UTF-16; the capability to set UTF-8 as the "code page" for the Windows API was introduced; and Microsoft recommends programmers use UTF-8, and even states "UTF-16 [...] is a unique burden that Windows places on code that targets multiple platforms".

History

The International Organization for Standardization (ISO) set out to compose a universal multi-byte character set in 1989. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems. The biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII: new UTF-1 tools would be backward compatible with ASCII-encoded text, but UTF-1-encoded text could confuse existing code expecting ASCII (or extended ASCII), because it could contain continuation bytes in the range 0x21–0x7E that meant something else in ASCII, e.g., 0x2F for '/', the Unix path directory separator. This example is reflected in the name and introductory text of its replacement. The table below was derived from a textual description in the annex.

UTF-1
First code point  Last code point  Byte 1  Byte 2        Byte 3        Byte 4        Byte 5
U+0000            U+009F           00–9F
U+00A0            U+00FF           A0      A0–FF
U+0100            U+4015           A1–F5   21–7E, A0–FF
U+4016            U+38E2D          F6–FB   21–7E, A0–FF  21–7E, A0–FF
U+38E2E           U+7FFFFFFF       FC–FF   21–7E, A0–FF  21–7E, A0–FF  21–7E, A0–FF  21–7E, A0–FF

In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multi-byte sequences would include only bytes where the high bit was set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of the text of this proposal were later preserved in the final specification.[55] [56] [57] [58]

FSS-UTF

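FSS-UTF proposal (1992)
First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5
U+0000            U+007F           0xxxxxxx
U+0080            U+207F           10xxxxxx  1xxxxxxx
U+2080            U+8207F          110xxxxx  1xxxxxxx  1xxxxxxx
U+82080           U+208207F        1110xxxx  1xxxxxxx  1xxxxxxx  1xxxxxxx
U+2082080         U+7FFFFFFF       11110xxx  1xxxxxxx  1xxxxxxx  1xxxxxxx  1xxxxxxx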

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it self-synchronizing, letting a reader start anywhere and immediately detect character boundaries, at the cost of being somewhat less bit-efficient than the previous proposal. It also abandoned the use of biases and instead added the rule that only the shortest possible encoding is allowed; the additional loss in compactness is relatively insignificant, but readers now have to look out for invalid encodings to avoid reliability and especially security issues. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[57]

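FSS-UTF (1992) / UTF-8 (1993)
First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
U+0000            U+007F           0xxxxxxx
U+0080            U+07FF           110xxxxx  10xxxxxx
U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
U+10000           U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
U+200000          U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
U+4000000         U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx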
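The self-synchronizing property of Thompson's design can be illustrated with a short sketch. Continuation bytes always carry the bits 10 in their two high-order positions, so a reader dropped at an arbitrary offset only has to skip such bytes to reach the next character boundary. The function names below are hypothetical.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return byte & 0b11000000 == 0b10000000

def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a character starts.

    Returns len(data) if only continuation bytes remain.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos

# 'a', the euro sign (three bytes) and 'b': five bytes in total.
sample = "a\u20acb".encode("utf-8")
resync = next_boundary(sample, 2)   # start in the middle of the euro sign
```

Starting at offset 2, inside the euro sign, the scan stops at offset 4, where the next character begins.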

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 (BCP 18) for future internet standards work in January 1998, replacing Single Byte Character Sets such as Latin-1 in older RFCs.

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.
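The two percentages can be reproduced with simple arithmetic on the range boundaries. The variable names below are illustrative.

```python
# Three-byte sequences originally covered U+0800..U+FFFF.
three_byte_total = 0x10000 - 0x800
surrogates = 0xE000 - 0xD800                 # U+D800..U+DFFF
three_byte_removed = surrogates / three_byte_total

# Four-byte sequences originally covered U+10000..U+1FFFFF;
# after the restriction only U+10000..U+10FFFF remain.
four_byte_total = 0x200000 - 0x10000
four_byte_kept = 0x110000 - 0x10000
four_byte_removed = (four_byte_total - four_byte_kept) / four_byte_total
```

This gives 2,048 of 63,488 three-byte sequences (about 3.2%) and 983,040 of 2,031,616 four-byte sequences (about 48.4%).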

Standards

There are several current definitions of UTF-8 in various standards documents:

They supersede the definitions given in the following obsolete works:

They are all the same in their general mechanics, with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.

Comparison with other encodings

See also: Comparison of Unicode encodings.

Some of the important features of this encoding are as follows:

Single-byte

Other multi-byte

UTF-16

See main article: UTF-16.

Derivatives

The following implementations show slight differences from the UTF-8 specification. They are incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications.

CESU-8

See main article: CESU-8.

Unicode Technical Report #26[64] assigns the name CESU-8 to a nonstandard variant of UTF-8, in which Unicode characters in supplementary planes are encoded using six bytes, rather than the four bytes required by UTF-8. CESU-8 encoding treats each half of a four-byte UTF-16 surrogate pair as a two-byte UCS-2 character, yielding two three-byte UTF-8 characters, which together represent the original supplementary character. Unicode characters within the Basic Multilingual Plane appear as they would normally in UTF-8. The Report was written to acknowledge and formalize the existence of data encoded as CESU-8, despite the Unicode Consortium discouraging its use, and notes that a possible intentional reason for CESU-8 encoding is preservation of UTF-16 binary collation.

CESU-8 encoding can result from converting UTF-16 data with supplementary characters to UTF-8, using conversion methods that assume UCS-2 data, meaning they are unaware of four-byte UTF-16 supplementary characters. It is primarily an issue on operating systems which extensively use UTF-16 internally, such as Microsoft Windows.
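The conversion path described above can be sketched as follows. This is a minimal illustration, not a library API: the function name cesu8_encode is hypothetical, and Python's "surrogatepass" error handler is used only to force each UTF-16 code unit through the three-byte UTF-8 pattern.

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 by treating each UTF-16 code unit separately."""
    utf16 = text.encode("utf-16-be")
    # Split the UTF-16 byte stream into 16-bit code units.
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # Encode every unit on its own, including surrogate halves, which a
    # strict UTF-8 encoder would reject.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

supplementary = "\U00010400"              # a character outside the BMP
as_cesu8 = cesu8_encode(supplementary)    # six bytes
as_utf8 = supplementary.encode("utf-8")   # four bytes
```

Text that stays within the Basic Multilingual Plane comes out byte-for-byte identical to UTF-8; only supplementary characters differ.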

In Oracle Database, the UTF8 character set uses CESU-8 encoding, and is deprecated. The AL32UTF8 character set uses standards-compliant UTF-8 encoding, and is preferred.[65] [66]

CESU-8 is prohibited for use in HTML5 documents.[67] [68] [69]

MySQL utf8mb3

In MySQL, the utf8mb3 character set is defined to be UTF-8 encoded data with a maximum of three bytes per character, meaning only Unicode characters in the Basic Multilingual Plane (i.e. from UCS-2) are supported. Unicode characters in supplementary planes are explicitly not supported. utf8mb3 is deprecated in favor of the utf8mb4 character set, which uses standards-compliant UTF-8 encoding. utf8 is an alias for utf8mb3, but is intended to become an alias to utf8mb4 in a future release of MySQL. It is possible, though unsupported, to store CESU-8 encoded data in utf8mb3, by handling UTF-16 data with supplementary characters as though it is UCS-2.
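An application can check in advance whether a string would fit a three-byte-per-character limit of this kind. The sketch below is an illustration only; fits_three_byte_utf8 is a hypothetical helper, not a MySQL function.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """True if every character lies in the Basic Multilingual Plane,
    i.e. needs at most three bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)

bmp_only = "Vi\u1ec7t \u20ac"        # Vietnamese letters and a euro sign
with_emoji = "ok \U0001F600"         # contains a supplementary character
```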

Modified UTF-8

Modified UTF-8 (MUTF-8) originated in the Java programming language. In Modified UTF-8, the null character (U+0000) uses the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80), instead of 00000000 (hexadecimal 00).[70] Modified UTF-8 strings never contain any actual null bytes but can contain all Unicode code points including U+0000,[71] which allows such strings (with a null byte appended) to be processed by traditional null-terminated string functions. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.

In normal usage, the language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter (if it is the platform's default character set or as requested by the program). However, it uses Modified UTF-8 for object serialization[72] among other applications of DataInput and DataOutput, for the Java Native Interface,[73] and for embedding constant strings in class files.[74]

The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values.[75] Tcl also uses the same modified UTF-8[76] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data.
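The null-character rule of Modified UTF-8 can be sketched in a few lines. This illustration covers only that rule and assumes text within the Basic Multilingual Plane; it does not implement the CESU-8 style surrogate handling that Modified UTF-8 also uses. The function name is hypothetical.

```python
def encode_nul_as_overlong(text: str) -> bytes:
    """Encode BMP text as UTF-8, writing U+0000 as the overlong pair C0 80.

    In valid UTF-8 a zero byte can only come from U+0000, so a plain
    byte replacement is sufficient for this sketch.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

embedded_nul = "a\x00b"
modified = encode_nul_as_overlong(embedded_nul)
```

The result contains no zero byte, so a terminating null can be appended safely; a strict UTF-8 decoder, however, rejects the C0 80 pair as overlong.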

WTF-8

In WTF-8 (Wobbly Transformation Format, 8-bit) unpaired surrogate halves (U+D800 through U+DFFF) are allowed.[77] This is necessary to store possibly-invalid UTF-16, such as Windows filenames. Many systems that deal with UTF-8 work this way without considering it a different encoding, as it is simpler.[78]
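Python's "surrogatepass" error handler gives a rough feel for this behaviour, since it lets an unpaired surrogate through the ordinary three-byte pattern. It is not a full WTF-8 implementation: as an assumption worth stating, it would also pass paired surrogates through as two three-byte sequences, which WTF-8 does not allow.

```python
# A string containing an unpaired low surrogate, as might come from an
# ill-formed UTF-16 filename.
unpaired = "file\udc00name"

wobbly = unpaired.encode("utf-8", "surrogatepass")
restored = wobbly.decode("utf-8", "surrogatepass")
```

The round trip preserves the unpaired surrogate, whereas the strict codec would raise an error on both the encode and the decode step.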

The term "WTF-8" has also been used humorously to refer to erroneously doubly-encoded UTF-8[79] [80] sometimes with the implication that CP1252 bytes are the only ones encoded.[81]

PEP 383

Version 3 of the Python programming language treats each byte of an invalid UTF-8 bytestream as an error (see also changes with new UTF-8 mode in Python 3.7[82]); this gives 128 different possible errors. Extensions have been created to allow any byte sequence that is assumed to be UTF-8 to be losslessly transformed to UTF-16 or UTF-32, by translating the 128 possible error bytes to reserved code points, and transforming those code points back to error bytes to output UTF-8. The most common approach is to translate the codes to U+DC80...U+DCFF which are low (trailing) surrogate values and thus "invalid" UTF-16, as used by Python's PEP 383 (or "surrogateescape") approach.[83] Another encoding called MirBSD OPTU-8/16 converts them to U+EF80...U+EFFF in a Private Use Area.[84] In either approach, the byte value is encoded in the low eight bits of the output code point.

These encodings are very useful because they avoid the need to deal with "invalid" byte strings until much later, if at all, and allow "text" and "data" byte arrays to be the same object. If a program wants to use UTF-16 internally these are required to preserve and use filenames that can use invalid UTF-8;[85] as the Windows filesystem API uses UTF-16, the need to support invalid UTF-8 is less there.[83]

For the encoding to be reversible, the standard UTF-8 encodings of the code points used for erroneous bytes must be considered invalid. This makes the encoding incompatible with WTF-8 or CESU-8 (though only for 128 code points). When re-encoding it is necessary to be careful of sequences of error code points which convert back to valid UTF-8, which may be used by malicious software to get unexpected characters in the output, though this cannot produce ASCII characters so it is considered comparatively safe, since malicious sequences (such as cross-site scripting) usually rely on ASCII characters.
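The round trip described in this section can be tried directly with Python's "surrogateescape" error handler. The file name below is an invented example of Latin-1 bytes that are not valid UTF-8.

```python
# 0xE9 is e-acute in Latin-1; here it is followed by '.', so it cannot
# be the start of a valid UTF-8 sequence.
raw = b"caf\xe9.txt"

# The undecodable byte becomes U+DC00 + 0xE9 = U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the escape back into the byte.
round_trip = text.encode("utf-8", "surrogateescape")

# The original byte value sits in the low eight bits of the code point.
escaped_byte = ord(text[3]) & 0xFF
```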

Notes and References

  1. Book: The Unicode Standard . 6.0 . Chapter 2. General Structure . . Mountain View, California, US . 978-1-936213-01-6 . https://www.unicode.org/versions/Unicode6.0.0/.
  2. Web site: UTF-8 history . Rob . Pike . 30 April 2003 .
  3. Book: https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf . Hello World or Καλημέρα κόσμε or こんにちは 世界 . Proceedings of the Winter 1993 USENIX Conference . Rob . Pike . Ken . Thompson . 1993.
  4. Web site: File System Safe UCS - Transformation Format (FSS-UTF) - X/Open Preliminary Specification. unicode.org.
  5. Web site: USENIX Winter 1993 Conference Proceedings. usenix.org.
  6. 2277 . 18 . IETF Policy on Character Sets and Languages . January 1998 . Alvestrand . Harald T. . Harald Alvestrand . IETF.
  7. Web site: Encoding Standard § 4.2. Names and labels. WHATWG. 2018-04-29.
  8. Web site: . Character Sets . 2013-01-23 . 2013-02-08.
  9. Web site: UTF-8 codepage 65001 in Windows 7 - part I . Liviu . Previously under XP (and, unverified, but probably Vista, too) for loops simply did not work while codepage 65001 was active . en-gb . 2014-02-07 . 2018-01-30.
  10. Web site: MySQL :: MySQL 8.0 Reference Manual :: 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding) . MySQL 8.0 Reference Manual . . 2023-03-14.
  11. Web site: MySQL :: MySQL 8.0 Reference Manual :: 10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding) . MySQL 8.0 Reference Manual . . 2023-02-24.
  12. Web site: HP PCL Symbol Sets Printer Control Language (PCL & PXL) Support Blog. https://web.archive.org/web/20150219212843/http://pclhelp.com/pcl-symbol-sets/. dead. 2015-02-19. 2015-02-19. 2018-01-30.
  13. Web site: Database Globalization Support Guide . 2023-03-16 . docs.oracle.com . en.
  14. Web site: BOM . suikawiki . https://web.archive.org/web/20090117052232/https://suika.fam.cx/~wakaba/wiki/sw/n/BOM . 2009-01-17 . ja.
  15. Web site: Davis . Mark . Mark Davis (Unicode) . Forms of Unicode . . 2013-09-18 . https://web.archive.org/web/20050506211548/https://www-128.ibm.com/developerworks/library/utfencodingforms/index.html . 2005-05-06.
  16. Web site: String . Apple Developer . 2021-03-15.
  17. Marvin . Marin . 2000-10-17 . Windows NT UNICODE vulnerability analysis . Web server folder traversal . MS00-078 . Malware FAQ . SANS Institute . dead . https://web.archive.org/web/20140827001204/http://www.sans.org/security-resources/malwarefaq/wnt-unicode.php . Aug 27, 2014.
  18. Web site: CVE-2008-2938 . 2008 . National Vulnerability Database (nvd.nist.gov) . .
  19. Web site: DataInput . Java Platform SE 8) . docs.oracle.com . 2021-03-24.
  20. Web site: Non-decodable bytes in system character interfaces . 2009-04-22 . python.org . 2014-08-13.
  21. Unicode 6.0.0 . October 2010 . unicode.org .
  22. one byte: 128 = 128
    two bytes: (16 + 5) × 64 = 1,344
    three bytes: 5 × 64 × 64 = 20,480
    total: 21,952
    There may be somewhat fewer if more precise tests are done for each continuation byte.
  23. Web site: UTF-8 and Unicode FAQ for Unix/Linux .
  24. Web site: Mark . Davis . Mark Davis (Unicode) . 2008-05-05 . Moving to Unicode 5.1 . Official Google blog . en . 2023-03-13.
  25. Web site: Usage Survey of Character Encodings broken down by Ranking . W3Techs . en . 2024-05-22.
  26. Web site: Usage statistics and market share of ASCII for websites . January 2024 . W3Techs . 2024-01-01.
  27. Bray . Tim . Bray . Tim . December 2017 . The JavaScript Object Notation (JSON) Data Interchange Format . IETF . 10.17487/RFC8259 . 16 February 2018 . 8259.
  28. Web site: Encoding Standard . encoding.spec.whatwg.org . 2020-04-15.
  29. Web site: Using International Characters in Internet Mail . Internet Mail Consortium . 1998-08-01 . 2007-11-08 . dead . https://web.archive.org/web/20071026103104/https://www.imc.org/mail-i18n.html . 2007-10-26 .
  30. Web site: Encoding Standard . encoding.spec.whatwg.org . en . 2018-11-15.
  31. Specifying the document's character encoding . HTML 5.2 . 14 December 2017 . . https://www.w3.org/TR/html5/document-metadata.html#charset . 2018-06-03 . cs1.
  32. Web site: Choose text encoding when you open and save files . Microsoft Support (support.microsoft.com). 2021-11-01.
  33. Web site: UTF-8 - Character encoding of Microsoft Word DOC and DOCX files? . Stack Overflow . 2021-11-01.
  34. Web site: Exporting a UTF-8 .txt file from Word . support.3playmedia.com .
  35. Web site: Are XLSX files UTF-8 encoded, by definition? . Excel . Stack Overflow . 2021-11-01.
  36. Web site: Abhinav, Ankit . Xu, Jazlyn . April 13, 2020 . How to open UTF-8 CSV file in Excel without mis-conversion of characters in Japanese and Chinese language for both Mac and Windows? . Microsoft Support Community . en-US . 2021-11-01.
  37. Web site: How can I make Notepad to save text in UTF-8 without the BOM? . Stack Overflow . 2021-03-24.
  38. Web site: Galloway . Matt . October 2012 . Character encoding for iOS developers; or, UTF-8 what now? . www.galloway.me.uk . en-UK . 2021-01-02 . ... in reality, you usually just assume UTF-8 since that is by far the most common encoding..
  39. Web site: Windows 10 Notepad is getting better UTF-8 encoding support . BleepingComputer . 2021-03-24 . Microsoft is now defaulting to saving new text files as UTF-8 without BOM, as shown below. . en-us.
  40. Web site: Customize the Windows 11 Start menu . 2021-06-29 . docs.microsoft.com . en-us . Make sure your LayoutModification.json uses UTF-8 encoding..
  41. Web site: UTF-8 by default . JEP 400 . openjdk.java.net . 2022-03-30.
  42. Web site: Set default for Encoding.default_external to UTF-8 on Windows . Ruby master . Feature #16604 . Ruby Issue Tracking System (bugs.ruby-lang.org) . 2022-08-01.
  43. Web site: Feature #12650: Use UTF-8 encoding for ENV on Windows . Ruby master . Ruby Issue Tracking System (bugs.ruby-lang.org) . 2022-08-01.
  44. Web site: New features in R 4.2.0 . 2022-04-01 . R bloggers (r-bloggers.com) . The Jumping Rivers Blog . 2022-08-01 . en-US.
  45. Web site: add a new UTF-8 mode . peps.python.org . PEP 540 . 2022-09-23.
  46. Web site: Make UTF-8 mode default . peps.python.org . PEP 686 . 2023-07-26.
  47. Web site: PEP 597 . Add optional EncodingWarning . Python.org . en . 2021-08-24.
  48. Support for UTF-8 as a portable source file encoding . 2022 . p2295r6 . open-std.org .
  49. Source code representation . The Go Programming Language Specification . golang.org . https://golang.org/ref/spec#Source_code_representation . 2021-02-10.
  50. Web site: PyPy v7.1 released; now uses UTF-8 internally for Unicode strings . Mattip . 2019-03-24 . PyPy status blog . 2020-11-21.
  51. Web site: Common Object Structures . 2024-05-29 . Python documentation . en.
  52. Web site: Unicode objects and codecs . 2023-08-19 . Python documentation . UTF-8 representation is created on demand and cached in the Unicode object.
  53. Web site: Wouters . Thomas . 2023-07-11 . Python 3.12.0 beta 4 released . Python Insider (pythoninsider.blogspot.com) . blog post . 2023-07-26 . The deprecated wstr and wstr_length members of the C implementation of unicode objects were removed, per PEP 623.
  54. Web site: validate-charset (validate for compatible characters) . docs.microsoft.com . en-us . 2021-07-19 . Visual Studio uses UTF-8 as the internal character encoding during conversion between the source character set and the execution character set.
  55. Appendix F. FSS-UTF / File System Safe UCS Transformation format . The Unicode Standard 1.1 . 2016-06-07 . live . https://web.archive.org/web/20160607215950/https://www.unicode.org/versions/Unicode1.1.0/appF.pdf . 2016-06-07.
  56. Web site: FSS-UTF, UTF-2, UTF-8, and UTF-16 . Kenneth . Whistler . 2001-06-12 . 2006-06-07 . live . https://web.archive.org/web/20160607220249/https://unicode.org/mail-arch/unicode-ml/y2001-m06/0318.html . 2016-06-07 .
  57. Web site: UTF-8 history . Rob . Pike . Rob Pike . 2003-04-30 . 2012-09-07.
  58. Web site: UTF-8 turned 20 years old yesterday . Rob . Pike . Rob Pike . 2012-09-06 . 2012-09-07.
  59. https://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=63182 ISO/IEC 10646:2014 §9.1
  60. The Unicode Standard, Version 15.0 §3.9 D92, §3.10 D95, 2021.
  61. https://www.unicode.org/reports/tr27/tr27-3.html Unicode Standard Annex #27: Unicode 3.1
  62. https://www.unicode.org/versions/Unicode5.0.0/ The Unicode Standard, Version 5.0
  63. https://www.unicode.org/versions/Unicode6.0.0/ The Unicode Standard, Version 6.0
  64. Web site: Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) . Rick . McGowan . 2011-12-19 . Unicode Technical Report #26 . Unicode Consortium.
  65. Web site: Character Set Support . Oracle Database 19c Documentation, SQL Language Reference . Oracle Corporation.
  66. Web site: Supporting Multilingual Databases with Unicode § Support for the Unicode Standard in Oracle Database . Database Globalization Support Guide . Oracle Corporation.
  67. Web site: 8.2.2.3. Character encodings . HTML 5.1 Standard . W3C.
  68. Web site: 8.2.2.3. Character encodings . HTML 5 Standard . W3C.
  69. Web site: 12.2.3.3 Character encodings . HTML Living Standard . WHATWG.
  70. Web site: Java SE documentation for Interface java.io.DataInput, subsection on Modified UTF-8 . 2015 . 2015-10-16.
  71. Web site: The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure" . 2015 . 2015-10-16.
  72. Web site: Java Object Serialization Specification, chapter 6: Object Serialization Stream Protocol, section 2: Stream Elements . 2010 . 2015-10-16.
  73. Web site: Java Native Interface Specification, chapter 3: JNI Types and Data Structures, section: Modified UTF-8 Strings . 2015 . 2015-10-16.
  74. Web site: The Java Virtual Machine Specification, section 4.4.7: "The CONSTANT_Utf8_info Structure" . 2015 . 2015-10-16.
  75. Web site: ART and Dalvik . Android Open Source Project . 2013-04-09 . dead . https://web.archive.org/web/20130426010617/https://source.android.com/tech/dalvik/dex-format.html . 2013-04-26 .
  76. Web site: UTF-8 bit by bit . 2001-02-28 . 2022-09-03 . Tcler's Wiki.
  77. Web site: The WTF-8 encoding . Simon . Sapin . 2016-03-11 . 2014-09-25 . 2016-05-24 . live . https://web.archive.org/web/20160524180037/https://simonsapin.github.io/wtf-8/ . 2016-05-24.
  78. Web site: The WTF-8 encoding § Motivation . Simon . Sapin . 2015-03-25 . 2014-09-25 . 2020-08-26 . live . https://web.archive.org/web/20200816090721/https://simonsapin.github.io/wtf-8/#motivation . 2020-08-16 .
  79. Web site: WTF-8.com. 2006-05-18. 2016-06-21.
  80. Web site: ftfy (fixes text for you) 4.0: changing less and fixing more. Robyn. Speer. 2015-05-21. 2016-06-21. https://web.archive.org/web/20150530150039/https://blog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-4-0-changing-less-and-fixing-more/. 2015-05-30.
  81. Web site: WTF-8, a transformation format of code page 1252. 2016-10-12 . dead . https://web.archive.org/web/20161013072641/http://www-uxsup.csx.cam.ac.uk/~fanf2/hermes/doc/qsmtp/draft-fanf-wtf8.html . 2016-10-13 .
  82. Web site: PEP 540 -- Add a new UTF-8 Mode. 2021-03-24. Python.org. en.
  83. Web site: PEP 383 . Non-decodable Bytes in System Character Interfaces . en . Martin . von Löwis . 2009-04-22.
  84. Web site: RTFM optu8to16(3), optu8to16vis(3) . www.mirbsd.org.
  85. Web site: 3.7 Enabling Lossless Conversion . Davis . Mark . Mark Davis (Unicode) . Michel . Suignard . Unicode Security Considerations . Unicode Technical Report #36 . 2014.