Windows-1252 Explained

Windows-1252
MIME: windows-1252
Alias: cp1252 (code page 1252)
By: Microsoft
Standard: WHATWG Encoding Standard
Lang: All supported by ISO/IEC 8859-1 plus full support for French and Finnish and ligature forms for English; e.g. Danish (except for a rare exceptional letter), Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, German (missing uppercase ẞ), Icelandic, Faroese, Luxembourgish, Albanian, Estonian, Swahili, Tswana, Catalan, Basque, Occitan, Rotokas, Toki Pona, Lojban, Romansh, Dutch (except the Ĳ/ĳ character, substituted by IJ/ij or ÿ), and Slovene (except the č character, substituted by ç).
Extends: ISO 8859-1 (excluding C1 controls)
Next: Unicode (UTF-8, UTF-16)
Encodes: ISO 8859-15
Classification: extended ASCII, Windows-125x

Windows-1252 or CP-1252 (Windows code page 1252) is a legacy single-byte character encoding[1] that is used by default (as the "ANSI code page") in Microsoft Windows throughout the Americas, Western Europe, Oceania, and much of Africa.

Initially the same as ISO 8859-1, it began to diverge in Windows 2.0, which added characters in the 0x80 to 0x9F (hex) range (the ISO standards reserve this range for C1 control codes). Notable additions include curly quotation marks and all printable characters of ISO 8859-15, placed at different code positions than in ISO 8859-15.
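The relationship to ISO 8859-15 can be checked with a short sketch. It assumes Python's built-in "latin-1", "iso8859_15" and "cp1252" codecs are faithful to the respective encodings; the function name relocated_characters is illustrative only and does not come from any standard.

```python
import unicodedata


def relocated_characters():
    """List characters that ISO 8859-15 has but ISO 8859-1 lacks.

    Each row is (character, position in ISO 8859-15, position in Windows-1252).
    """
    rows = []
    for position in range(0xA0, 0x100):
        raw = bytes([position])
        char = raw.decode("iso8859_15")
        # Keep only the positions where ISO 8859-15 departs from ISO 8859-1.
        if char != raw.decode("latin-1"):
            rows.append((char, position, char.encode("cp1252")[0]))
    return rows


for char, pos_15, pos_1252 in relocated_characters():
    print(f"{unicodedata.name(char)}: 0x{pos_15:02X} in ISO 8859-15, "
          f"0x{pos_1252:02X} in Windows-1252")

# Curly quotation marks also live in the 0x80 to 0x9F range.
print("\u201cquoted\u201d".encode("cp1252"))
```

Under these assumptions, every character listed lands in the 0x80 to 0x9F range of Windows-1252, which is the range ISO 8859-1 reserves for C1 control codes.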

It is the most-used single-byte character encoding in the world. Although almost all websites now use the multi-byte character encoding UTF-8, as of July 2024 1.2% of websites declared ISO 8859-1, which all modern browsers treat as Windows-1252 (as required by the HTML5 standard), and a further 0.3% declared Windows-1252 directly,[2] [3] for a total of 1.5%. Some countries and languages show higher usage than the global average: in 2024, usage was 3.4% among Brazilian websites[4] and 2.7% among German websites[5] [6] (these figures are the sums of ISO-8859-1 and CP-1252 declarations).

Name

It is known to Windows by the code page number 1252, and by the IANA-approved name "windows-1252".

Historically, the phrase "ANSI Code Page" was used in Windows to refer to non-DOS encodings; the intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard. Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."[7]

LaTeX can input Windows-1252 by using inputenc.sty with parameter ansinew (and more recently cp1252).[8] [9]

IBM uses code page 1252 (CCSID 1252 and euro sign extended CCSID 5348) for Windows-1252.[10] [11] [12]

It is called "WE8MSWIN1252" by Oracle Database.[13]

History

Starting in the 1990s, many Microsoft products that could produce HTML included Windows-1252-exclusive characters, but marked the encoding as ISO-8859-1, ASCII, or undeclared. Characters exclusive to Windows-1252 would render incorrectly on non-Windows operating systems (often as question marks).[14] [15] In particular, typographers' quotes—curly variants of the standard straight apostrophes and quotation marks in US-ASCII—were commonly used in files produced in Windows applications such as Microsoft Word due to the smart quotes feature, which can automatically convert straight apostrophes and quotation marks to the curly variants.[16] To fix this, by 2000 most web browsers and e-mail clients treated the charsets ISO-8859-1 and US-ASCII as Windows-1252—this behavior is now required by the HTML5 specification.[17] Undeclared charsets in HTML are also assumed to be Windows-1252.[18] [19]
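The label handling described above can be sketched as follows. This is a minimal illustration, not a browser implementation: WINDOWS_1252_LABELS is only a subset of the labels that the WHATWG Encoding Standard maps to windows-1252, decode_declared is a hypothetical helper, and Python's "cp1252" codec stands in for the browser decoder. Defaulting an undeclared charset to Windows-1252 follows the text above and may not match every browser.

```python
from typing import Optional

# Subset of the labels the WHATWG Encoding Standard maps to windows-1252.
WINDOWS_1252_LABELS = frozenset({
    "ascii", "us-ascii", "iso-8859-1", "latin1", "cp1252", "windows-1252",
})


def decode_declared(data: bytes, declared: Optional[str]) -> str:
    """Decode page bytes the way the text describes for declared charsets."""
    # An undeclared charset is assumed to be Windows-1252 (see the caveat above).
    label = declared.strip().lower() if declared else "windows-1252"
    codec = "cp1252" if label in WINDOWS_1252_LABELS else label
    return data.decode(codec)


page = b"\x93Smart quotes\x94"

# Treated as Windows-1252, the bytes become curly quotation marks.
print(decode_declared(page, "ISO-8859-1"))

# Decoded strictly as ISO 8859-1, the same bytes become C1 control codes.
print(repr(page.decode("latin-1")))
```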

Although Windows NT supported Unicode and attempted to encourage programs to use it, it only provided the 16-bit code units of UCS-2/UTF-16, despite the existing support for other multibyte character encodings. As many applications preferred to use 8-bit strings, Windows-1252 remained the most popular encoding on Windows even after it added support for UTF-16. Unicode support in Windows has improved over time, with UTF-8 support available starting in Windows 10.

Codepage layout

The following table shows Windows-1252. Differences from ISO-8859-1 have the Unicode code point number below the character, based on the Unicode.org mapping of Windows-1252 with "best fit".

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API function MultiByteToWideChar (https://msdn.microsoft.com/en-us/library/dd319072.aspx) maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.
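The two treatments of the unused positions can be compared in a short sketch. It assumes Python's strict "cp1252" codec, which rejects the unused bytes; decode_best_fit is a hypothetical helper that imitates the pass-through to C1 control codes described above and is not a Windows API.

```python
UNUSED_POSITIONS = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def undefined_positions(codec: str = "cp1252"):
    """Return the byte values that the given codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def decode_best_fit(data: bytes) -> str:
    """Decode as Windows-1252, mapping unused bytes to C1 control codes."""
    chars = []
    for value in data:
        if value in UNUSED_POSITIONS:
            # Pass the byte through as the C1 control with the same number.
            chars.append(chr(value))
        else:
            chars.append(bytes([value]).decode("cp1252"))
    return "".join(chars)


print([f"0x{value:02X}" for value in undefined_positions()])
print(repr(decode_best_fit(b"\x80 \x81")))
```

A strict decoder surfaces stray bytes as errors, while the pass-through variant never fails on any input, which is the trade-off between the two mappings.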

Related encodings

OS/2 extensions

The OS/2 operating system supports an encoding by the name of Code page 1004 (CCSID 1004) or "Windows Extended".[20] [21] This mostly matches code page 1252, with the exception of certain C0 control characters being replaced by diacritic characters.

MS-DOS extensions (rare)

There is a rarely used but useful graphics-extended variant of code page 1252 in which codes 0x00 to 0x1F are used for box drawing, as in applications such as MS-DOS Edit and CodeView. One application that used this code page was an Intel Corporation install/recovery disk image utility from mid-to-late 1995. These programs were written for Intel's P6 User Test Program machines (US example[22]). It was used exclusively in Intel's then EMEA region (Europe, Middle East and Africa). In time the programs were changed to use code page 850.

Palm OS variant

Each Palm OS device supports a single language and a single character encoding, depending on its locale.[23]

For languages such as English and French, Palm OS uses a custom character encoding based on Windows-1252. For Japanese, it instead uses a multibyte character encoding based on code page 932. Regardless of the system locale, all characters in the range 0x00 to 0x7F are guaranteed to be the same, except 0x5C, which is the Yen sign in Japanese and a backslash in all other locales.[23]

Palm OS 3.1 introduced several changes to the character encoding to better align with Windows-1252:[24]

The following is the variant of Windows-1252 used by Palm OS 3.3 onward for English and several other locales.[25] Python gives it the label "palmos", describing it as the encoding for Palm OS 3.5.[26] [27] Differences from Windows-1252 are shown with their Unicode code point.
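The differences can also be listed programmatically. The sketch below assumes that Python's "palmos" and "cp1252" codecs reflect the Palm OS variant and Windows-1252 respectively; the helper names decode_or_none, palmos_differences and describe are illustrative only.

```python
def decode_or_none(value: int, codec: str):
    """Decode one byte with the codec, or return None if it is undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def palmos_differences():
    """Return (byte, Windows-1252 char, Palm OS char) where the two differ."""
    diffs = []
    for value in range(256):
        win = decode_or_none(value, "cp1252")
        palm = decode_or_none(value, "palmos")
        if win != palm:
            diffs.append((value, win, palm))
    return diffs


def describe(char):
    """Format a decoded character as a Unicode code point."""
    return "undefined" if char is None else f"U+{ord(char):04X}"


for value, win, palm in palmos_differences():
    print(f"0x{value:02X}: Windows-1252 {describe(win)}, Palm OS {describe(palm)}")
```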

Notes and References

  1. Web site: Encoding. Living Standard. § 9. Legacy single-byte encodings. 13 June 2024. 2024-06-28.
  2. Web site: Historical trends in the usage statistics of character encodings for websites, December 2023. 2024-07-19. w3techs.com.
  3. Web site: Frequently Asked Questions. w3techs.com.
  4. Web site: Distribution of Character Encodings among websites that use Brazil. 2024-07-19. W3Techs. Archived 4 Apr 2024: https://archive.today/20240404231434/https://w3techs.com/technologies/segmentation/sl-br-/character_encoding
  5. Web site: Distribution of Character Encodings among websites that use .de. 2024-07-19. W3Techs. Archived 4 Apr 2024: https://archive.today/20240404231947/https://w3techs.com/technologies/segmentation/tld-de-/character_encoding
  6. Web site: Distribution of Character Encodings among websites that use German. 2023-01-16. w3techs.com.
  7. Web site: Wissink, Cathy. Unicode and Windows XP. p. 1. 5 April 2002. 4 February 2015. Archived 4 February 2015: https://web.archive.org/web/20150204175931/http://download.microsoft.com/download/5/6/8/56803da0-e4a0-4796-a62c-ca920b73bb17/21-Unicode_WinXP.pdf
  8. Web site: LaTeX News, Issue 28. Apr 2018. The LaTeX Project. PDF; 379 KB. 2024-07-27.
  9. Web site: Inputenc – Accept different input encodings. 2024-02-08. The LaTeX Project. 2024-07-27.
  10. Web site: Code page 1252 information document. https://web.archive.org/web/20160303215813/http://www-01.ibm.com/software/globalization/cp/cp01252.html. IBM. 30 September 1997. 2016-03-03.
  11. Web site: CCSID 1252 information document. https://web.archive.org/web/20160326201651/http://www-01.ibm.com/software/globalization/ccsid/ccsid1252.html. 2016-03-26. IBM.
  12. Web site: CCSID 5348 information document. https://web.archive.org/web/20141129215139/http://www-01.ibm.com/software/globalization/ccsid/ccsid5348.html. 2014-11-29. IBM.
  13. Web site: Database Client Installation Guide. 2021-02-14. Oracle.
  14. Web site: Texin, Tex. Comparing Characters in Windows-1252, ISO-8859-1, ISO-8859-15. I18nQA.com.
  15. Web site: van Emden, Eva. How to make typographers' quotes in HTML. vancouvereditor.com. 7 January 2024. 28 January 2011. "If you use typographers' quotes without specifying the right character encoding for your HTML file, some of your viewers are going to see question marks, boxes, or other crazy symbols instead of the beautiful curly quotes you intended them to see."
  16. Web site: Smart quotes in Word. Microsoft Support. Microsoft. 7 January 2024.
  17. Web site: Encoding. sec. 5.2 Names and labels. 27 January 2015. 4 February 2015. Archived 4 February 2015: https://web.archive.org/web/20150204174315/https://encoding.spec.whatwg.org/#names-and-labels
  18. Web site: NetWare Web Search: Understanding Character Set Encodings. Novell Documentation. Novell. "if a document does not contain a CHARSET encoding value, the default encoding for HTML documents is ISO-8859-1, also known as Latin1. The default encoding for plain text documents is US-ASCII."
  19. Observed behavior in Chrome, this may be UTF-8 in some browsers.
  20. Web site: Code page 1004 information document. https://web.archive.org/web/20150625021145/http://www-01.ibm.com/software/globalization/cp/cp01004.html. 2015-06-25.
  21. Web site: CCSID 1004 information document. https://web.archive.org/web/20160326215410/http://www-01.ibm.com/software/globalization/ccsid/ccsid1004.html. 2016-03-26.
  22. Book: Storaasli, Olaf. Performance of NASA Equation Solvers on Computational Mechanics Applications. NASA. 1996. doi:10.2514/6.1996-1505. 15711051. https://pdfs.semanticscholar.org/88a2/d8b2ae94d7e2581fb3e94e2cd71635a8d655.pdf. Archived 2019-05-03: https://web.archive.org/web/20190503085826/https://pdfs.semanticscholar.org/88a2/d8b2ae94d7e2581fb3e94e2cd71635a8d655.pdf
  23. Book: Palm OS Programmer's Companion. Chapter 13: Localized Applications. Palm Computing Platform. March 16, 2000. p. 321.
  24. Book: Palm OS SDK Reference. Appendix B: Compatibility Guide. Palm Computing Platform. March 16, 2000. pp. 1181–1182.
  25. Web site: Walleij, Linus. Palm Pilot Character Sets And Unicode Mappings. GNU Recode. Datorföreningen vid Lunds Universitet och Lunds Tekniska Högskola. 10 October 2023.
  26. Web site: codecs—Codec registry and base classes (§ Text Encodings) . The Python Standard Library—Python 3.9.4 Documentation . Python Software Foundation.
  27. Web site: Mullender, Sjoerd. Python Character Mapping Codec for Palm OS 3.5. CPython source tree. 13 July 2002. 9 December 2021. Python Software Foundation.