URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding, it is used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). It is also used to prepare data of the application/x-www-form-urlencoded media type, which is often used when submitting HTML form data in HTTP requests.
The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.
{| class="wikitable"
|+ Reserved characters
|-
| ! || # || $ || & || ' || ( || ) || * || + || , || / || : || ; || = || ? || @ || [ || ]
|}
Other characters in a URI must be percent-encoded.

Reserved characters

When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits (if there is a single hex digit, a leading zero is added). The digits, preceded by a percent sign (%) as an escape character, are then used in the URI in place of the reserved character.
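The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete URI encoder: the names `percent_encode_char`, `encode_reserved`, and `RESERVED` are hypothetical, and the sketch only touches characters from the reserved set, leaving everything else (including spaces and non-ASCII characters) unchanged.

```python
# Reserved set as listed in this article (assumed here for illustration).
RESERVED = "!#$&'()*+,/:;=?@[]"


def percent_encode_char(ch: str) -> str:
    """Return '%' followed by the two-digit hex value of one ASCII character."""
    byte_value = ord(ch)  # the character's byte value in ASCII
    if byte_value > 0x7F:
        raise ValueError("expected a single ASCII character")
    # '02X' pads with a leading zero when the value fits in one hex digit.
    return "%{:02X}".format(byte_value)


def encode_reserved(text: str) -> str:
    """Percent-encode every reserved character in text; leave the rest as-is."""
    return "".join(
        percent_encode_char(c) if c in RESERVED else c for c in text
    )


print(percent_encode_char("/"))   # %2F
print(percent_encode_char("\t"))  # %09 (leading zero added)
print(encode_reserved("a/b?c"))   # a%2Fb%3Fc
```

In practice a library routine such as Python's `urllib.parse.quote` would normally be used instead, since it also handles characters outside the reserved set.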
{| class="wikitable"
|+ Reserved characters after percent-encoding
|-
| ! || # || $ || & || ' || ( || ) || * || + || , || / || : || ; || = || ? || @ || [ || ]
|-
| %21 || %23 || %24 || %26 || %27 || %28 || %29 || %2A || %2B || %2C || %2F || %3A || %3B || %3D || %3F || %40 || %5B || %5D
|}

Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not.

In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but normally has no reserved purpose unless a particular URI scheme says otherwise.

URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally considered not equivalent (that is, they are not taken to denote the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination depends on the rules established for reserved characters by individual URI schemes.

Unreserved characters

Characters from the unreserved set never need to be percent-encoded.

URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers should not treat %41 differently from A.

Percent character

Because the percent character (%) serves as the indicator for percent-encoded bytes, it must itself be percent-encoded as %25 when it is used as data within a URI.

Arbitrary data

Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI.
URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters.

Binary data

Since the publication of RFC 1738 in 1994, it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above.[1] Byte value 0x0F, for example, should be represented by %0F.

Character data

The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was simply assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.

For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted.
Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.
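The difference between binary data and character data can be illustrated with Python's standard `urllib.parse` module. The sketch below is only an illustration under stated assumptions: the sample character "é" and the choice of UTF-8 and ISO-8859-1 (Latin-1) as the two competing encodings are made up for the example and do not come from any particular URI scheme.

```python
from urllib.parse import quote, unquote

# Binary data: each byte maps to exactly one percent-encoded triplet,
# so there is nothing to interpret.
binary_encoded = quote(b"\x0f")

# Character data: the same character yields different byte sequences,
# and therefore different percent-encodings, under different encodings.
text = "é"
as_utf8 = quote(text, encoding="utf-8")
as_latin1 = quote(text, encoding="latin-1")

# A consumer that guesses the wrong encoding recovers the wrong text.
misread = unquote(as_utf8, encoding="latin-1")

print(binary_encoded)  # %0F
print(as_utf8)         # %C3%A9
print(as_latin1)       # %E9
print(misread)         # Ã©
```

Nothing in the URI itself says which of the two encodings produced `%C3%A9` or `%E9`, which is the ambiguity described above.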