Arithmetic coding explained

Arithmetic coding (AC) is a form of entropy encoding used in lossless data compression. Normally, a string of characters is represented using a fixed number of bits per character, as in the ASCII code. When a string is converted to arithmetic encoding, frequently used characters are stored with fewer bits and less frequently occurring characters are stored with more bits, resulting in fewer bits used in total. Arithmetic coding differs from other forms of entropy encoding, such as Huffman coding, in that rather than separating the input into component symbols and replacing each with a code, arithmetic coding encodes the entire message into a single number, an arbitrary-precision fraction q, where 0.0 ≤ q < 1.0. It represents the current information as a range, defined by two numbers.^[1] A recent family of entropy coders called asymmetric numeral systems allows for faster implementations thanks to directly operating on a single natural number representing the current information.^[2]

Implementation details and examples

Equal probabilities

In the simplest case, the probability of each symbol occurring is equal. For example, consider a set of three symbols, A, B, and C, each equally likely to occur. Encoding the symbols one by one would require 2 bits per symbol, which is wasteful: one of the bit variations is never used. That is to say, symbols A, B and C might be encoded respectively as 00, 01 and 10, with 11 unused.

A more efficient solution is to represent a sequence of these three symbols as a rational number in base 3 where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201₃, which arithmetic coding treats as a value in the interval [0, 1). The next step is to encode this ternary number using a fixed-point binary number of sufficient precision to recover it, such as 0.0010110010₂. This is only 10 bits, so 2 bits are saved in comparison with naïve block encoding. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers.

To decode the value, knowing the original string had length 6, one can simply convert back to base 3, round to 6 digits, and recover the string.
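The round trip described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the digit mapping A=0, B=1, C=2 follows the example, while the helper names `to_fraction`, `to_binary` and `from_binary` and the choice of rounding to the nearest binary fraction are hypothetical, not part of any standard coder.

```python
from fractions import Fraction

SYMBOLS = "ABC"  # assumed digit mapping: 0 -> A, 1 -> B, 2 -> C

def to_fraction(message):
    """Read the message as the base-3 fraction 0.d1 d2 d3 ..."""
    value = Fraction(0)
    for position, symbol in enumerate(message, start=1):
        value += Fraction(SYMBOLS.index(symbol), 3 ** position)
    return value

def to_binary(value, bits):
    """Round to the nearest multiple of 2**-bits and return the bit string."""
    numerator = round(value * 2 ** bits)
    return format(numerator, f"0{bits}b")

def from_binary(bit_string, length):
    """Convert back to base 3, round to `length` digits, map digits to symbols."""
    value = Fraction(int(bit_string, 2), 2 ** len(bit_string))
    numerator = round(value * 3 ** length)
    symbols = []
    for _ in range(length):
        numerator, digit = divmod(numerator, 3)
        symbols.append(SYMBOLS[digit])  # least significant digit first
    return "".join(reversed(symbols))

code = to_binary(to_fraction("ABBCAB"), bits=10)
print(code)                  # the 10 bits after the binary point
print(from_binary(code, 6))  # ABBCAB
```

Ten bits suffice here because the rounding error is at most 2⁻¹¹, which is smaller than half the spacing 3⁻⁶ between six-digit ternary fractions.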

Defining a model

In general, arithmetic coders can produce near-optimal output for any given set of symbols and probabilities. (The optimal value is −log₂P bits for each symbol of probability P; see Source coding theorem.) Compression algorithms that use arithmetic coding start by determining a model of the data – basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be.

Example: a simple, static model for describing the output of a particular monitoring instrument over time might be:

60% chance of symbol NEUTRAL
20% chance of symbol POSITIVE
10% chance of symbol NEGATIVE
10% chance of symbol END-OF-DATA. (The presence of this symbol means that the stream will be 'internally terminated', as is fairly common in data compression; when this symbol appears in the data stream, the decoder will know that the entire stream has been decoded.)
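As a rough illustration of the −log₂P bound for this model, the following sketch computes the ideal code length of each symbol and the expected number of bits per symbol. The names `MODEL`, `optimal_bits` and `expected_bits` are chosen for this example only.

```python
import math

# The static four-symbol model from the example above.
MODEL = {
    "NEUTRAL": 0.6,
    "POSITIVE": 0.2,
    "NEGATIVE": 0.1,
    "END-OF-DATA": 0.1,
}

def optimal_bits(probability):
    """Ideal code length, in bits, for a symbol of the given probability."""
    return -math.log2(probability)

for symbol, p in MODEL.items():
    print(f"{symbol:12s} {optimal_bits(p):.3f} bits")

# Expected bits per symbol if the data really follows the model.
expected_bits = sum(p * optimal_bits(p) for p in MODEL.values())
print(f"expected: {expected_bits:.3f} bits per symbol")
```

Under this model a fixed-length code would spend 2 bits on every symbol, whereas the expected ideal cost is about 1.57 bits.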

Models can also handle alphabets other than the simple four-symbol set chosen for this example. More sophisticated models are also possible: higher-order modelling changes its estimation of the current probability of a symbol based on the symbols that precede it (the context), so that in a model for English text, for example, the percentage chance of "u" would be much higher when it follows a "Q" or a "q". Models can even be adaptive, so that they continually change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder.

Encoding and decoding: overview

In general, each step of the encoding process, except for the last, is the same; the encoder has basically just three pieces of data to consider:

The next symbol that needs to be encoded
The current interval (at the very start of the encoding process, the interval is set to [0, 1), but that will change)
The probabilities the model assigns to each of the various symbols that are possible at this stage (as mentioned earlier, higher-order or adaptive models mean that these probabilities are not necessarily the same in each step.)

The encoder divides the current interval into sub-intervals, each representing a fraction of the current interval proportional to the probability of that symbol in the current context. Whichever interval corresponds to the actual symbol that is next to be encoded becomes the interval used in the next step.

Example: for the four-symbol model above:

the interval for NEUTRAL would be [0, 0.6)
the interval for POSITIVE would be [0.6, 0.8)
the interval for NEGATIVE would be [0.8, 0.9)
the interval for END-OF-DATA would be [0.9, 1).

When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the same final interval and model that is being used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval.
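The subdivision step and the resulting encoder loop might be sketched as follows. This is a minimal sketch that uses exact fractions instead of the fixed-precision arithmetic of practical coders; `subdivide` and `encode` are hypothetical helper names, and the three-symbol message is just a sample input.

```python
from fractions import Fraction

# The static four-symbol model, in the order used for the sub-intervals.
MODEL = [
    ("NEUTRAL", Fraction(6, 10)),
    ("POSITIVE", Fraction(2, 10)),
    ("NEGATIVE", Fraction(1, 10)),
    ("END-OF-DATA", Fraction(1, 10)),
]

def subdivide(low, high, model):
    """Split [low, high) into sub-intervals proportional to each probability."""
    width = high - low
    intervals = {}
    start = low
    for symbol, probability in model:
        end = start + width * probability
        intervals[symbol] = (start, end)
        start = end
    return intervals

def encode(message, model):
    """Narrow [0, 1) once per symbol and return the final interval."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        low, high = subdivide(low, high, model)[symbol]
    return low, high

low, high = encode(["NEUTRAL", "NEGATIVE", "END-OF-DATA"], MODEL)
print(float(low), float(high))  # 0.534 0.54
```

Each call to `subdivide` depends only on the current interval and the model, which is why a decoder holding the same model can retrace the same steps.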

It is not necessary to transmit the final interval, however; it is only necessary to transmit one fraction that lies within that interval. In particular, it is only necessary to transmit enough digits (in whatever base) of the fraction so that all fractions that begin with those digits fall into the final interval; this will guarantee that the resulting code is a prefix code.
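One way to pick such digits in binary is to search for the shortest bit string whose every extension stays inside the interval. The sketch below assumes the interval is given as exact fractions with low < high; the function name `shortest_prefix_code` and the sample interval are illustrative only.

```python
from fractions import Fraction
import math

def shortest_prefix_code(low, high):
    """Return the shortest bit string b such that every binary fraction
    starting with 0.b lies in [low, high). Assumes 0 <= low < high <= 1."""
    n = 1
    while True:
        scale = 2 ** n
        # Smallest n-bit fraction k / 2**n that is not below `low`.
        k = math.ceil(low * scale)
        # All extensions of those n bits lie in [k/2**n, (k+1)/2**n).
        if Fraction(k + 1, scale) <= high:
            return format(k, f"0{n}b")
        n += 1

bits = shortest_prefix_code(Fraction(534, 1000), Fraction(54, 100))
print(bits)  # 10001001
```

The loop always terminates, because once 2⁻ⁿ is at most half the interval width some n-bit cell fits entirely inside the interval.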

Encoding and decoding: example

Consider the process for decoding a message encoded with the given four-symbol model. The message is encoded in the fraction 0.538 (using decimal for clarity, instead of binary; also assuming that there are only as many digits as needed to decode the message.)

The process starts with the same interval used by the encoder: [0,1), and using the same model, dividing it into the same four sub-intervals that the encoder must have. The fraction 0.538 falls into the sub-interval for NEUTRAL, [0, 0.6); this indicates that the first symbol the encoder read must have been NEUTRAL, so this is the first symbol of the message.

Next divide the interval [0, 0.6) into sub-intervals:

the interval for NEUTRAL would be [0, 0.36), 60% of [0, 0.6).
the interval for POSITIVE would be [0.36, 0.48), 20% of [0, 0.6).
the interval for NEGATIVE would be [0.48, 0.54), 10% of [0, 0.6).
the interval for END-OF-DATA would be [0.54, 0.6), 10% of [0, 0.6).

Since 0.538 is within the interval [0.48, 0.54), the second symbol of the message must have been NEGATIVE.

Again divide our current interval into sub-intervals:

the interval for NEUTRAL would be [0.48, 0.516).
the interval for POSITIVE would be [0.516, 0.528).
the interval for NEGATIVE would be [0.528, 0.534).
the interval for END-OF-DATA would be [0.534, 0.540).

Now 0.538 falls within the interval of the END-OF-DATA symbol; therefore, this must be the next symbol. Since it is also the internal termination symbol, it means the decoding is complete. If the stream is not internally terminated, there needs to be some other way to indicate where the stream stops. Otherwise, the decoding process could continue forever, mistakenly reading more symbols from the fraction than were in fact encoded into it.
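The decoding walk-through above can be reproduced with a short sketch. It assumes exact fractions and an internally terminated stream; the function name `decode` and the `max_symbols` safety limit are additions for this example, the latter guarding against the endless decoding mentioned above.

```python
from fractions import Fraction

MODEL = [
    ("NEUTRAL", Fraction(6, 10)),
    ("POSITIVE", Fraction(2, 10)),
    ("NEGATIVE", Fraction(1, 10)),
    ("END-OF-DATA", Fraction(1, 10)),
]

def decode(value, model, terminator="END-OF-DATA", max_symbols=1000):
    """Repeatedly find the sub-interval containing `value` until the
    terminator symbol is decoded."""
    low, high = Fraction(0), Fraction(1)
    message = []
    for _ in range(max_symbols):
        width = high - low
        start = low
        for symbol, probability in model:
            end = start + width * probability
            if start <= value < end:
                break
            start = end
        else:
            raise ValueError("value lies outside [0, 1)")
        message.append(symbol)
        low, high = start, end  # this sub-interval becomes the current one
        if symbol == terminator:
            return message
    raise ValueError("no terminator found within max_symbols")

print(decode(Fraction("0.538"), MODEL))
# ['NEUTRAL', 'NEGATIVE', 'END-OF-DATA']
```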

Sources of inefficiency

The message 0.538 in the previous example could have been encoded by the equally short fractions 0.534, 0.535, 0.536, 0.537 or 0.539. This suggests that the use of decimal instead of binary introduced some inefficiency. This is correct; the information content of a three-digit decimal is

$3 \times \log_2(10) \approx 9.966$

bits; the same message could have been encoded in the binary fraction 0.10001001 (equivalent to 0.53515625 decimal) at a cost of only 8 bits.

This 8-bit output is larger than the information content, or entropy, of the message, which is

$\sum -\log_2(p_i) = -\log_2(0.6) - \log_2(0.1) - \log_2(0.1) \approx 7.381 \text{ bits}.$

But an integer number of bits must be used in the binary encoding, so an encoder for this message would use at least 8 bits, resulting in a message 8.4% larger than its entropy. This inefficiency of at most 1 bit results in relatively less overhead as the message size grows.
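The figures in this section can be checked numerically. The sketch below only redoes the arithmetic from the text; the variable names are chosen for this example.

```python
import math

# Model probabilities of the three symbols actually sent:
# NEUTRAL, NEGATIVE, END-OF-DATA.
probabilities = [0.6, 0.1, 0.1]

decimal_bits = 3 * math.log2(10)  # information in three decimal digits
information_bits = sum(-math.log2(p) for p in probabilities)
code_bits = math.ceil(information_bits)  # whole bits needed at minimum
overhead = code_bits / information_bits - 1

print(f"three decimal digits: {decimal_bits:.3f} bits")     # 9.966
print(f"information content:  {information_bits:.3f} bits")  # 7.381
print(f"binary code length:   {code_bits} bits")             # 8
print(f"overhead:             {overhead:.1%}")               # 8.4%
```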

Moreover, the claimed symbol probabilities were [0.6, 0.2, 0.1, 0.1], but the actual frequencies in this example are [0.33, 0, 0.33, 0.33]. If the intervals are readjusted for these frequencies, the entropy of the message would be 4.755 bits and the same NEUTRAL NEGATIVE END-OF-DATA message could be encoded as intervals [0, 1/3); [1/9, 2/9); [5/27, 6/27); and a binary interval of [0.00101111011, 0.00111000111). This is also an example of how statistical coding methods like arithmetic encoding can produce an output message that is larger than the input message, especially if the probability model is off.

Adaptive arithmetic coding

One advantage of arithmetic coding over other similar methods of data compression is the convenience of adaptation. Adaptation is the changing of the frequency (or probability) tables while processing the data. The decoded data matches the original data as long as the frequency table in decoding is replaced in the same way and in the same step as in encoding. The synchronization is usually based on a combination of symbols occurring during the encoding and decoding process.
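A minimal sketch of adaptation, assuming a simple count-based model that starts with every count at 1 and increments the count of each symbol after it is processed. The class `AdaptiveModel`, the use of exact fractions, and passing the message length to the decoder are simplifications for illustration, not a description of any particular codec.

```python
from fractions import Fraction

class AdaptiveModel:
    """Frequency table that starts uniform and counts every symbol seen."""

    def __init__(self, alphabet):
        self.counts = {symbol: 1 for symbol in alphabet}

    def intervals(self, low, high):
        """Yield (symbol, start, end), splitting [low, high) by current counts."""
        total = sum(self.counts.values())
        width = high - low
        start = low
        for symbol, count in self.counts.items():
            end = start + width * Fraction(count, total)
            yield symbol, start, end
            start = end

    def update(self, symbol):
        self.counts[symbol] += 1

def encode(message, alphabet):
    model = AdaptiveModel(alphabet)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        for candidate, start, end in model.intervals(low, high):
            if candidate == symbol:
                low, high = start, end
                break
        model.update(symbol)  # update after coding the symbol
    return low, high

def decode(value, alphabet, length):
    model = AdaptiveModel(alphabet)
    low, high = Fraction(0), Fraction(1)
    message = []
    for _ in range(length):
        for symbol, start, end in model.intervals(low, high):
            if start <= value < end:
                low, high = start, end
                break
        message.append(symbol)
        model.update(symbol)  # identical update, at the identical step
    return "".join(message)

low, high = encode("ABAAB", "AB")
print(decode(low, "AB", 5))  # ABAAB
```

The decoder never receives the frequency table: it rebuilds the same table from the symbols it has already decoded, which is what keeps the two sides synchronized.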

Precision and renormalization

The above explanations of arithmetic coding contain some simplification. In particular, they are written as if the encoder first calculated the fractions representing the endpoints of the interval in full, using infinite precision, and only converted the fraction to its final form at the end of encoding. Rather than try to simulate infinite precision, most arithmetic coders instead operate at a fixed limit of precision which they know the decoder will be able to match, and round the calculated fractions to their nearest equivalents at that precision. As an example, consider a model that calls for the interval to be divided into thirds, approximated with 8-bit precision. Since the precision is now known, so are the binary ranges that can be used.
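The thirds example can be worked through with a small sketch. It assumes that each cumulative boundary is rounded to the nearest multiple of 1/256 and that each range is reported as inclusive 8-bit endpoints; `quantized_ranges` is a hypothetical helper, and real coders differ in how they round.

```python
from fractions import Fraction

PRECISION_BITS = 8
SCALE = 2 ** PRECISION_BITS  # 256 representable values

def quantized_ranges(probabilities):
    """Round each cumulative boundary to 8-bit precision and return the
    inclusive (lowest, highest) bit patterns of every sub-range."""
    ranges = []
    cumulative = Fraction(0)
    start = 0
    for probability in probabilities:
        cumulative += probability
        end = round(cumulative * SCALE)  # exclusive bound, in units of 1/256
        ranges.append((format(start, "08b"), format(end - 1, "08b")))
        start = end
    return ranges

thirds = [Fraction(1, 3)] * 3
for symbol, (lowest, highest) in zip("ABC", quantized_ranges(thirds)):
    print(symbol, lowest, highest)
# A 00000000 01010100
# B 01010101 10101010
# C 10101011 11111111
```

The three ranges hold 85, 86 and 85 values, so the effective probabilities are 85/256, 86/256 and 85/256 rather than exactly one third each.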

Notes and References

Li, Ze-Nian; Drew, Mark S.; Liu, Jiangchuan (9 April 2014). Fundamentals of Multimedia. Springer Science & Business Media. ISBN 978-3-319-05290-8.
Duda, J.; Tahboub, K.; Gadil, N. J.; Delp, E. J. The use of asymmetric numeral systems as an accurate replacement for Huffman coding. https://ieeexplore.ieee.org/document/7170048/
Pasco, Richard Clark (May 1976). Source coding algorithms for fast data compression (PhD thesis). Stanford Univ. CiteSeerX 10.1.1.121.3377.
"What is JPEG?". comp.compression Frequently Asked Questions (part 1/3).
"Recommendation T.81 (1992) Corrigendum 1 (01/04)". Recommendation T.81 (1992). International Telecommunication Union. 9 November 2004. Retrieved 3 February 2011.
Pennebaker, W. B.; Mitchell, J. L. (1992). JPEG Still Image Data Compression Standard. Kluwer Academic Press. ISBN 0442012721.
"T.81 – DIGITAL COMPRESSION AND CODING OF CONTINUOUS-TONE STILL IMAGES – REQUIREMENTS AND GUIDELINES". September 1992. Retrieved 12 July 2019.
"Frequently Asked Questions". comp.compression.
"Dirac video codec 1.0 released [LWN.net]". lwn.net.
For instance, discuss versions of arithmetic coding based on real-number ranges, integer approximations to those ranges, and an even more restricted type of approximation that they call binary quasi-arithmetic coding. They state that the difference between real and integer versions is negligible, prove that the compression loss for their quasi-arithmetic method can be made arbitrarily small, and bound the compression loss incurred by one of their approximations as less than 0.06%. See: .