A variable-length quantity (VLQ) is a universal code that uses an arbitrary number of binary octets (eight-bit bytes) to represent an arbitrarily large integer. A VLQ is essentially a base-128 representation of an unsigned integer with the addition of the eighth bit to mark continuation of bytes. VLQ is identical to LEB128 except in endianness. See the example below.
Base-128 compression is known by many namesVB (Variable Byte), VByte, Varint, VInt, EncInt etc.[1]
A variable-length quantity (VLQ) was defined for use in the standard MIDI file format[2] to save additional space for a resource-constrained system, and is also used in the later Extensible Music Format (XMF).
Base-128 is also used in ASN.1 BER encoding to encode tag numbers and object identifiers.[3] It is also used in the WAP environment, where it is called variable length unsigned integer or uintvar. The DWARF debugging format[4] defines a variant called LEB128 (or ULEB128 for unsigned numbers), where the least significant group of 7 bits is encoded in the first byte, and the most significant bits are in the last byte (so effectively it is the little-endian analog of a VLQ). Google Protocol Buffers use a similar format to have compact representation of integer values,[5] as does Oracle Portable Object Format (POF)[6] and the Microsoft .NET Framework "7-bit encoded int" in the BinaryReader and BinaryWriter classes.[7]
It is also used extensively in web browsers for source mapping – which contain a lot of integer line and column number mappings – to keep the size of the map to a minimum.[8]
Variable-width integers in LLVM use a similar principle. The encoding chunks are little-endian and need not be 8 bits in size. The LLVM documentation describes a field that uses 4-bit chunk, with each chunk consisting of 1 bit continuation and 3 bits payload.[9]
The encoding assumes an octet (an 8-bit byte) where the most significant bit (MSB), also commonly known as the sign bit, is reserved to indicate whether another VLQ octet follows.
If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.
B is a 7-bit number [0x00, 0x7F], and n is the position of the VLQ octet where B0 is the least significant. The VLQ octets are arranged most significant first in a stream.
The general VLQ encoding is simple, but in basic form is only defined for unsigned integers (nonnegative, positive or zero), and is somewhat redundant, since prepending 0x80 octets corresponds to zero padding. There are various signed number representations to handle negative numbers, and techniques to remove the redundancy.
Google developed Group Varint Encoding (GVE) after observing that traditional VLQ encoding incurs many CPU branches during decompression. GVE uses a single byte as a header for 4 variable-length uint32 values. The header byte has 4 2-bit numbers representing the storage length of each of the following 4 uint32s. Such a layout eliminates the need to check and remove VLQ continuation bits. Data bytes can be copied directly to their destination. This layout reduces CPU branches, making GVE faster than VLQ on modern pipelined CPUs.[10]
PrefixVarint is a similar design but with a uint64 maximum. It is said to have "been invented multiple times independently".[11] It is possible to be changed into a chained version with infinitely many continuations.
Negative numbers can be handled using a sign bit, which only needs to be present in the first octet.
In the data format for Unreal Packages used by the Unreal Engine, a variable-length quantity scheme called Compact Indices[12] is used. The only difference in this encoding is that the first VLQ octet has the sixth bit reserved to indicate whether the encoded integer is positive or negative. Any consecutive VLQ octet follows the general structure.
Other VLQ octets | ||||||||||||||||
7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | |
A | B | C0 | A | Cn (n > 0) |
If A is 0, then this is the last VLQ octet of the integer. If A is 1, then another VLQ octet follows.
If B is 0, then the VLQ represents a positive integer. If B is 1, then the VLQ represents a negative number.
C is the number chunk being encoded, and n is the position of the VLQ octet where C0 is the least significant. The VLQ octets are arranged least significant first in a stream.
An alternative way to encode negative numbers is to use the least significant bit for sign. This is notably done for Google Protocol Buffers, and is known as a zigzag encoding for signed integers.[13] One can encode the numbers so that encoded 0 corresponds to 0, 1 to -1, 10 to 1, 11 to -2, 100 to 2, etc.: counting up alternates between nonnegative (starting at 0) and negative (since each step changes the least-significant bit, hence the sign), whence the name "zigzag encoding". Concretely, transform the integer as (n << 1) ^ (n >> k - 1)
for fixed k-bit integers.
LEB128 uses two's complement to represent signed numbers. In this scheme of representation, n bits encode a range from −2n to 2n − 1, and all negative numbers start with a 1 in the most significant bit. In Signed LEB128, the input is sign-extended so that its length is a multiple of 7 bits. From there the encoding proceeds as usual.[14]
In LEB128, the stream is arranged least significant first.[14]
With the VLQ encoding described above, any number that can be encoded with N octets can also be encoded with more than N octets simply by prepending additional 0x80 octets as zero-padding. For example, the decimal number 358 can be encoded as the 2-octet VLQ 0x8266, or the number 0358 can be encoded as 3-octet VLQ 0x808266, or 00358 as the 4-octet VLQ 0x80808266 and so forth.
However, the VLQ format used in Git[15] removes this prepending redundancy and extends the representable range of shorter VLQs by adding an offset to VLQs of 2 or more octets in such a way that the lowest possible value for such an (N + 1)-octet VLQ becomes exactly one more than the maximum possible value for an N-octet VLQ. In particular, since a 1-octet VLQ can store a maximum value of 127, the minimum 2-octet VLQ (0x8000) is assigned the value 128 instead of 0. Conversely, the maximum value of such a 2-octet VLQ (0xFF7F) is instead of just . Similarly, the minimum 3-octet VLQ (0x808000) has a value of instead of zero, which means that the maximum 3-octet VLQ (0xFFFF7F) is instead of just .
In this way, there is one and only one encoding of each integer, making this a base-128 bijective numeration.
Here is a worked-out example for the decimal number 137:
Another way to look at this is to represent the value in base-128 and then set the MSB of all but the last base-128 digit to 1.
The Standard MIDI File format specification gives more examples:[16]
Integer (decimal) | Integer (binary) | Variable-length quantity (binary) | Integer (hexadecimal) | Variable-length quantity (hexadecimal) | |
---|---|---|---|---|---|
00000000 00000000 00000000 00000000 | 00000000 | ||||
00000000 00000000 00000000 01111111 | 01111111 | ||||
00000000 00000000 00000000 10000000 | 10000001 00000000 | ||||
00000000 00000000 00100000 00000000 | 11000000 00000000 | ||||
00000000 00000000 00111111 11111111 | 11111111 01111111 | ||||
00000000 00000000 01000000 00000000 | 10000001 10000000 00000000 | ||||
00000000 00011111 11111111 11111111 | 11111111 11111111 01111111 | ||||
00000000 00100000 00000000 00000000 | 10000001 10000000 10000000 00000000 | ||||
00001000 00000000 00000000 00000000 | 11000000 10000000 10000000 00000000 | ||||
00001111 11111111 11111111 11111111 | 11111111 11111111 11111111 01111111 |