MPEG-G (ISO/IEC 23092) is an ISO/IEC standard for the representation of genomic information, developed jointly by ISO/IEC JTC 1/SC 29/WG 9 (MPEG) and ISO TC 276 "Biotechnology" Working Group 5. The goal of the standard is to provide interoperable solutions, across different implementations, for storing, accessing, and protecting the data generated by high-throughput sequencing machines and by their subsequent processing and analysis.[1] [2] The standard is composed of several parts, each addressing a specific aspect, such as compression, metadata association, Application Programming Interfaces (APIs), and reference software for data decoding. Alongside the reference decoder software, commercial and open source[3] implementations became available starting in 2019, progressively covering more of the published parts of the standard.
The advent of high-throughput sequencing (HTS) technologies has revolutionized the field of quantitative biology. The availability of large collections of genomic information is now part of everyday practice and has become a cornerstone of a number of disciplines, ranging from biological research to personalized medicine in the clinic. At the moment, genomic information is mostly exchanged through a variety of data formats, such as FASTA/FASTQ for unaligned sequencing reads and SAM/BAM/CRAM for aligned reads. The ISO/IEC 23092 (MPEG-G) standard aims to provide a unified format for the efficient representation and compression of such diverse data, both for file storage and for data transport. To that end, the standard is divided into several parts.
The MPEG-G standard uses technology and data representation architectures previously validated in the field of digital media. These allow genome sequencing data to be compressed and transported even in complex scenarios, for instance when access is needed to large amounts of possibly distributed data, or when part of the data needs to be encrypted for privacy reasons. Conceptually, such requirements lead to the definition of a number of mutually interrelated mechanisms, which are summarized in the following list:
In turn, some of these topics have been grouped together to make the standard easier to understand and implement. As a result, the ISO/IEC 23092 standard is physically structured as a series of separate documents, as follows:
Part | Number | First public release date (first edition) | Latest public release date (edition) | Latest amendment | Title | Description
---|---|---|---|---|---|---
Part 1 | ISO/IEC 23092-1 | 2019 | 2019 |  | Transport and Storage of Genomic Information | Specification of file format, streaming and indexing
Part 2 | ISO/IEC 23092-2 | 2019 | 2019 |  | Coding of Genomic Information | Compression of unmapped (raw) and aligned genome sequencing data
Part 3 | ISO/IEC 23092-3 | 2020 | 2020 |  | Metadata and Application Programming Interfaces (APIs) | Specification of standard interfaces, syntax for metadata and description of content protection mechanisms
Part 4 | ISO/IEC 23092-4 | (2020) |  |  | Reference Software | Describes the open source implementation of a normative decoder and an informative encoder, and provides compressed bitstreams that can be used for reference purposes. Other open source implementations developed by independent groups also exist.[4]
Part 5 | ISO/IEC 23092-5 | (2020) |  |  | Conformance testing | Details the testing procedure and the associated compressed reference bitstreams to be used when assessing the conformance of a decoder implementation with the MPEG-G standard
Part 6 | ISO/IEC 23092-6 | (2021) |  |  | Coding of genomic annotations | Compressed representation of genomic annotations, that is, a number of heterogeneous data types associated with intervals of the reference genome to which the sequencing data has been aligned
ISO/IEC 23092-1 specifies how genomic data is organized within MPEG-G structures for transport (i.e., streaming) and storage. This part defines the formats of the genomic record, the reference record, the MPEG-G file, and the transport stream. It introduces the Access Unit as the container of the compressed genomic data and provides a reference conversion process between the different formats.
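To make the role of access units and indexing more concrete, the following is a minimal sketch of how an index over access units could support selective access to a genomic region. The `AccessUnitEntry` fields and the `find_access_units` function are hypothetical names chosen for illustration; they are not the normative data structures of ISO/IEC 23092-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessUnitEntry:
    """Hypothetical index entry; field names are illustrative only."""
    sequence_id: str  # reference sequence the contained records map to
    start: int        # start of the covered interval (0-based, inclusive)
    end: int          # end of the covered interval (exclusive)
    offset: int       # byte offset of the access unit payload in the file
    size: int         # payload length in bytes


def find_access_units(index, sequence_id, start, end):
    """Return the entries whose covered interval overlaps [start, end)."""
    return [
        e for e in index
        if e.sequence_id == sequence_id and e.start < end and start < e.end
    ]


# Toy index with three access units (all values are made up).
index = [
    AccessUnitEntry("chr1", 0, 10_000, offset=0, size=512),
    AccessUnitEntry("chr1", 10_000, 20_000, offset=512, size=480),
    AccessUnitEntry("chr2", 0, 10_000, offset=992, size=501),
]

# Only the payloads listed here would need to be read and decoded.
hits = find_access_units(index, "chr1", 9_500, 10_500)
print([(e.offset, e.size) for e in hits])  # [(0, 512), (512, 480)]
```

The point of the sketch is that a query touching a small region only requires decoding the access units that overlap it, rather than the whole file.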
ISO/IEC 23092-2 specifies the syntax and methods for MPEG-G lossless compression of sequencing data and lossy compression of the associated quality scores. As is typical for MPEG standards, MPEG-G specifies only the decoding process, while the encoding process is left open to algorithmic and implementation-specific innovations. All conformant MPEG-G decoders produce identical outputs from the multiplexed bitstreams included in MPEG-G files and from the data streams in streaming scenarios.
The input to the encoder consists of genomic records or metadata, with optional reference data, and its output is an MPEG-G file or transport stream.
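As an illustration of what lossy coding of quality scores can mean in practice, the sketch below quantizes Phred quality values into uniform bins, so that fewer distinct values remain to be entropy coded. Uniform binning and the `quantize_quality` function are assumptions made for this example; this is not the quantization scheme defined in ISO/IEC 23092-2.

```python
def quantize_quality(qualities, bin_size=8):
    """Map each Phred score to the midpoint of its uniform bin (lossy).

    Illustrative only: uniform binning is an assumption of this sketch,
    not the method specified by the standard.
    """
    if bin_size < 1:
        raise ValueError("bin_size must be a positive integer")
    return [(q // bin_size) * bin_size + bin_size // 2 for q in qualities]


original = [2, 11, 30, 37, 40, 38, 39, 33]
quantized = quantize_quality(original, bin_size=8)

print(quantized)                                 # [4, 12, 28, 36, 44, 36, 36, 36]
print(len(set(original)), len(set(quantized)))   # 8 5
```

The original values cannot be recovered from the quantized ones, which is what makes the scheme lossy, while the read sequences themselves would still be coded losslessly.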
ISO/IEC 23092-3 specifies a metadata format and provides genomic data representation APIs to support interoperability among existing tools and systems. Part 3 specifies how an MPEG-G compliant bitstream can be integrated with metadata, as well as mechanisms to implement access control, integrity verification, authentication and authorization. This part also contains an informative section devoted to the mapping between SAM and MPEG-G data structures, including backward compatibility with existing SAM content. It defines:
API group | Description
---|---
Genomic Information | Functions used to query the structure of, and retrieve, the genomic information coded in a bitstream compliant with the ISO/IEC 23092 series.
Metadata | Functions used to query the structure of, and retrieve, the metadata associated with the coded genomic data.
Protection | Functions used to retrieve the protection metadata associated with the coded genomic data.
Reference | Functions used to retrieve the reference associated with a dataset.
Statistics | Functions used to retrieve statistics associated with a dataset.
ISO/IEC 23092-4 specifies the reference software for genomic information representation, referred to as the genomic model (GM). It consists of two components: the reference encoder software and the reference decoder software. The reference decoder software is provided to assess conformance to the requirements of ISO/IEC 23092-1, ISO/IEC 23092-2 and ISO/IEC 23092-6, while the reference encoder software serves as a guide for implementing those parts. The reference encoder software, called Genie, is open source software developed by a group of individuals from multiple universities and companies around the world. It features the following components:
Part | Number | Component
---|---|---
Part 1 | ISO/IEC 23092-1 | Encapsulation
Part 1 | ISO/IEC 23092-1 | Indexing
Part 2 | ISO/IEC 23092-2 | Classification
Part 2 | ISO/IEC 23092-2 | Reference engine
Part 2 | ISO/IEC 23092-2 | Quality value quantization
Part 2 | ISO/IEC 23092-2 | Descriptor subsequence generation
Part 2 | ISO/IEC 23092-2 | Transformations
Part 2 | ISO/IEC 23092-2 | Entropy encoding
Part 6 | ISO/IEC 23092-6 | (To be determined)
ISO/IEC 23092-5 specifies conformance of the coding of genomic information. Part 5 provides a means to test and validate the correct implementation of the MPEG-G technology in different devices and applications to ensure the interoperability among all systems. It specifies a normative procedure to assess conformity to the standard on an exhaustive set of compressed data.
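Because all conformant decoders are expected to produce identical outputs, the general idea of a conformance check can be sketched as decoding each reference bitstream and comparing the result with the expected output. The `check_conformance` function, the test-case layout, and the use of SHA-256 digests are assumptions made for this example; the normative procedure is the one defined in ISO/IEC 23092-5.

```python
import hashlib


def check_conformance(decode, test_cases):
    """Return the names of the cases whose decoded output does not match.

    `decode` is any callable mapping a bitstream (bytes) to decoded bytes.
    Each test case is (name, bitstream, expected SHA-256 hex digest).
    Hypothetical harness, not the procedure from the standard.
    """
    failures = []
    for name, bitstream, expected_sha256 in test_cases:
        digest = hashlib.sha256(decode(bitstream)).hexdigest()
        if digest != expected_sha256:
            failures.append(name)
    return failures


def toy_decode(bitstream: bytes) -> bytes:
    """Stand-in for a real decoder: simply reverses the bytes."""
    return bitstream[::-1]


cases = [
    ("case-1", b"TGCA", hashlib.sha256(b"ACGT").hexdigest()),
    ("case-2", b"AAAA", hashlib.sha256(b"CCCC").hexdigest()),
]

print(check_conformance(toy_decode, cases))  # ['case-2']
```

An empty result would mean that the decoder under test reproduced the expected output for every reference bitstream in the set.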
No MIME type (an IANA media type registered under RFC 6838) is currently defined for MPEG-G files.
No conventional file extensions are defined.