Darwin Core Archive Explained

Darwin Core Archive (DwC-A) is a biodiversity informatics data standard that uses the Darwin Core terms to produce a single, self-contained dataset for species occurrence, checklist, sampling event or material sample data. In essence, an archive is a set of text (CSV) files plus a simple descriptor (meta.xml) that tells consumers how the files are organized. The format is defined in the Darwin Core Text Guidelines.^[1] It is the preferred format for publishing data to the GBIF network.

Darwin Core

The Darwin Core standard^[2] has been used to mobilize the vast majority of specimen occurrence and observational records within the GBIF network.^[3] The Darwin Core standard was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatio-temporal occurrence, and their supporting evidence housed in collections (physical or digital).

The Darwin Core today is broader in scope. It aims to provide a stable, standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core provides stable semantic definitions with the goal of being maximally reusable in a variety of contexts. This means that Darwin Core may still be used in the same way it has historically been used, but may also serve as the basis for building more complex exchange formats, while still ensuring interoperability through a common set of terms.

Archive format

The central idea of an archive is that its data files are logically arranged in a star-like manner, with one core data file surrounded by any number of 'extensions'. Each extension record (or 'extension file row') points to a record in the core file, so zero to many extension records can exist for each core record. This is more space-efficient for data transfer than putting all the data in a single table, which could contain many empty cells.
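The star arrangement can be illustrated with a minimal Python sketch. The file contents, the column names `id` and `coreid`, and the helper `group_extension_rows` are hypothetical and chosen only for illustration; a real archive declares its actual columns in meta.xml.

```python
import csv
import io
from collections import defaultdict

# Hypothetical core file: one row per occurrence, keyed by "id".
CORE = """id,scientificName
occ-1,Puma concolor
occ-2,Lynx rufus
"""

# Hypothetical extension file: each row points back to a core record via "coreid".
EXTENSION = """coreid,identifier
occ-1,http://example.org/puma-1.jpg
occ-1,http://example.org/puma-2.jpg
"""


def group_extension_rows(core_text, ext_text):
    """Return {core id: [extension rows]} for every core record."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(ext_text)):
        grouped[row["coreid"]].append(row)
    core_rows = csv.DictReader(io.StringIO(core_text))
    # Core records without extension rows map to an empty list.
    return {row["id"]: grouped.get(row["id"], []) for row in core_rows}


links = group_extension_rows(CORE, EXTENSION)
for core_id, rows in links.items():
    print(core_id, len(rows))
```

Here one core record has two extension rows and the other has none, which is the "zero to many" relationship without any empty cells in either file.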

Details about recommended extensions can be found in their respective subsections and will be extensively documented in the GBIF registry, which will catalogue all available extensions.

Sharing an entire dataset, rather than paging through web services such as DiGIR and TAPIR, makes data transfer much simpler and more efficient. For example, retrieving 260,000 records via TAPIR takes about nine hours and issues 1,300 HTTP requests to transfer 500 MB of XML-formatted data. The same dataset, encoded as DwC-A and zipped, is a 3 MB file. GBIF therefore highly recommends compressing an archive with ZIP or GZIP when generating a DwC-A.
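As a rough sketch of the compression step, the following Python snippet packs a data file into an in-memory ZIP using the standard `zipfile` module and compares sizes. The file name `occurrence.txt`, the generated rows, and the helper `zip_archive` are assumptions made for illustration; a real archive would also carry its descriptor and metadata files, and the ratio depends entirely on the data. The last line only works out the page size implied by the TAPIR figures above.

```python
import io
import zipfile

# Hypothetical data: 10,000 repetitive occurrence rows in CSV form.
header = "id,scientificName,country\n"
rows = "".join(f"occ-{i},Puma concolor,Mexico\n" for i in range(10_000))
data = (header + rows).encode("utf-8")


def zip_archive(files):
    """Pack {file name: bytes} into an in-memory ZIP and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buffer.getvalue()


archive = zip_archive({"occurrence.txt": data})
print(len(data), len(archive))  # uncompressed vs. zipped size in bytes

# 260,000 records over 1,300 requests implies 200 records per request.
records_per_request = 260_000 // 1_300
```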

An archive requires stable identifiers for core records, but not for extensions. Any shared dataset therefore needs at least local record identifiers. It is good practice to maintain, alongside the original data, identifiers that are stable over time and are not reused after a record is deleted. Where possible, provide globally unique identifiers instead of local ones.
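A publisher might run a simple consistency check on identifiers before sharing data. The sketch below assumes that core identifiers should be unique within a file and that every extension row should reference an existing core record; the function name `check_identifiers` and the sample identifiers are hypothetical.

```python
from collections import Counter


def check_identifiers(core_ids, extension_coreids):
    """Return (duplicate core ids, extension coreids with no matching core record)."""
    counts = Counter(core_ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    orphans = sorted(set(extension_coreids) - set(counts))
    return duplicates, orphans


# Hypothetical identifiers: "occ-2" is repeated and "occ-9" has no core record.
duplicates, orphans = check_identifiers(
    ["occ-1", "occ-2", "occ-2"],
    ["occ-1", "occ-9"],
)
print(duplicates, orphans)
```

A check like this cannot show that identifiers are stable over time; that depends on how the source database assigns and retires them.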

Archive descriptor

To be completed.
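As a rough illustration of what a descriptor can look like, the Python sketch below parses a small meta.xml with the standard library. The element and attribute names follow the Darwin Core Text Guidelines as understood here and should be checked against that document; the file names, the single mapped term, and the helper `describe_core` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the descriptor elements (per the Text Guidelines).
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Hypothetical descriptor: one core file, an id column, and one mapped term.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>
"""


def describe_core(meta_xml):
    """Extract the core file location, id column, and column-to-term mapping."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwc:core", NS)
    return {
        "file": core.find("dwc:files/dwc:location", NS).text,
        "id_index": int(core.find("dwc:id", NS).get("index")),
        "fields": {
            int(f.get("index")): f.get("term")
            for f in core.findall("dwc:field", NS)
        },
    }


info = describe_core(META_XML)
print(info["file"], info["id_index"], info["fields"])
```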

Dataset metadata

A Darwin Core Archive should include a metadata file that describes the whole dataset. The Ecological Metadata Language (EML) is the most common format for this, but simple Dublin Core files are also used.

External links

Notes and References

1. Darwin Core Text Guidelines. http://rs.tdwg.org/dwc/terms/guides/text/
2. Wieczorek, J.; Bloom, D.; Guralnick, R.; Blum, S.; Döring, M.; De Giovanni, R.; Robertson, T.; Vieglais, D. (2012). "Darwin Core: An Evolving Community-developed Biodiversity Data Standard". PLoS ONE. 7 (1): e29715. doi:10.1371/journal.pone.0029715. PMID 22238640. PMC 3253084. Bibcode: 2012PLoSO...729715W.
3. Darwin Core Archives – How-to Guide. https://github.com/gbif/ipt/wiki/DwCAHowToGuide