ENCODE Data at UCSC
Specifications of Common File Formats Used by the ENCODE Consortium

September 2013

The ENCODE consortium uses several file formats to store, display, and disseminate data:

FASTQ[1] is a text-based format for storing nucleotide sequences (reads) and their quality scores. The Sequence Alignment/Map (SAM)[2] format is a text-based format for storing read alignments against reference sequences; it is interconvertible with the binary BAM format. The bigWig format is an indexed binary format for rapid display of dense, continuous data in the UCSC Genome Browser. The bigBed format is likewise an indexed binary format, used for rapid display of annotation items such as a linked collection of exons or the binding peaks of a transcription factor.
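As an illustration of what "text-based reads and quality scores" means in practice, the following minimal Python sketch reads FASTQ records and decodes their quality strings. It assumes the common four-line record layout (header beginning with `@`, sequence, separator beginning with `+`, quality string) and a Phred+33 quality offset; the offset is an assumption here, not something this overview specifies. The names `FastqRecord` and `parse_fastq` are hypothetical helpers, not part of any ENCODE or UCSC tool.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: its name, bases, and per-base quality scores."""
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumes unwrapped records (one sequence line, one quality line)
    and a Phred quality offset of 33 unless another is given.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one score as ASCII code minus offset.
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A single made-up record, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
records = list(parse_fastq(example))
```

Here `records[0].qualities` holds one integer per base of `records[0].sequence`. Files that wrap sequences across several lines, or that use a different quality offset, would need a more careful reader than this sketch.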

These file formats were originally designed to be generic and flexible. Because ENCODE is a collaborative effort, the consortium has adopted additional specifications for each format to facilitate data archival, presentation, and distribution, as well as integrative analysis of the data. The consortium treats FASTQ as the basic archival format, so the FASTQ specifications aim to preserve the raw sequence data. The other formats are geared towards data visualization and dissemination, so their specifications aim to make the files easy to use.


Updated 4 Dec 2013

FASTQ: Original Text-based Reads and Quality Scores for Archival Purpose

FASTQ file content

FASTQ Sequencing quality

BAM: Binary Format of Sequence Alignment/Map (SAM)

BAM file content

BAM mapping parameters

bigWig: Genome Browser Signal (Wiggle) Files in Indexed Binary Format

bigWig file content

Generation of bigWig files

bigBed: Genome Browser BED Files in Indexed Binary Format

bigBed file content

References for Common File Formats Used by the ENCODE Consortium

References