Originally developed for the 1000 Genomes Project , the VCF specification has become the de facto standard output for variant calling software due to its concise format and the increase of sequencing data generated from the Next Generation Sequencing (NGS) methods.
As described in the specification for the Variant Call Format (VCF), there are 3 main sections to each file:
Meta Information Lines - Multiple lines prefixed by double pound symbols (##).
Header Line - Single line prefixed with a one pound symbol (#).
Data Lines - Remainder of the file with 1 position per line.
The Meta section describes the format and content of that specific VCF file. This can include information about the sequencing performed, the variant calling software, or the reference genome used for determining variants. The first few rows from the VCF specification demonstrate this type of information:
This Meta section also declares and describes the fields provided at both the site-level (INFO) and sample-level (FORMAT) in the Data Lines. Below are some examples of each type from the VCF specification document:
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
This design allows for great flexibility in the data represented by any given VCF file, allowing each variant calling pipeline to capture the most accurate data and metadata appropriate possible.
However, this flexibility comes at a cost because downstream processing software may need to account for differences in output formats. At GenomOncology, where we integrate with a variety of DNA sequencers and variant callers, we have invested in making our VCF processing software highly configurable to quickly adapt to new VCF formats that we may encounter.
Each VCF file has a single header line that has 8 mandatory fields separated by tabs that represent columns for each data line:
#CHROM POS ID REF ALT QUAL FILTER INFO
If there is genotype data, then a FORMAT column is declared and followed by unique sample names. All of these column names must be separated by tabs, as well.
Each data line represents a position in the genome. The data corresponds to the columns specified in the header and must be separated by tabs and ended with a newline.
Below are the columns and their expected values. In all cases, MISSING values should be represented by a dot (‘.’).
#CHROM - Chromosome identifier. Examples include 7, chr7, X or chrX.
POS - Reference position. Sorted numerically in ascending order by chromosome.
ID - Unique identifiers separated by semicolons. No whitespaces allowed.
REF - Reference base (ACGT). Insertions can be represented by a dot.
ALT - Comma-separated Alternate base(s) (ACGT). Deletions represented by a dot.
QUAL - Quality score that is on a log scale. 100 means 1 in 10^10 chance of error.
FILTER - Indicates which filters have failed (semicolon-separated), PASS or MISSING.
INFO - Site-level (non-sample) information in semicolon separated name-value format.
FORMAT - Sample-level field name declarations separated by semicolons.
<SAMPLE DATA> - Sample-level field data separated by semicolons corresponding to FORMAT field declarations.
Example Data Explanation
The specification includes an example VCF file.
Position and Ref/Alt Information
Below are some notes to help understand the first 5 columns about the above file.
All of the variants occur on Chromosome 20 on the NCBI36 (hg18).
There are 5 positions identified (14370, 18330, 1110696, 1230237, 1234567).
Three of the variants have IDs including 2 dbSNP records (rs6054257, rs6040355).
The first two positions (14370, 17330) are simple single-base pair substitutions.
The third position has 2 alternate alleles specified (G and T) that replace the ref (A).
The fourth position represents a deletion of a T since the alt allele is missing (“.”).
The fifth row has 2 alt alleles, the first is a deletion of TC and second is insertion of a T.
QUAL and FILTER columns
The QUAL column indicates the quality level of the data at that site. The FILTER column designates what filters can be applied. The 2nd row (position 17330), has triggered the q10 filter, which is described in the meta section as “Quality below 10”.
Each bioinformatics pipeline treats these columns differently, so you will need to consult your pipeline’s subject matter experts on how to best interpret this information.
The info column includes position-level information for that data row and can be thought as aggregate data that includes all of the sample-level information specified.
The format column specifies the sample-level fields to expect under each sample. Each row has the same format fields (GT, GQ, DP, and HQ) except for the last row which does not have HQ.
Each of these fields is described in the Meta section as the following:
GT (Genotype) indicates which alleles separated by / (unphased) or | (phased).
GQ is Genotype Quality which is a single integer.
DP is Read Depth which is a single integer.
HQ is Haplotype Quality and has 2 integers separated by a comma.
Sample and Genotype Information
This VCF file has 3 samples identified by their names (NA00001, NA00002, NA00003) in columns 10 through 12. Below are the relevant columns for each of the samples.
 Variant Call Format Specification
 The variant call format and VCFtools https://academic.oup.com/bioinformatics/article/27/15/2156/402296