SNP and genotype calling

2. VCF and BCF format

One of the most popular text-file formats for storing genetic variation information is the Variant Call Format (VCF)). There are different versions of the format, but the core elements are the same. You can find a full description of the latest iteration here.Due to the large amount of data usually involved, files tend to be stored compressed. BCF is the binary version of VCF, which keeps the same information, but compressed and indexed. It is designed to be much more efficient to process and store large amounts of data. The practical downside is that the contents can only be accessed with bcftools view.

The VCF format is composed of meta-information lines (prefixed wih ##), a header line (prefixed with #), and data lines each containing information about a position in the genome and genotype information on samples for each position (text fields separated by tabs). There are 8 fixed field lines, usually followed by an additional FORMAT field and an arbitrary number of sample fields when genotypes are included.

Table 1 Description of VCF fields

Column Number Title Description
1 CHROM Chromosome/scaffold of the reference genome
2 POS Position on the chromosome/scaffold given in CHROM (1-based)
3 ID ID for the SNP/INDEL (Here always ‘.’)
4 REF Reference allele (usually the allele on reference genome). I can have several bases in the case of indels.
5 ALT Alternate alleles (separated by commas), can have several bases in the case of indels
6 QUAL Variant quality score (phred-scale)
7 FILTER Filter status (‘.’ if unfiltered or hard-filtered, PASS or a semicolon-separated list of codes for filters that fail if soft-filtered)
8 INFO List of variant annotations delimited by ‘;’ and as specified in the meta-information lines
9 FORMAT Describes the format for the sample genotype information in column 10 onward
10 to end Sample id (from SM tag in BAM file RG string) Sample genotype information. Usually it stores the genotype call (GT), genotype likelihoods (GL or PL (phred-scaled)), depth (DP), allelic depth (AD), and genotype quality (GQ).

Some notes about genotype fields:

A graphical explanation of the different parts and fields can be found here.

Let’s now have a look at our files:

bcftools view bcftools/bcftools-Hmel201001.bcf | less -S

The file starts with a few lines specifying the VCF and bcftools versions, the command used with bcftools mpileup, followed by a long list of the >700 scaffolds that comprise the genome assembly (starting with ##contig), the descriptions of the fields for ALT, INFO, and FORMAT, and any additional commands that were executed to produce this file (bcftools call in this case).

To focus only on the header or the SNPs, the flags -h and -H can be used:

bcftools view -h bcftools/bcftools-Hmel201001.bcf | less -S
bcftools view -H bcftools/bcftools-Hmel201001.bcf | less -S

Back to TOC