A guide to reference genome selection


Since the completion of the Human Genome Project in 2003, the human reference genome has continued to be updated and refined by the Genome Reference Consortium (GRC), a team of scientists from NCBI, the Wellcome Trust Sanger Institute, the European Bioinformatics Institute (EBI), and the Genome Institute at Washington University. The reference sequence is relied upon by research groups around the world, and this work underpins much of current genetics research. The aim is to provide the most accurate and complete representation of the human genome by correcting errors, closing gaps, finding ways to represent difficult regions such as centromeres, and capturing genetic variation.


Genome Builds and Annotations

There are regular releases of new builds of reference genomes for many species as they are updated and refined (Table 1). There is one central effort for genome assembly, upon which other centres, notably Ensembl and UCSC, add layers of annotation and tools to manipulate the data.

Table 1: Recent releases of the human and mouse reference genomes

Species | Release Name    | UCSC Version | Release Date
--------|-----------------|--------------|--------------
Human   | NCBI Build 34   | hg16         | July 2003
Human   | NCBI Build 35   | hg17         | May 2004
Human   | NCBI Build 36.1 | hg18         | March 2006
Human   | GRCh37          | hg19         | February 2009
Human   | GRCh38          | hg38         | December 2013
Mouse   | NCBI Build 36   | mm8          | February 2006
Mouse   | NCBI Build 37   | mm9          | July 2007
Mouse   | GRCm38          | mm10         | December 2011

The UCSC versions have their own naming convention, and the names hg18 and hg19 are often used interchangeably with build 36 and build 37, respectively. They are based on the standard NCBI sequence assembly, with UCSC annotation tracks aligned to the new build soon after it is released. Although GRCh37 and hg19 are broadly equivalent, there are some differences, including sequence naming conventions, the use of a 0- or 1-based coordinate system, and the version of the mitochondrial sequence included. Tools are available to translate coordinates from one build to another, and archived versions of the genome remain accessible. In general, the latest version of a species' reference genome is preferable, unless bioinformatics tools or resources are not yet compatible with it, or an earlier build is needed for consistency within a long-term project or for comparison with existing data. The specific type of analysis or application can also influence the choice of reference genome (see below).
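The naming and coordinate differences can be illustrated with a minimal Python sketch. It assumes the commonly used conventions that UCSC-style names carry a "chr" prefix (chr1, chrM) while GRCh37-style names do not (1, MT), and that 0-based intervals are half-open while 1-based intervals are inclusive. The function names are hypothetical, and only the standard chromosomes are handled. Renaming does not reconcile the different mitochondrial sequences, and moving coordinates between builds still requires a dedicated conversion tool.

```python
def ucsc_to_grc_name(name):
    """Convert a UCSC-style name (chr1, chrX, chrM) to GRC style (1, X, MT)."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[3:]
    return name  # already without a prefix


def grc_to_ucsc_name(name):
    """Convert a GRC-style name (1, X, MT) to UCSC style (chr1, chrX, chrM)."""
    if name == "MT":
        return "chrM"
    if name.startswith("chr"):
        return name  # already prefixed
    return "chr" + name


def zero_based_to_one_based(start, end):
    """0-based, half-open [start, end) -> 1-based, inclusive [start, end]."""
    return start + 1, end


def one_based_to_zero_based(start, end):
    """1-based, inclusive [start, end] -> 0-based, half-open [start, end)."""
    return start - 1, end


# The first 100 bases of chromosome 1 in each convention.
print(grc_to_ucsc_name("1"), one_based_to_zero_based(1, 100))
```

Only the start coordinate changes between the two conventions; the end coordinate is numerically the same, which is why off-by-one errors in mixed-source data are easy to miss.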

In terms of annotations, Ensembl and UCSC provide genome browsers to view ‘tracks’ corresponding to different types of annotation. A wide variety of information is gathered from experimental data and bioinformatics predictions and curated by dedicated teams, providing invaluable resources to help interpret the genomic data being generated at a relentless pace.


Recent developments in the human reference genome

The extensive genetic variation between individuals revealed through recent initiatives – such as the 1000 Genomes Project – has been captured in the reference genome; the original linear sequence now has ~250 alternative sequences, at >150 loci, to represent common haplotypes. This has been a particular feature of the latest release, GRCh38, and presents both opportunities and challenges – enabling unprecedented characterisation of genetic variation, but requiring new methods that work with that information. Although aligners handle the presence of SNPs quite easily, they were not designed to work with alternate sequences for the same region and so bioinformatics tools have required major development for GRCh38. The benefit of these alternate loci is optimal alignment in regions of high genetic diversity – ignoring them may result in discarding sequenced reads that align perfectly to an alternate sequence or, worse still, aligning them instead to a similar region of the primary reference where they could appear to contain a variant or mutation.
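One practical consequence is that a GRCh38 reference file contains many sequences besides the primary chromosomes, and it helps to know which is which before interpreting alignments. The following is a minimal sketch, assuming UCSC-style hg38 sequence names in which alternate loci end in "_alt" (for example chr6_GL000250v2_alt), unlocalised scaffolds end in "_random", and unplaced scaffolds begin with "chrUn_". Other distributions of GRCh38 may name these sequences differently, and the function name is hypothetical.

```python
import re
from collections import Counter

# Primary chromosomes under the assumed UCSC-style naming: chr1..chr22, chrX, chrY, chrM.
PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")
# Unplaced scaffolds such as chrUn_KI270302v1 (accession plus version, no suffix).
UNPLACED = re.compile(r"^chrUn_[A-Za-z0-9]+v\d+$")


def classify_contig(name):
    """Classify a sequence name from a GRCh38 reference by its naming pattern."""
    if PRIMARY.match(name):
        return "primary"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "unlocalised"
    if UNPLACED.match(name):
        return "unplaced"
    return "other"  # anything that does not follow the assumed patterns


names = ["chr6", "chr6_GL000250v2_alt", "chr1_KI270706v1_random", "chrUn_KI270302v1"]
print(Counter(classify_contig(n) for n in names))
```

A summary like this, run over the sequence names in a reference index or BAM header, shows whether alternate loci are present and therefore whether an ALT-aware workflow is needed.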

The 1000 Genomes Project (Phase II) also introduced additional sequence to the GRCh37 reference to help reduce false positives for mapping, which is sometimes referred to as hs37d5 or b37+decoy (b37 refers to naming conventions adopted in Phase I that are now widely used for bioinformatics handling of sequencing data). The hs37d5 reference is recommended for optimal read mapping for variant calling, and is therefore suitable for exome and whole-genome sequencing data.

Standardising aspects of the genome reference sequence and its nomenclature, as well as the formats used to describe sequencing data and alignments, has been an important contribution of the 1000 Genomes Project to bioinformatics data processing and analysis tools.


Reference genomes at Oxford Genomics Centre

We maintain a set of reference genomes for the species routinely sequenced at the Oxford Genomics Centre, including human, mouse, rat, and bacterial/viral genomes, as well as model organisms such as Drosophila and zebrafish. We can map to other genomes on request if suitable reference files are supplied (note, however, that not all tools used in our pipelines may be compatible with the latest release of a particular genome).

We encourage users to specify their preferred reference genome when the project is set up, and can advise if needed. You may see a naming convention like GRCh37.EBVB.ERCC, which is the standard reference with certain sequences appended (in this case Epstein-Barr virus genes, and ERCC spike-in sequences). These have been created to provide data for projects where spike-ins have been used, or for particular applications where viral gene expression was of interest.
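As an illustration of how such a name can be read, here is a minimal sketch that splits a reference name into its base assembly and the appended sequence sets. It assumes the components are separated by dots and that the base name itself contains no dot, which holds for the example above but is not a documented rule; the function name is hypothetical.

```python
def split_reference_name(name):
    """Split e.g. 'GRCh37.EBVB.ERCC' into the base assembly and appended sets."""
    base, *extras = name.split(".")
    return base, extras


base, extras = split_reference_name("GRCh37.EBVB.ERCC")
print(base, extras)
```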

If no genome is specified when booking the project, data will be processed without mapping to a reference genome. FASTQ files containing raw sequence data and quality scores are generated, and the quality of the sequencing run is assessed using metrics generated during the run itself and QC of the raw data. When a reference genome is specified, reads are also mapped to it, producing BAM files (aligned reads) along with additional QC metrics for the mapped data. The default aligner is BWA-MEM (ref), but other aligners are used as required, e.g. HISAT2 for RNA-Seq data.
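The decision described above can be summarised in a short sketch. This is an illustration of the logic only, not the Centre's actual pipeline: the function name, the step labels, and the simplification that HISAT2 is chosen whenever the data are RNA-Seq are all assumptions.

```python
def plan_processing(reference=None, rna_seq=False):
    """Return the processing steps implied by the booking information."""
    # These steps happen whether or not a reference genome was specified.
    steps = ["generate FASTQ", "QC of run metrics and raw data"]
    if reference is None:
        return steps  # no mapping without a reference genome

    # Assumed rule: BWA-MEM by default, HISAT2 for RNA-Seq data.
    aligner = "HISAT2" if rna_seq else "BWA-MEM"
    steps.append(f"map reads to {reference} with {aligner}")
    steps.append("write BAM")
    steps.append("QC of mapped data")
    return steps


for step in plan_processing("GRCh37.EBVB.ERCC", rna_seq=True):
    print(step)
```

Specifying the reference at project set-up is what moves a project from the two-step path to the full mapping path.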

Author: Helen Lockstone