Reference Genome Improvement

The Human Genome Project (HGP) produced the human reference genome assembly, a database of DNA sequence that represents an example of a full human genome. When researchers sequence human genomes, they compare, or “align,” their results to this reference. Although this assembly is one of the most frequently used resources in biomedical research, de novo genome assembly remains a significant challenge despite the increase in throughput and the decrease in sequencing cost over the past decade.

Alignment of human sequence reads to the reference assembly is a critical aspect of successful data analysis. Several published reports identify regions of the reference assembly that were previously impossible to analyze because of the limitations of the available sequencing technologies, complex genome architecture, missing sequences, and various errors in the assembly or the underlying sequence data.

Specific aims

We plan to identify and resolve issues (misassemblies, sequence errors and gaps) within the current reference, GRCh38. We will add substantial allelic diversity to the reference to facilitate effective analysis of biomedically important regions across the genome. We will accomplish this by completely finishing two genomes (CHM1 and CHM13) to the “platinum” standard and performing targeted finishing to the “gold” standard in additional genomes. We define a platinum genome as a contiguous, haplotype-resolved representation of the entire genome. A gold genome is a high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.

We will engage the bioinformatics community to ensure that the next generation of aligners, variant callers, annotation pipelines and other bioinformatics tools will be capable of interacting with a multi-allelic reference genome. We will facilitate more effective use of the reference for biomedical discovery by providing detailed tutorials on the required complex tool chains. Finally, through the development and deployment of community outreach and education programs, we will convey the importance of the reference as much more than a linear chromosomal assembly.

Assembly and analysis details

After long reads are generated on the PacBio platform, we assemble them using the Falcon algorithm, followed by error correction using Quiver. The output of this step is a FASTA file of unordered and unoriented contigs. We then align the BioNano genomic map generated from the same individual, along with clone end sequences (if available), to check for global misassemblies. We make breaks where possible based on these data, and output ordered and oriented contigs based on the map alignments. Contigs that do not align are written to a separate file. We then use NCBI’s assembly-assembly alignment and chromosome contig generating software to further QC the assembly.
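The ordering and orienting step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the `MapHit` record, the `order_and_orient` function, and the assumption of exactly one map alignment per contig are all hypothetical simplifications. It shows only the core idea of sorting contigs by their position on the map, reverse-complementing those that align to the minus strand, and setting aside contigs with no alignment.

```python
from typing import Dict, List, NamedTuple, Tuple


class MapHit(NamedTuple):
    """Hypothetical record: where one contig aligns on the genomic map."""
    contig: str
    chrom: str
    start: int
    strand: str  # "+" or "-"


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def order_and_orient(
    contigs: Dict[str, str], hits: List[MapHit]
) -> Tuple[Dict[str, List[Tuple[str, str]]], Dict[str, str]]:
    """Order contigs along each chromosome and flip minus-strand contigs.

    Assumes (for illustration) at most one hit per contig.
    Returns (placed, unaligned): placed maps chromosome -> ordered list of
    (contig name, oriented sequence); unaligned holds contigs with no hit.
    """
    placed: Dict[str, List[Tuple[str, str]]] = {}
    for hit in sorted(hits, key=lambda h: (h.chrom, h.start)):
        seq = contigs[hit.contig]
        if hit.strand == "-":
            seq = reverse_complement(seq)
        placed.setdefault(hit.chrom, []).append((hit.contig, seq))

    aligned = {h.contig for h in hits}
    unaligned = {name: seq for name, seq in contigs.items() if name not in aligned}
    return placed, unaligned


if __name__ == "__main__":
    # Toy data, not real contigs.
    contigs = {"ctgA": "AACG", "ctgB": "TTGA", "ctgC": "GGGG"}
    hits = [MapHit("ctgA", "chr1", 5000, "+"), MapHit("ctgB", "chr1", 100, "-")]
    placed, unaligned = order_and_orient(contigs, hits)
    print(placed)
    print(unaligned)
```

A real workflow would also need to handle contigs that are split at misassembly breakpoints and contigs with multiple or conflicting alignments, which this sketch leaves out.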

Once the assembly is in ordered and oriented chromosome contigs, we annotate it with the NCBI RefSeq gene annotation pipeline and add RepeatMasker and segmental duplication annotations. After annotation, we can integrate other data, such as Illumina alignments and variant calls, clone-based resources, and data from newer technologies such as Dovetail and GemCode, to improve the assembly and assess its quality.
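The text does not say which quality metrics are used. As one common example of a contiguity measure, the sketch below computes N50, the contig length at which contigs of that length or longer cover at least half of the total assembly. The function name `n50` and the toy lengths are illustrative assumptions, not part of the project's tooling.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 if there are none).

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0


if __name__ == "__main__":
    # Toy contig lengths, not taken from any assembly listed here.
    print(n50([10, 8, 5, 3, 2]))
```

N50 measures contiguity only; it says nothing about base accuracy or misassemblies, which is why alignment-based and map-based checks are still needed.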

Sample    Population        Assembly accession   BioProject
NA19240   Yoruban           GCA_001524155.2      PRJNA288807
HG00514   Han Chinese       GCA_002180035.1      PRJNA300843
NA12878   European          GCA_002077035.1      PRJNA323611
HG00733   Puerto Rican      GCA_002208065.1      PRJNA300840
HG01352   Colombian         GCA_002209525.1      PRJNA339719
NA19434   Luhya             GCA_002872155.1      PRJNA385272
HG02059   Kinh-Vietnamese   GCA_003070785.1      PRJNA339726
HG03486   Mende             GCA_003086635.1      PRJNA438669
HG02818   Gambian           GCA_003574075.1      PRJNA339722
HG03807   Bengali           GCA_003601015.1      PRJNA490190
