The Human Genome Project (HGP) produced the human reference genome assembly, a database of DNA sequence that represents an example of a full human genome. When researchers sequence human genomes, they compare, or “align”, their results to this reference. Although this assembly is one of the most frequently used resources in biomedical research, de novo genome assembly remains a significant challenge despite the increase in throughput and the decrease in sequencing cost over the past decade. Alignment of human sequence reads to the reference assembly is a critical aspect of successful data analysis, and several published reports identify regions of the reference assembly that were previously impossible to analyze because of the limitations of the available sequencing technologies, complex genome architecture, missing sequences, and various errors in the assembly or the underlying sequence data.
At the time of publication, the ‘finished’ human genome (NCBI35; GCF_000001405.11) contained 288 assembly gaps. During work to close these gaps, many were found to lie in regions containing structurally variant alleles. Because the reference genome is assembled from the sequence of many donors, it represents a haploid mosaic. For example, in the MAPT region, the allele represented in the assembly was unlikely to be present in any individual human, because it had been constructed by mixing the direct and inverted haplotypes present in the RP11 donor.
We can now correct many of the deficiencies in the current human reference by applying the latest advances in sequencing and mapping technologies. One key advance has been the sequencing and finishing of single haplotype human genomes (e.g., CHM1 and CHM13). These data sets can be used to completely resolve both alleles of several genome sequences and improve the GRCh38 reference assembly.
The explosion of clinical genome sequencing requires a human reference genome resource that accurately represents the diversity of the human population, thereby facilitating the identification and characterization of disease-associated variants and somatic events. Although the 1000 Genomes (1KG) project provided a valuable foundation, it is now necessary to select additional representative human genomes, or regions of human genomes, for deep sequencing, assembly, and finishing to high quality and contiguity. These new assemblies should be accessible to the scientific community in the context of the existing reference genome, with improved bioinformatics tools that provide intuitive access to alternate paths, alleles, and haplotypes. In addition, the program will provide greater outreach and educational opportunities aimed at empowering users, especially those who may be relatively new to sequencing technology and analysis, to make better use of the human reference genome resource in discovery projects and clinical sequencing applications.
We will sequence and assemble at least five diploid genomes from individuals selected to maximize human genetic diversity (see table below). All sources chosen thus far have BAC libraries available and, whenever possible, we will use samples from a trio (two parents and a child). We will sequence the parents within each trio at a lower depth of coverage to enable haplotype phasing of the proband sequence. The samples selected at this time are one Yoruban (NA19240), one Han Chinese from Beijing (HG00514), one CEPH European (NA12878), one Puerto Rican (HG00733), one Luhya from Webuye, Kenya (NA19434), one Colombian (HG01352), one Gambian (HG02818), and one Kinh from Vietnam (HG02059). Other independent efforts to sequence and assemble new reference genomes include two Japanese genomes, one Malaysian genome, one Han Chinese genome, and an Ashkenazim trio (as part of the Genome in a Bottle effort).
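The reason parental data enables phasing can be illustrated with a minimal sketch. It assumes unphased genotype calls are already available for the child and both parents at a single site; the function name and the genotype representation are hypothetical and are not taken from the project's actual pipeline.

```python
from typing import Optional, Tuple

# Hypothetical representation: an unordered pair of alleles, e.g. ("A", "G").
Genotype = Tuple[str, str]


def phase_child_site(
    child: Genotype, mother: Genotype, father: Genotype
) -> Optional[Tuple[str, str]]:
    """Return (maternal_allele, paternal_allele) for the child at one site.

    Returns None when the parents' genotypes cannot decide the phase,
    for example when all three individuals are heterozygous, or when the
    genotypes are not consistent with Mendelian inheritance.
    """
    a, b = child
    if a == b:
        # Homozygous site: both haplotypes carry the same allele.
        # (No Mendelian consistency check is done in this case.)
        return (a, b)

    # Try both assignments of the child's two alleles to the two parents.
    candidates = []
    for maternal, paternal in ((a, b), (b, a)):
        if maternal in mother and paternal in father:
            candidates.append((maternal, paternal))

    # Exactly one consistent assignment means the site is phased.
    return candidates[0] if len(candidates) == 1 else None
```

For instance, a child called `("A", "G")` with a mother called `("A", "A")` must have inherited `A` maternally and `G` paternally. Sites that stay ambiguous under this rule would need other evidence, such as long reads spanning neighboring phased sites.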
We plan to identify and resolve issues (misassemblies, sequence errors, and gaps) within the current reference, GRCh38. We will add substantial allelic diversity to the reference to facilitate effective analysis of biomedically important regions across the genome. We will accomplish this by completely finishing (“platinum”) two genomes (CHM1 and CHM13) and performing targeted finishing (“gold”) in additional genomes. We define a platinum genome as a contiguous, haplotype-resolved representation of the entire genome. A gold genome is defined as a high-quality, highly contiguous representation of the genome with haplotype resolution of critical regions.
We will engage the bioinformatics community to ensure that the next generation of aligners, variant callers, annotation pipelines, and bioinformatics tools will be capable of interacting with a multi-allelic reference genome. We will facilitate more effective use of the reference for biomedical discovery by providing detailed tutorials of the required complex tool chains. Finally, through the development and deployment of community outreach and education programs, we will convey the importance of the reference as much more than a linear chromosomal assembly.
After long reads are generated on the PacBio platform, we assemble them using the Falcon algorithm, followed by error correction using Quiver. The output of this step is a FASTA file of unordered and unoriented contigs. We then align the BioNano genomic map generated from the same individual, along with clone end sequences (if available), to check for global misassemblies. We make breaks where possible based on these data, and output ordered and oriented contigs based on the map alignments. Contigs that do not align are written to a separate file. We then use NCBI’s assembly-assembly alignment and chromosome contig generating software to further QC the assembly.
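The ordering and orienting step described above can be sketched as follows. This is an illustration only: the `MapAlignment` record, its fields, and the assumption of exactly one alignment per contig are hypothetical simplifications, not the format produced by BioNano tools or used by the actual pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class MapAlignment:
    """Hypothetical record: where one contig lands on the genome map."""
    contig: str      # contig name as it appears in the FASTA file
    chrom: str       # chromosome (or map) the contig aligns to
    map_start: int   # leftmost aligned position on the map
    strand: str      # "+" or "-"


def order_and_orient(
    contigs: Dict[str, str], alignments: List[MapAlignment]
) -> Tuple[Dict[str, List[Tuple[str, str]]], List[str]]:
    """Order contigs by map position and flip those on the minus strand.

    Returns (placed, unaligned): placed maps each chromosome to its
    contigs in map order; unaligned lists contigs with no alignment.
    Assumes at most one alignment per contig.
    """
    placed: Dict[str, List[Tuple[str, str]]] = {}
    for aln in sorted(alignments, key=lambda a: (a.chrom, a.map_start)):
        seq = contigs[aln.contig]
        if aln.strand == "-":
            seq = reverse_complement(seq)
        placed.setdefault(aln.chrom, []).append((aln.contig, seq))

    aligned = {aln.contig for aln in alignments}
    unaligned = sorted(name for name in contigs if name not in aligned)
    return placed, unaligned
```

A real implementation would also have to handle contigs that align in several pieces, since those are the candidates for the misassembly breaks mentioned above.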
Once the assembly is in ordered and oriented chromosome contigs, we run the NCBI RefSeq gene annotation pipeline and further annotate the assembly with RepeatMasker and segmental duplications. After annotation, we can integrate other data, such as Illumina alignments and variant calls, clone-based resources, and data from newer technologies such as Dovetail and GemCode, to improve the assembly and assess its quality. See http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/ for more information regarding NCBI pipelines.
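One widely used summary of assembly contiguity is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The text above does not say which metrics the project reports, so the following is only a sketch of how contiguity might be checked on a contig FASTA file; the function names are illustrative.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> Dict[str, int]:
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the contig name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the sum to half the total.
        if running * 2 >= total:
            return length
    return 0
```

With a file on disk, this could be called as `n50(list(contig_lengths(open(path)).values()))`, where `path` points at the contig FASTA.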
|Data Source|Origin of Samples|Quality|Status/Links|
|---|---|---|---|
|HG00514|Han Chinese|Gold|Assembly QC|
|HG00733|Puerto Rican|Gold|Assembly QC|