VIETNAMESE GENOME REFERENCE (HG38-VN)

INTRODUCTION

The most recent global reference genomes are derived from populations in the United States (GRCh37, also known as hg19) and Europe (hg38). GRCh37 was derived from 13 anonymous volunteers from New York. Variant calling currently identifies variants as differences between an individual's genome and the hg38 reference genome. However, the reference alleles (also known as primary alleles) at many sites were determined from populations of European descent. In other populations these alleles may be rare, which leads to false-positive or missed variant calls. Variant analysis is therefore expected to be less accurate for Asian genomes, whose genetic origin has little association with hg38. As a result, it is necessary to first create a reference genome that is specific to the Vietnamese population, VGR 1.0. Many attempts have been made toward this goal.

PRODUCT DESCRIPTION

In this module, we create the Vietnam Genome Reference using Illumina whole-genome sequencing (WGS) data from the 1KVG Project. The 1KVG data were collected from 1,050 healthy Vietnamese individuals with a variety of phenotypes. All participants are Kinh, Vietnam's largest ethnic group, which accounts for 86% of the Vietnamese population. The sexes are evenly represented (1:1), and participants come from three major geographical regions (North 37%, Central 22%, and South 41%). Clinical data are carefully collected and screened to ensure that samples come from healthy people. We currently have access to both the raw and the processed data held by VinBDI, which is carrying out the 1KVG Project. In addition, we will mine the data from the 99 KHV samples.

First, using 1KVG data, we aim to customize the current hg38 reference genome. The custom hg38-VN reference genome reflects the actual distribution of reference and alternative alleles in the Vietnamese population. Variants found in the 1KVG dataset can also be used to discover additional associated variants in genomic reference data for Asian populations. This expansion is expected to raise the total number of detected variants from 30 million to more than 50 million, including rare variants. The hg38-VN reference genome can be used as an intermediate output of VGR.
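One common way to customize a linear reference is to replace the reference base with the population's major allele wherever the alternative allele is more frequent. The following is a minimal sketch of that idea, not the project's actual pipeline: the `Snv` record, the `major_allele_reference` function, the 0.5 threshold, and the toy sequence and frequencies are all hypothetical, and only biallelic single-nucleotide variants with one record per position are handled.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Snv:
    """Hypothetical biallelic SNV record with a cohort allele frequency."""
    pos: int        # 0-based position on the contig
    ref: str        # base in the original reference
    alt: str        # alternative base
    alt_af: float   # alternative allele frequency in the cohort (0..1)


def major_allele_reference(ref_seq: str, snvs: Sequence[Snv],
                           threshold: float = 0.5) -> str:
    """Return a copy of ref_seq carrying the cohort's major allele at each SNV.

    Assumes at most one SNV per position; the threshold is an illustrative choice.
    """
    seq = list(ref_seq)
    for v in snvs:
        # Guard against records that do not match the reference they claim to describe.
        if ref_seq[v.pos] != v.ref:
            raise ValueError(f"reference mismatch at position {v.pos}")
        # Swap only when the alternative allele is the more common one.
        if v.alt_af > threshold:
            seq[v.pos] = v.alt
    return "".join(seq)


# Toy data, invented for illustration only.
toy_ref = "ACGTACGT"
toy_snvs = [
    Snv(pos=1, ref="C", alt="T", alt_af=0.82),  # alt is the major allele: swapped
    Snv(pos=4, ref="A", alt="G", alt_af=0.10),  # alt is rare: reference kept
    Snv(pos=6, ref="G", alt="A", alt_af=0.55),  # alt is slightly more common: swapped
]
custom_ref = major_allele_reference(toy_ref, toy_snvs)
print(custom_ref)  # ATGTACAT
```

With a reference built this way, a sample carrying the population's common allele no longer produces a variant call at that site, which is the effect the customization is meant to achieve.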

[Figure: VGR]


Second, we intend to create the Vietnam Genome Reference (VGR) using a more cost-effective de novo assembly strategy, namely a hybrid approach that combines 1KVG data with single-molecule real-time (SMRT) sequencing data from PacBio. We will generate enough SMRT data to connect scaffolds and fill gaps in the NGS (next-generation sequencing) assemblies. This will resolve long repeat regions in the human genome and/or recover megabases of DNA that are missed during NGS assembly, resulting in significantly improved scaffold N50 and gene-set completeness. We will sequence at least ten samples on the PacBio Sequel II, using up to four SMRT Cells per sample. A single SMRT Cell 8M generates approximately 20X coverage (70 Gb) of long reads (N50 of 50 kbp), yielding 60-80X of data per sample.
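The two quantities used above, N50 and fold coverage, can be checked with a few lines of arithmetic. This is a small sketch under stated assumptions: the function names are illustrative, the haploid human genome size of about 3.1 Gb is an assumed round figure, and the toy lengths are invented.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


HUMAN_GENOME_BP = 3.1e9  # assumption: approximate haploid human genome size

# Toy scaffold lengths, invented for illustration.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 70 (80 + 70 = 150, which is at least half of 290)

# 70 Gb from one SMRT Cell 8M, as stated in the text.
per_cell = fold_coverage(70e9, HUMAN_GENOME_BP)
print(round(per_cell, 1))  # 22.6, consistent with "approximately 20X"
```

At roughly 20X per cell, the stated 60-80X per sample corresponds to three to four cells.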

Next, a genome graph will be constructed. Rather than using a single linear genome as the reference, a genome graph represents all known variants. Once the assembly is complete, we will integrate the variant set of the Vietnamese population (as defined in hg38-VN) into the reference. Graph structures for genomic references have recently been introduced and are gaining traction in the research community. According to recent research, a genome graph is a compact and comprehensive reference model that helps avoid bias toward a single consensus genome and is more robust for variant calling and genomic analyses. Furthermore, this model supports continuous updating, so new sample data can be incorporated easily. Once constructed, the genome graph can be used to map reads and call variants: query reads are aligned to a graph structure that contains previously identified alternative sequences. At the same time, data from new samples can be systematically added to the existing graph to make it more complete. As a key innovation of this project, the reference-based genome graph structure and the variants discovered using 1KVG data will be gradually integrated into the VGR.
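To make the graph idea concrete, here is a toy sequence graph in which nodes hold sequence segments and edges connect segments that can follow one another. It is an illustrative sketch only: the `SequenceGraph` class and its methods are hypothetical and do not correspond to the API of any genome graph toolkit, the sequences are invented, and the graph is assumed to be acyclic.

```python
from collections import defaultdict
from typing import Dict, Iterator, List


class SequenceGraph:
    """Toy directed acyclic sequence graph (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.seq: Dict[int, str] = {}
        self.edges: Dict[int, List[int]] = defaultdict(list)

    def add_node(self, node_id: int, sequence: str) -> None:
        self.seq[node_id] = sequence

    def add_edge(self, src: int, dst: int) -> None:
        if src not in self.seq or dst not in self.seq:
            raise KeyError("both nodes must exist before adding an edge")
        if dst not in self.edges[src]:
            self.edges[src].append(dst)

    def haplotypes(self, start: int, end: int) -> Iterator[str]:
        """Yield the sequence spelled by every path from start to end."""
        stack = [(start, self.seq[start])]
        while stack:
            node, spelled = stack.pop()
            if node == end:
                yield spelled
                continue
            for nxt in self.edges[node]:
                stack.append((nxt, spelled + self.seq[nxt]))


# A single SNV site: node 2 carries the reference base, node 3 the alternative.
graph = SequenceGraph()
for node_id, sequence in [(1, "ACG"), (2, "T"), (3, "C"), (4, "GGA")]:
    graph.add_node(node_id, sequence)
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(src, dst)

before = sorted(graph.haplotypes(1, 4))
print(before)  # ['ACGCGGA', 'ACGTGGA']

# Incremental update: a new sample shows a deletion of the middle base,
# which is added as one extra edge without rebuilding the graph.
graph.add_edge(1, 4)
after = sorted(graph.haplotypes(1, 4))
print(after)  # ['ACGCGGA', 'ACGGGA', 'ACGTGGA']
```

The reference and alternative alleles sit side by side as equally valid paths, and the new deletion is incorporated by adding an edge, which mirrors the continuous updating described above.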

Finally, the VGR genome browser will be published and maintained in a systematic manner. An information portal integrated with the genome browser will be developed and made available to the research community. This platform will allow users to download, compare, search, and manage regions of interest in the VGR reference genome directly. VinBDI, which currently hosts and maintains the 1KVG data, will host and maintain the database and services. Figure 2 summarizes the main points of these goals.

[Figure 2: Summary of the project goals]

A live version of the VGP Genome Browser can be found here