Article originally published on Hannes Smarason's blog, 11/12/2017
By Hannes Smarason

Taking on Cancer: Breakthroughs in DNA Sequencing Using TCGA & AI

Geneticists can use the latest AI and Big Data analytics technology to study, diagnose, and even treat cancer.

Cancer is one of the most active fields in genomics, spurring mountains of research papers and clinical trials. WuXi NextCODE is committed to pushing this field forward, and so we devoted a special “Genomes for Breakfast” session to the topic at the recent ASHG17 event. Panelists discussed our pathbreaking work on how to:

  • Extract impactful findings from the renowned TCGA dataset
  • Get better sequencing results from FFPE samples
  • Apply deep learning to drug discovery, drug repurposing, and identifying subtypes for diagnostics and clinical trials


The Cancer Genome Atlas (TCGA) is one of the most useful public genomic cancer databases available. It has led to critical discoveries, including entirely new drug targets and better insights into tumor origination, development, and spread.

TCGA houses data from approximately 11,000 patients and 33 cancer types. Data types include WES, RNA-Seq, miRNA, CNV, methylation array, and clinical sample data. The data is large and complex, and it can include multiple samples from one patient, which is crucial to account for when doing analyses.
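To make the point about multiple samples per patient concrete, here is a minimal Python sketch of one way to group TCGA samples by patient before an analysis. It relies on the public TCGA barcode convention, in which the first three dash-separated fields identify the patient and the first two characters of the fourth field give the sample type. The function names, the "keep one primary tumor per patient" rule, and the example barcodes are illustrative assumptions, not part of TCGA or of our platform.

```python
from collections import defaultdict

# A subset of the TCGA sample-type codes (first two characters of the
# fourth barcode field). Codes not listed here are reported as "Other".
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}

# Made-up barcodes for illustration only; they do not refer to real patients.
EXAMPLE_BARCODES = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-11A",
    "TCGA-AA-0001-06A",
    "TCGA-BB-0002-01A",
    "TCGA-BB-0002-01B",
]


def parse_barcode(barcode):
    """Return (patient_id, sample_type_code) for a TCGA-style barcode."""
    parts = barcode.split("-")
    patient_id = "-".join(parts[:3])   # e.g. "TCGA-AA-0001"
    sample_code = parts[3][:2]         # e.g. "01" from "01A"
    return patient_id, sample_code


def group_by_patient(barcodes):
    """Map each patient to a list of (barcode, sample type name)."""
    groups = defaultdict(list)
    for barcode in barcodes:
        patient_id, code = parse_barcode(barcode)
        groups[patient_id].append((barcode, SAMPLE_TYPES.get(code, "Other")))
    return dict(groups)


def one_primary_tumor_per_patient(barcodes):
    """Keep a single primary-tumor sample per patient.

    Assumption for this sketch: when a patient has several primary-tumor
    samples, the alphabetically first barcode is kept.
    """
    chosen = {}
    for barcode in sorted(barcodes):
        patient_id, code = parse_barcode(barcode)
        if code == "01" and patient_id not in chosen:
            chosen[patient_id] = barcode
    return chosen
```

Calling `group_by_patient(EXAMPLE_BARCODES)` shows which patients contribute more than one sample, and `one_primary_tumor_per_patient(EXAMPLE_BARCODES)` gives one way to avoid counting a patient twice in a per-patient statistic. Which sample to keep is a study-design decision; the rule used here is only a placeholder.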

During his ASHG talk, Jim Lund, WuXi NextCODE’s Director of Tumor Product Development, shared some insights into how we put this rich data source to work in concert with our own data and analytical tools, in a process he dubs “multiomics analysis.”

Researchers can search the data on our platform by cancer type, age of diagnosis, sex, ethnicity, year of diagnosis, sample type (e.g. metastatic, new primary), and more.

Multiple pivotal studies using this dataset have already been published, including some examining the prevalence of specific mutations across human cancer types as well as in-depth profiling of specific tumors, such as breast cancer and lung adenocarcinoma.

Layering different types of data, such as reads from DNA and RNA, allows much more accurate detection of features such as variants with allele-specific effects on gene expression. Our user-friendly but sophisticated data interface makes it easier to see such findings. Over the years, our own database and our capabilities have grown exponentially, creating a powerful tool for multiomics cancer research.
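As a rough illustration of what layering DNA and RNA reads can look like, the sketch below flags a variant as showing allele-specific expression when the RNA allele counts deviate from the allele fraction seen in the DNA. This is a generic textbook approach (an exact binomial test), not a description of our platform's method. The function names, the heterozygosity window of 0.3 to 0.7, and the significance threshold are all assumptions chosen for the example.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one. Requires 0 < p < 1.
    """
    def log_prob(i):
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))

    probs = [exp(log_prob(i)) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so floating-point ties count as equal.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


def allele_specific_expression(dna_ref, dna_alt, rna_ref, rna_alt,
                               min_dna_vaf=0.3, max_dna_vaf=0.7, alpha=0.01):
    """Compare RNA allele counts with the DNA allele fraction at one site.

    Returns None when the site has no reads or does not look heterozygous
    in the DNA (thresholds are assumptions for this sketch).
    """
    dna_total = dna_ref + dna_alt
    rna_total = rna_ref + rna_alt
    if dna_total == 0 or rna_total == 0:
        return None
    dna_vaf = dna_alt / dna_total
    if not (min_dna_vaf <= dna_vaf <= max_dna_vaf):
        return None
    # Under no allele-specific effect, the RNA alt fraction should
    # track the DNA alt fraction.
    p_value = binomial_two_sided_p(rna_alt, rna_total, dna_vaf)
    return {
        "dna_vaf": dna_vaf,
        "rna_alt_fraction": rna_alt / rna_total,
        "p_value": p_value,
        "imbalanced": p_value < alpha,
    }
```

For example, `allele_specific_expression(20, 20, 95, 5)` describes a site that is balanced in DNA but strongly skewed toward the reference allele in RNA, while `allele_specific_expression(20, 20, 50, 50)` describes a site with no imbalance. A real analysis would also need to handle mapping bias, copy-number changes, tumor purity, and multiple-testing correction, none of which this sketch attempts.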


In his talk, Dr. Shannon Bailey described how Whole Genome Sequencing (WGS) can be applied to formalin-fixed paraffin-embedded (FFPE) tumor samples, which are stored by the hundreds of thousands in repositories around the world.

Shannon is the Associate Director of our Cancer Genetics division. He pointed out that while these samples are abundant and often paired with extensive clinical and outcome data, there are hurdles to using them for large-scale retrospective studies.

For one thing, the genetic material in such samples can be degraded, crosslinked, or present only in low quantities. Of these problems, the biggest is obtaining a sufficient quantity of high-quality DNA for sequencing. Numerous studies have found that these samples are difficult to work with and often yield very low success rates in gene sequencing studies. Fresh frozen tissue samples provide much better results, but they are also much harder to obtain.

In response, our team has developed the WuXi NextCODE SeqPlus FFPE extraction method, which provides substantially improved coverage compared to traditional methods and even approximates the results obtained with fresh frozen samples at 10X depth, with similar numbers of heterozygous and homozygous calls.

We tested SeqPlus in a study that comprised 516 tumor-normal pairs (i.e., 1,032 samples) that had been stored for 3 to 6 years. The targeted sequencing depth was 30X for the normal tissue and 70X for tumor tissue. The starting amount of DNA was 400 ng.

The results were excellent: coverage with SeqPlus was only about 1% below what the fresh frozen control samples achieved.

Further, a comparison of our analyses to results from the TCGA, using fresh frozen samples, showed striking similarity. These study results give us confidence that SeqPlus is a new “power tool” for FFPE sequencing studies. This webinar describes the process.

[Figure: FFPE DNA sample reads. Sequencing reads from a sample prepared with the traditional whole-genome sequencing workflow for fresh-frozen samples, alongside data generated with the SeqPlus whole-genome FFPE method. The center of the image shows a C to A mutation in each of the tumor samples.]


Another area of great interest at WuXi NextCODE is artificial intelligence (AI). We have been pioneers in using AI to pull novel insights out of multiple massive datasets. Leading this effort is Tom Chittenden, our Vice President of Statistical Sciences, Founding Director of the Advanced AI Research Labs, and a Lecturer on Pediatrics and Biological Engineering at Harvard Medical School and MIT. He also spoke at the Genomes for Breakfast series.

Our AI capabilities improve our existing tools and extend what they can do. For example, using our AI tools, we can:

  • Improve functional annotation of missense variants to an accuracy of >99%
  • Integrate multiple types of data to discover new genes and elaborate pathways
  • Improve tumor subtype and drug-response classification accuracy by combining DNA- and RNA-seq, among other data types

These tools can be used for such varied purposes as target discovery, drug repurposing, and defining responders and non-responders in clinical trials.

We’ve already helped to develop breakthrough results, such as identifying an intriguing new target for both cardiovascular and cancer drug discovery. We’ve also classified breast and lung cancer subtypes with 97% to 100% accuracy, classified 8,200 tumors of 22 TCGA cancer types with >99% accuracy, and discovered a completely novel pan-cancer molecular survival signature.

The power of our deepCODE AI tools is in part thanks to a novel, causal statistical-learning method and deep-learning classification strategy. But another advantage is that they were built on our global platform for genomic data, which underpins the majority of the world’s largest genomics efforts and includes all major global reference databases.

Our database stores, manages, and integrates any type of genomic data and correlates it with phenotype, ‘omics’, biology, outcome, and virtually any other type of data that may be relevant to a particular medical challenge.
