Identification of Novel Diagnostic and Prognostic Biomarker Panels for Cancer Through Big Data Analysis


Current diagnostic and prognostic biomarkers for cancer significantly lack the sensitivity and specificity required for accurate early-stage diagnosis or for accurate prognosis to develop optimized treatment plans. This study aimed to implement a novel biomarker detection method to identify higher-performing biomarker panels in LUAD, GBM, and KIRC.

Instead of isolating candidates solely from high-throughput sequencing, the synergistic use of genome-wide RNA sequencing and RNA interference screens was proposed for isolation of better-performing biomarkers. The reasoning behind the accompanying use of RNA interference data lies in the additional information revealed by these results – if a gene is found to be essential for the survival of cancer cells in addition to its high expression, it may be a better biomarker candidate.

This method identified novel biomarker panels in LUAD, GBM, and KIRC that significantly outperformed current prognostic biomarkers and the TNM staging system (LUAD and KIRC only), with average AUC value increases of 22%, 24%, and 28%, respectively. In addition, the LUAD and KIRC panels presented AUC values of 0.89 and 0.99, respectively, when analyzed for performance in diagnosing stage I&II patients, a significant increase (P<0.0001) over current diagnostic biomarkers.

Furthermore, LUAD patients characterized by the biomarker panel CFL1-KRT18-GLUL were found to exhibit a largely shared resistance to cisplatin treatment.

These panels hold high clinical potential due to the superior performance levels observed and their predicted measurability through blood. In addition, expanded application of this novel method can reveal better-performing panels in other cancers.

Question / Proposal


Biomarkers are being increasingly used for patient diagnosis and prognosis in the clinic. However, most current diagnostic and prognostic biomarkers are still not clinically actionable in diagnosing early-stage tumors or predicting patient outcomes, with high false-positive and false-negative rates. Diagnosis at earlier stages can result in drastic increases in survival rates, and accurate prognosis can guide physicians to build a more effective treatment plan for patients. For instance, the 5-year survival rate for NSCLC patients diagnosed at an early stage is 80% compared to the overall 17.8% 5-year survival rate, and the 5-year survival rate drops from 90% to 12% for KIRC patients diagnosed after metastasis. Therefore, the discovery and implementation of higher-performing biomarkers can immensely improve patient survival rates in cancer.


Thus far, attempts to identify novel biomarkers have largely relied on high-throughput sequencing alone. However, this approach has produced modest results and has yet to reveal a highly sensitive and specific biomarker.

The primary purpose of this project was to develop and utilize a novel biomarker identification method to reveal biomarker panels exceeding the performance of current markers in lung adenocarcinoma, glioblastoma, and clear cell renal carcinoma.


I hypothesized that the combined use of RNA sequencing and RNA interference screen data, rather than sole reliance on high-throughput sequencing, can better identify biomarker candidates. Cross-referencing candidates across these two data sources would uncover biomarkers with higher performance than current markers.


Cancer statistics

Despite numerous breakthroughs in the field of cancer research, cancer remains a leading cause of death worldwide [1]. An estimated 1,735,350 new cases of invasive cancer are expected to be diagnosed in the United States alone in 2018, equivalent to more than 4,700 diagnoses each day [2]. Of these cases, lung cancer constitutes 234,030 cases, brain cancer constitutes 23,880 cases, and kidney cancer constitutes 65,340 cases. These cancers rank among the lowest in survival rates, with lung cancer, glioblastoma, and metastatic renal carcinoma possessing 5-year survival rates of 17.8%, 3-5%, and 12%, respectively [3,4,5]. Late-stage diagnosis, resulting in a harder-to-treat tumor, and inaccurate prognosis, leading to an ineffective treatment plan, are among the main reasons behind the high mortality rates. With earlier diagnosis and, subsequently, easier and better prognosis, 5-year survival rates rise to 80% and 90% in lung and renal cancer, respectively [5,6]. While most glioblastoma cases are classified as primary, meaning they arise without a known precursor, secondary glioblastomas, which develop from lower grade gliomas, have a significantly better prognosis than their primary counterparts [4].

Current Diagnosis and Prognosis Methods

Currently, biomarkers are used in conjunction with various other methods, including CT scans for lung, brain, and renal cancer, anatomical pathology for lung and renal cancer, and fluorescence bronchoscopy for lung cancer, to diagnose patients [4,7,8]. Diagnostic biomarkers function as an important test to enhance the sensitivity and specificity of imaging tests and tissue biopsies; however, early-stage diagnosis of certain cancers, such as lung cancer, is still rare, mainly due to the limited accuracy of current diagnostic biomarkers. For instance, studies have reported the sensitivities of CYFRA 21-1 and CEACAM, commonly used diagnostic biomarkers for NSCLC, to be 43% and 31%, respectively, too low for use in clinical settings [9,10]. The accompanying diagnostic methods, such as anatomical pathology and fluorescence bronchoscopy, present excellent sensitivity and specificity overall but perform worse in diagnosing early-stage tumors [7].

Prognosis prediction of lung and renal cancer is usually based on the TNM staging system, and prognosis prediction of glioblastoma is usually done through biomarkers such as MGMT methylation status and IDH1/IDH2 [4,11]. However, accumulating data have shown that the TNM staging system performs poorly in guiding treatment planning [12]. Similarly, the performance of prognostic biomarkers is still too low for effective use in a clinical setting.

Current Methods of Biomarker Identification

In past studies, novel diagnostic and prognostic biomarkers have largely been identified through a single method of genomic or proteomic profiling coupled with a series of data analyses, including ROC curves for performance determination and stepwise analysis for identification of biomarker panels [13, 14, 15]. While this method has effectively revealed multiple novel biomarkers for clinical use, the majority of these markers fail to translate to the clinic, mainly due to weak clinical performance.


Given the advantages of earlier diagnosis and improved prognosis and the failure of current biomarkers, it is apparent that there is a need for better-performing biomarkers through a novel identification approach.

Method / Testing and Redesign

Gene Pool Establishment

RNA sequencing (RNA-Seq) data for the LUAD, GBM, and KIRC datasets in TCGA were retrieved from cBioPortal. For each cancer, the average expression signal of every gene across all samples was calculated, and the 100 genes with the highest averages formed the initial gene pool. Each initial gene pool was further narrowed down through analysis with RNA interference (RNAi) data available through Project Achilles by Broad Institute. Project Achilles reports RNAi results as log2 fold changes, with a more negative value indicating a greater role for the gene in cell survival. All genes with a log2 fold change > -0.32 (equivalent to a fold change of approximately 0.80) were eliminated from the initial gene pool.
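The two-step filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actual analysis was run in JMP, the function name `build_gene_pool`, the gene names, and all numbers below are hypothetical, and dropping genes that have no RNAi measurement is an assumption not stated in the text.

```python
from statistics import mean

def build_gene_pool(expression, rnai_log2fc, top_n=100, cutoff=-0.32):
    """Return highly expressed genes that also look essential in the RNAi screen.

    expression:  dict mapping gene -> list of per-sample expression values
    rnai_log2fc: dict mapping gene -> log2 fold change from the RNAi screen
    """
    # Step 1: rank genes by average expression across all samples, highest first.
    ranked = sorted(expression, key=lambda g: mean(expression[g]), reverse=True)
    initial_pool = ranked[:top_n]
    # Step 2: eliminate genes whose log2 fold change is above the cutoff.
    # Assumption: genes without an RNAi measurement are also dropped.
    return [g for g in initial_pool
            if g in rnai_log2fc and rnai_log2fc[g] <= cutoff]

# Hypothetical toy data (not taken from TCGA or Project Achilles).
expression = {
    "GENE_A": [12.0, 11.5, 12.4],
    "GENE_B": [9.0, 9.5, 8.8],
    "GENE_C": [3.0, 2.5, 3.1],
    "GENE_D": [10.0, 10.2, 9.9],
}
rnai_log2fc = {"GENE_A": -1.10, "GENE_B": -0.05, "GENE_C": -2.00, "GENE_D": -0.40}

pool = build_gene_pool(expression, rnai_log2fc, top_n=3)
print(pool)
```

In this toy run GENE_C is strongly depleted in the RNAi screen but never enters the pool because its expression is low, which mirrors the requirement that a candidate be both highly expressed and essential.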

Identification of Prognostic Biomarkers

  • Patient clinical data was retrieved from TCGA and matched with RNA-Seq data.
  • Patients were divided into high (top 25%) or low (bottom 25%) expression groups.
  • Cox univariate regression revealed genes whose expression significantly inversely correlated with overall survival and/or disease-free survival.
  • Recurrence rates were calculated by dividing progression/recurrence incidences by total sample size, and Fisher’s Exact Test revealed significant direct correlations between expression and recurrence rate.
  • The initial gene pool was further narrowed down to include only genes exhibiting significance in at least one of the above tests; these were deemed potential prognostic biomarkers.
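The grouping and recurrence comparison in the steps above can be illustrated with a minimal Python sketch. It assumes a two-sided Fisher's Exact Test on a 2x2 table of recurrence counts; the helper names (`quartile_groups`, `recurrence_rate`, `fisher_exact_two_sided`) and the sample data are hypothetical, and the Cox regression step is not reproduced here.

```python
from math import comb

def quartile_groups(values):
    """Split sample ids into low (bottom 25%) and high (top 25%) expression groups.

    values: dict mapping sample id -> expression of one gene
    """
    ordered = sorted(values, key=values.get)
    k = len(ordered) // 4  # group size; the middle 50% of samples is unused
    low = ordered[:k]
    high = ordered[len(ordered) - k:]
    return low, high

def recurrence_rate(group, recurred):
    """Progression/recurrence incidences divided by group size."""
    return sum(s in recurred for s in group) / len(group)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical expression values and recurrence events for one gene.
values = {"s1": 1.0, "s2": 5.0, "s3": 3.0, "s4": 8.0,
          "s5": 2.0, "s6": 7.0, "s7": 4.0, "s8": 6.0}
recurred = {"s2", "s4", "s6"}

low, high = quartile_groups(values)
print(recurrence_rate(high, recurred), recurrence_rate(low, recurred))

# Larger hypothetical groups: 8 of 10 high-expression patients recurred
# versus 2 of 10 low-expression patients.
p_value = fisher_exact_two_sided(8, 2, 2, 8)
print(round(p_value, 4))
```

With real cohorts each quartile group holds many more patients than in this toy split, so the test has correspondingly more power.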

Assembly and Performance Testing of Biomarker Panels

  • The original patient sample was divided into sub-samples based upon tumor stage (LUAD, KIRC) or tumor subtype (GBM).
  • Backward stepwise variable regression (default P-value penalty of 0.25) on biomarker candidates correlated with overall survival, disease-free survival, or recurrence rate revealed the highest-performing biomarker panel in each sub-sample.
  • The performance of the identified panels, current prognostic biomarkers for LUAD, GBM, and KIRC, and the TNM staging system (LUAD and KIRC only) was analyzed through ROC curve analysis. Resulting AUC values were recorded.
  • Pairwise comparison analyzed panel performance against either current prognostic biomarker performance or TNM staging system performance. Panels significantly outperforming current prognosis methods (p-value < 0.05) were chosen for further analysis.
  • Panel performance was validated through neural prediction analysis.
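To make the AUC comparison in the steps above concrete, the following Python sketch computes the area under the ROC curve directly from scores and outcomes using its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case). The scores are invented for illustration, `roc_auc` is a hypothetical helper, and the significance test used for the pairwise comparison in JMP is not reproduced.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = event, 0 = no event).

    Counts, over all positive/negative pairs, how often the positive case
    has the higher score; ties count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and predictor values for six patients.
labels = [1, 1, 0, 1, 0, 0]
panel_scores = [0.90, 0.80, 0.40, 0.35, 0.30, 0.10]   # illustrative panel score
staging_scores = [2, 1, 2, 2, 1, 1]                   # illustrative stage-like score

panel_auc = roc_auc(panel_scores, labels)
staging_auc = roc_auc(staging_scores, labels)
print(round(panel_auc, 3), round(staging_auc, 3))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is the scale on which the panels and the existing prognosis methods are compared.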

Early-Stage Diagnostic Performance of Biomarker Panels

  • TCGA RNA-Seq data for normal samples in LUAD and KIRC datasets were retrieved from Firehose (GBM normal data was unavailable).
  • Normal samples were categorized into high (top 25%) or low (bottom 25%) expression groups, and the tumor sample was restricted to include only stage I&II tumors.
  • ROC curve analysis identified performance of biomarker panels and current diagnostic biomarkers in differentiating between early-stage tumors and normal tissue.
  • Findings were verified through neural prediction analysis.
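The sensitivity and specificity underlying each point of a diagnostic ROC curve can be illustrated as follows. This is a hedged sketch with invented scores, not the JMP analysis: it assumes each sample has a single panel score and is called "tumor" when that score is at or above a cutoff, and the function names are hypothetical.

```python
def sensitivity_specificity(tumor_scores, normal_scores, cutoff):
    """Classify a sample as tumor when its score is at or above the cutoff."""
    true_pos = sum(s >= cutoff for s in tumor_scores)   # tumors called tumor
    true_neg = sum(s < cutoff for s in normal_scores)   # normals called normal
    return true_pos / len(tumor_scores), true_neg / len(normal_scores)

def roc_points(tumor_scores, normal_scores):
    """(1 - specificity, sensitivity) for every observed score used as a cutoff."""
    cutoffs = sorted(set(tumor_scores) | set(normal_scores), reverse=True)
    points = []
    for c in cutoffs:
        sens, spec = sensitivity_specificity(tumor_scores, normal_scores, c)
        points.append((1 - spec, sens))
    return points

# Hypothetical panel scores for early-stage tumors and normal tissue.
tumor_scores = [0.9, 0.7, 0.6, 0.2]
normal_scores = [0.5, 0.3, 0.1, 0.05]

sens, spec = sensitivity_specificity(tumor_scores, normal_scores, cutoff=0.55)
print(sens, spec)
print(roc_points(tumor_scores, normal_scores))
```

Sweeping the cutoff from the highest to the lowest score traces the ROC curve from the lower left to the upper right corner; the AUC summarizes the whole sweep in one number.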

Treatment Response Analysis

ROC curve analyses were implemented to identify shared responses to treatments exhibited by patients characterized by each biomarker panel. This aspect of the study was restricted to response to cisplatin treatment in LUAD patients due to unavailability of data for other treatments/cancers.   


All analyses were conducted using JMP software.


Identification of biomarkers for the survival and progression/recurrence prognosis of patients in each cancer

Each gene in the initial candidate pool was analyzed for correlations with overall survival or disease-free survival prognosis using RNA-Seq and matched clinical data. Utilizing Cox univariate models to assess hazard ratios (HR) between the high (top 25%) and low (bottom 25%) expression groups, 19 genes out of a pool of 134 candidates across the three cancers were found to show a significant negative correlation between expression and overall survival, and 17 genes out of the same candidate pool demonstrated a negative correlation between expression and disease-free survival. The HR, 95% Confidence Intervals (CI), and p-values of each potential biomarker are shown in Figure 1A-1B. In addition, recurrence rates for patients in the high or low expression groups were calculated for all genes in the candidate pool to reveal genes whose expression correlated with incidences of tumor progression/recurrence. Genes exhibiting a significant difference in recurrence rate between high and low expression patient groups are shown in Figure 1C.

Formation of novel high-performing prognostic biomarker panels

The new candidate pool, consisting only of potential prognostic biomarkers, was used to compile the best-performing biomarker panels for each tumor stage (LUAD and KIRC) or each subtype (GBM) using stepwise analyses. ROC curve analysis revealed the performance levels of each panel as well as the TNM staging system (LUAD and KIRC) or current prognostic biomarkers (GBM). As the TNM staging system outperformed all biomarkers in LUAD and KIRC in this sample, panel performance was only compared to the TNM staging system. The AUC values of these panels compared to current prognosis methods (red lines) are shown in Figure 2A-2C. Pairwise comparison identified three panels with an AUC value significantly greater than that of the TNM staging system or current markers. To verify the reproducibility of these results, panels were validated through neural prediction analysis.

Performance of selected biomarker panels in early-stage diagnosis is superior to that of current diagnostic biomarkers

To explore the potential of these panels in early-stage diagnosis, RNA-Seq data for normal patient samples were used to evaluate the capabilities of the three optimal panels in distinguishing early-stage tumors from normal tissue. ROC curves revealed the strong sensitivity/specificity exhibited by CFL1-KRT18-GLUL in diagnosing Stage I/II LUAD patients and by VIM-HSP90B1-CALR in diagnosing Stage I/II KIRC patients (Figure 3A, 3B), significantly exceeding that of current diagnostic biomarkers in LUAD and KIRC (P < 0.001). CD81-CTSB could not be analyzed due to the unavailability of normal data in GBM. Neural prediction analysis validated these findings as well.

CFL1-KRT18-GLUL predicts patient response to cisplatin treatment with high sensitivity/specificity

To test whether patients characterized by these panels demonstrated similar responses to treatments, ROC curve analysis was performed using treatment response data. It revealed a strong correlation between positive testing for the CFL1-KRT18-GLUL panel and resistance to cisplatin treatment (Figure 4, AUC = 0.80). A similar analysis could not be performed for the other panels due to a lack of available treatment response data from TCGA.




This study demonstrated the effectiveness of the synergistic utilization and analysis of large-scale RNA-Seq, matching clinical data, and RNAi data in identifying high-performing novel biomarker panels in LUAD, GBM, and KIRC. The panels CFL1-KRT18-GLUL, CD81-CTSB, and VIM-HSP90B1-CALR for stage II LUAD, proneural GBM, and stage II KIRC patients, respectively, demonstrated strong correlations with overall survival or disease progression/recurrence incidence and could accurately predict these outcomes (AUC > 0.70). These panels significantly outperformed current prognostic biomarkers and/or the TNM staging system, and CFL1-KRT18-GLUL and VIM-HSP90B1-CALR were also found to be able to differentiate early-stage LUAD and KIRC tumors, respectively, from normal samples with extremely high accuracy (AUC = 0.89, 0.99). Furthermore, LUAD patients characterized by CFL1-KRT18-GLUL exhibited a shared resistance to cisplatin treatments. Based upon these results, the initial hypothesis was accepted.


Although much progress has been made in decreasing mortality rates and improving prognosis for lung adenocarcinoma, glioblastoma, and metastatic clear cell renal carcinoma patients, these cancers still remain among the deadliest cancers faced by patients. Attempts to integrate protein biomarkers to enhance accurate early-stage diagnosis for easier-to-treat tumors and accurate prognosis for construction of improved specialized treatment plans have been largely met with modest results and only slight upticks in survival rates in these cancers. This is mainly due to the lacking sensitivity and specificity of these markers, with most unable to exceed 60% in either performance category.

This study used a novel biomarker identification method to uncover three novel biomarker panels in LUAD, GBM, and KIRC that may significantly improve upon current diagnostic and prognostic biomarkers. Most of the genes included in the panels encode proteins whose levels have been successfully measured in patient blood samples in past studies, providing potential for development into minimally-invasive diagnostic and prognostic tests. As all biomarkers were also chosen from a pool of genes designated as essential for the survival of their respective cancers through cross reference with the Project Achilles database, these panels can function as targets for the development of novel therapies. Patterns of response to existing treatments, specifically cisplatin, were observed for patients characterized by the CFL1-KRT18-GLUL panel in LUAD. As cisplatin is currently a front-line treatment for most LUAD patients, this information can be crucial for the development of an effective treatment plan for patients. While a more complete analysis of patient treatment responses remains to be done, the preliminary discovery of a connection between CFL1-KRT18-GLUL and cisplatin resistance indicates potential for the use of these identified biomarker panels in improving treatment planning through accurate prediction of patient response to various treatments.

Limitations/Future Directions

  • The reported findings are solely bioinformatics-based; therefore, further study through clinical trials and in-lab experimentation is required to validate the performance of these panels.
  • The initial identification of CFL1-KRT18-GLUL, CD81-CTSB, and VIM-HSP90B1-CALR as biomarker panels possessing large potential for clinical use in threefold applications verifies the capability of combinatory RNA-Seq and RNAi data analysis, which can be further applied to discover similarly functioning biomarker panels in various other cancers.

About me

I am a junior at the Roanoke Valley Governor's School and Cave Spring High School in Roanoke, VA. I enjoy swimming, producing music, and tutoring STEM-related topics and SAT/ACT prep in my free time and aspire to be a physician-scientist in the future. To me, winning the Google Science Fair would mean a lot because it would allow my research to gain wider recognition, stimulating more research endeavors and making a clinical application of my findings a possibility.

Research first drew my interest when I was in kindergarten in Shanghai, China, when I was chosen to develop a project with my grandfather and represent my school at a science fair convention for elementary schools around Shanghai. My project miraculously placed 2nd overall in my division, and since that experience I have been fascinated with conducting research to uncover answers to puzzling questions. After learning of my grandfather's diagnosis with clear cell renal carcinoma earlier this year, I was inspired to apply my fascination with research to the task of improving the outlook for cancer patients, leading to the formulation of the project presented here.

Out of the many scientists I admire, I would say I admire Michael Faraday the most. Not only did Faraday conduct groundbreaking research in the field of electromagnetism, he did so without a formal education and with only limited mathematical abilities. His drive and pure passion for science motivate me to not let setbacks or failures discourage me from continuing my research.


Contact Information

This study was conducted under guidance from Dr. Robin Varghese from The Edward Via College of Osteopathic Medicine Virginia Campus. Email and conference calls were the main modes of contact between myself and my mentor during this study.

Health & Safety


As this project was solely computational, involving data analysis and machine learning using publicly available datasets, no special health/safety guidelines were required. RNA sequencing and matching patient clinical data were obtained from The Cancer Genome Atlas (TCGA), and RNA interference screen data were obtained from Project Achilles by Broad Institute.

Links to data portals

TCGA Data Portal 1 (cBioPortal)

TCGA Data Portal 2 (Firehose)

Project Achilles Data Portal


Bibliography, references, and acknowledgements


1. Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global cancer statistics, 2012. CA: A Cancer Journal for Clinicians. 2015;65(2):87-108. doi:10.3322/caac.21262.

2. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2018. CA: A Cancer Journal for Clinicians. 2018;68(1):7-30. doi:10.3322/caac.21442.

3. Zappa C, Mousa SA. Non-small cell lung cancer: current treatment and future advances. Translational Lung Cancer Research. 2016;5(3):288-300. doi:10.21037/tlcr.2016.06.07.

4. Davis M. Glioblastoma: Overview of Disease and Treatment. Clinical Journal of Oncology Nursing. 2016;20(5). doi:10.1188/16.cjon.s1.2-8.

5. Atkins MB, Tannir NM. Current and emerging therapies for first-line treatment of metastatic clear cell renal cell carcinoma. Cancer Treatment Reviews. 2018;70:127-137. doi:10.1016/j.ctrv.2018.07.009.

6. Henschke CI, Yankelevitz DF, Libby DM, Pasmantier MW, Smith JP, Miettinen OS. Survival of patients with stage I lung cancer detected on CT screening. N. Engl. J. Med. 2006; 355: 1763–1771.

7. Wang H, Wu S, Zhao L, Zhao J, Liu J, Wang Z. Clinical use of microRNAs as potential non-invasive biomarkers for detecting non-small cell lung cancer: A meta-analysis. Respirology. 2014;20(1):56-65. doi:10.1111/resp.12444.

8. Butz H, Nofech-Mozes R, Ding Q, et al. Exosomal MicroRNAs Are Diagnostic Biomarkers and Can Mediate Cell–Cell Communication in Renal Cell Carcinoma. European Urology Focus. 2016;2(2):210-218. doi:10.1016/j.euf.2015.11.006.

9. Zamay T, Zamay G, Kolovskaya O, et al. Current and Prospective Protein Biomarkers of Lung Cancer. Cancers. 2017;9(12):155. doi:10.3390/cancers9110155.

10. Pastor A, Menendez R, Cremades MJ, Pastor V, Llopis R, Aznar J. Diagnostic value of SCC, CEA and CYFRA 21.1 in lung cancer: a Bayesian analysis. Eur Respir J. 1997;10:603–609.

11. Greene FL, Page DL, Fleming ID, et al. AJCC Cancer Staging Manual. 6th ed. New York, NY: Springer; 2002.

12. Ludwig JA, Weinstein JN. Biomarkers in Cancer Staging, Prognosis and Treatment Selection. Nature Reviews Cancer. 2005;5(11):845-856. doi:10.1038/nrc1739.

13. Xiao GG, Recker RR, Deng H-W. Recent Advances in Proteomics and Cancer Biomarker Discovery. Clinical medicine Oncology. 2008;2. doi:10.4137/cmo.s539.

14. Grund B, Sabin C. Analysis of biomarker data: logs, odds ratios, and receiver operating characteristic curves. Current Opinion in HIV and AIDS. 2010;5(6):473-479. doi:10.1097/coh.0b013e32833ed742.

15. Vradi E, Brannath W, Jaki T, Vonk R. Model selection based on combined penalties for biomarker identification. Journal of Biopharmaceutical Statistics. 2017;28(4):735-749. doi:10.1080/10543406.2017.1378662.

Databases/Data Portals

1. Grossman RL, Heath AP, Ferretti V, et al. Toward a Shared Vision for Cancer Genomic Data. New England Journal of Medicine. 2016;375(12):1109-1112. doi:10.1056/nejmp1607591.

2. Gao J, Aksoy BA, Dogrusoz U, et al. Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal. Science Signaling. 2013;6(269). doi:10.1126/scisignal.2004088.

3. Cerami E, Gao J, Dogrusoz U, et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery. 2012;2(10):960-960. doi:10.1158/

4. Tsherniak A, Vazquez F, Montgomery PG, et al. Defining a Cancer Dependency Map. Cell. July 2017. doi:10.1016/j.cell.2017.06.010.


All data analyses were conducted using JMP software. JMP 14 Online Documentation was used for guidance in running various analyses with the software.


I would like to sincerely thank my mentor, Dr. Robin Varghese from the Edward Via College of Osteopathic Medicine (VCOM), for his extensive help and guidance with this project. He taught me how to download and work with the large datasets used in this study in JMP and provided direction and encouragement when I encountered barriers in my analyses. Without his advice and feedback, this project would not have materialized.

I would also like to thank my family and teachers at the Roanoke Valley Governor’s School for their support throughout the development of this project.