Making Big Data available: Informatics Infrastructure supporting the MRC/CRUK funded Stratification in COloRecTal Cancer (S-CORT) programme



Andrew Blake1, Enric Domingo1, Michael Youdell1, Susan Richman2, Peter Stewart3, Celina Whalley4, Clare Hardy5, Katerina Chatazipli5, on behalf of S-CORT Consortium6, Tim Maughan1
1University of Oxford, 2University of Leeds, 3Queen's University Belfast, 4University of Birmingham, 5Wellcome Trust Sanger Institute, 6S-CORT Consortium



In the era of personalised medicine, Big Data containing prognostic and predictive information drives the discovery process of stratifying patients for therapy. The aims of S-CORT are to develop biomarkers predictive of response to oxaliplatin, radiotherapy, limited surgery and novel therapies in colorectal cancer. This is being achieved by generating and analysing large multi-omic datasets from UK clinical trials and prospective cohorts. The multimodal dataset includes gene expression array data, DNA mutation data from gene panel NGS with genome-wide copy number, and methylation data from Illumina EPIC arrays, combined with IHC, pathological data and clinical patient data from 2000 individuals.


To ensure that this dataset is robust and reproducible, several tiers of informatics infrastructure support are required. Secure platforms for sample distribution tracking and scientific data upload have been established. Rigorous data quality control pipelines run across the multi-omic dataset to maintain strict data integrity and accuracy. To facilitate data integration and subsequent statistical analysis, the dataset is standardised and harmonised into a central database.

These tiers use open source technologies including R, MySQL and Drupal. Statistical analyses are run on a High-Performance Computing cluster. Data integration within web platforms (tranSMART, cBioPortal) allows for hypothesis generation and testing as well as effective data distribution.


We now have comprehensive multi-omic and digital pathology data from ~500 patients being analysed for response to oxaliplatin, ~310 for response to radiotherapy, ~450 for prognosis in early disease, ~80 for response to irinotecan and ~160 for new therapies, as well as combined cohort statistical analysis plans. We have identified variables showing independent prognostic information in one metastatic cohort.


The robust, transparent, reproducible and accessible nature of the Big Data made available through this core infrastructure has ideally placed S-CORT to deliver its aims of driving changes in clinical practice through informed patient stratification.