Faking it: Building a Simulacrum of non-identifiable modelled cancer data to support research


Cong Chen1, Christopher Drennan2, Brian Shand3, Sally Vernon3, Georgios Lyratzopoulos4, Sophie Newbound3, Jem Rashbass3, Peter Treasure3, Matthew Williams5
1 Public Health England, National Cancer Registration and Analysis Service (PHE NCRAS), 2 Public Health England, 3 PHE NCRAS, 4 PHE NCRAS & University College London, 5 Imperial

Abstract

Background

The National Cancer Registry contains data about 15 million cancer patients, recording over 200,000 diagnosed tumours each year. The data is confidential and access is stringently controlled by the Office for Data Release. The Simulacrum project aims to create a simulated dataset which matches the real datasets as closely as possible, to make cancer data more widely accessible.

Method

We statistically tested key feature variables in the 2014 cancer data for independence, and inferred associations between variables where independence was rejected. Guided by these linkages, we sampled from distributions given by the real cancer data to produce tumour-level data. This presentation discusses the tools developed to test the data for realism, the methods used to ensure preservation of research-relevant statistical features, and the steps taken to limit disclosure risk.
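As a rough illustration of the approach described in this section, the sketch below measures association between two categorical variables and then samples synthetic records, conditioning the second variable on the first only when the pair appears associated. It is a minimal sketch under stated assumptions, not the project's implementation: the abstract does not name the test used, so Cramér's V (derived from the chi-squared statistic) with an arbitrary threshold stands in for it, and the function names, variable categories and counts are all hypothetical. It applies no disclosure controls.

```python
import random
from collections import Counter, defaultdict
from math import sqrt


def cramers_v(pairs):
    """Cramér's V for a list of (a, b) categorical pairs.

    Built on the chi-squared statistic of the contingency table;
    0 means no association, 1 means perfect association.
    """
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def sample_records(pairs, n_out, threshold=0.1, seed=0):
    """Draw n_out synthetic (a, b) records from empirical distributions.

    If the variables look associated (V >= threshold, an assumed cut-off),
    b is drawn from its distribution conditional on a; otherwise a and b
    are drawn independently from their marginals.
    """
    rng = random.Random(seed)
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    linked = cramers_v(pairs) >= threshold
    by_first = defaultdict(list)
    for a, b in pairs:
        by_first[a].append(b)
    out = []
    for _ in range(n_out):
        a = rng.choice(first)
        b = rng.choice(by_first[a]) if linked else rng.choice(second)
        out.append((a, b))
    return out


# Hypothetical toy data: (age group, stage) pairs, not real registry counts.
toy = (
    [("<60", "early")] * 40 + [("<60", "late")] * 10
    + [("60+", "early")] * 20 + [("60+", "late")] * 30
)
synthetic = sample_records(toy, 1000)
```

In this toy table the age group and stage are clearly associated, so stage is sampled conditionally on age group; with more variables the same idea extends to sampling each variable given the variables it is linked to.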

Results

The datasets produced replicate the shape and quality constraints of real cancer data. In testing, low-dimensional statistics (incidence and age profiles by cancer site, and stage distribution) correspond closely to those for real-world data. Modelling preserves key multidimensional characteristics of the data, such as the influence of age on stage, which are identified automatically from strong correlations in the original cancer registry dataset. Because of this correspondence in shape and distributions, queries run on the simulated data can be expected to give results similar to those from the real data, and the same queries can be run on the real data without significant modification.
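One simple way to quantify how closely a low-dimensional statistic such as stage distribution matches between real and simulated data is the total variation distance between the two empirical distributions. The sketch below is illustrative only: the abstract does not say which comparison measures were used, and the function names and example values are hypothetical.

```python
from collections import Counter


def proportions(values):
    """Empirical proportion of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {k: c / total for k, c in counts.items()}


def total_variation(real, simulated):
    """Total variation distance between two categorical samples.

    0 means identical distributions; 1 means no overlap at all.
    """
    p, q = proportions(real), proportions(simulated)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical stage labels, not real registry data.
real_stage = ["1"] * 50 + ["2"] * 30 + ["3"] * 20
sim_stage = ["1"] * 45 + ["2"] * 35 + ["3"] * 20
distance = total_variation(real_stage, sim_stage)
```

A small distance on each marginal is necessary but not sufficient for realism, which is why multidimensional characteristics need to be checked separately.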

Conclusion

The creation of a non-identifying modelled dataset removes a huge obstacle for research on cancer data. It should support estimates of data quality and size of cohorts, or exploration before detailed investigation. This dataset is a valuable resource for academic and commercial researchers to build their case for further data access and provides a proof of concept for modelling of other cancer datasets.