Modern Cancer Analytics: a natural language application to a complex biological landscape




Jessica Black1, Paul O'Reilly2, Manuel Salto-Tellez2, Darragh McArt2
1Bioinformatics and Integromics Laboratory, CCRCB, 2CCRCB, Queen's University Belfast



Many scientists are unfamiliar with the programming and computational paradigms required to analyse large and increasingly complex datasets, resulting in a backlog of data analysis and pressure on bioinformaticians. Additionally, numerous analytical pipelines are available, but there is no framework for their integration. The AnalyticsDSL is a domain-specific language (DSL) that provides an integrative and flexible natural language syntax for defining pipelines, parsing statements that invoke various analytical tools.


The DSL was created using ANTLR v4, a parser generator, and uses scientific terminology and natural language statements to define analytical workflows. Its terminology includes keywords commonly used in the literature. Submitted commands are split into tokens and parsed into a syntax tree; recognised statements then trigger the corresponding analytical functions within a Python utilities class. Connected analytical tools include R packages for microarray analysis and ‘bowtie2’ for RNA-Seq analysis. Analyses are run through a robust DSL workflow in a single environment, which also outputs associated metadata that can be used to draft materials and methods files for publication and archiving.
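
The token-to-function dispatch described above can be sketched in plain Python. This is only an illustration under stated assumptions: the actual AnalyticsDSL grammar is generated by ANTLR v4 and is not reproduced here, and the keywords, handler names and the dataset identifier `GSE0000` are hypothetical placeholders.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry mapping a leading keyword to a Python utility function.
# The real DSL recognises statements through an ANTLR-generated parser instead.
HANDLERS: Dict[str, Callable[[List[str]], str]] = {}


def handler(keyword: str):
    """Register a function as the action for statements starting with `keyword`."""
    def register(func: Callable[[List[str]], str]):
        HANDLERS[keyword] = func
        return func
    return register


@handler("download")
def download_dataset(args: List[str]) -> str:
    # Placeholder: a real utility would call an R package or external tool here.
    return f"download dataset {args[-1]}"


@handler("normalise")
def normalise(args: List[str]) -> str:
    # Placeholder action; the method name is taken from the last token.
    return f"normalise using {args[-1]}"


def tokenize(statement: str) -> List[str]:
    """Split a natural language statement into word-like tokens."""
    return re.findall(r"[A-Za-z0-9_]+", statement)


def run(statement: str) -> str:
    """Tokenise a statement and dispatch it on its first token (case-insensitive)."""
    tokens = tokenize(statement)
    if not tokens:
        raise ValueError("empty statement")
    keyword = tokens[0].lower()
    if keyword not in HANDLERS:
        raise ValueError(f"unrecognised statement: {statement!r}")
    return HANDLERS[keyword](tokens[1:])


if __name__ == "__main__":
    print(run("Download dataset GSE0000"))
```

A generated parser would replace the keyword lookup with grammar rules and tree listeners, but the overall shape (statement, tokens, recognised rule, Python function) is the same.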


Initial results from a DSL-based microarray analysis demonstrate strong scientific utility. By coupling the ‘GEOBase’ and ‘limma’ R packages with the QUADrATiC connectivity mapping platform, a gene expression connectivity mapping workflow was defined within a single paragraph. The defined steps cover downloading a specified dataset from GEO, setting parameters for gene expression analysis, and finally discovering candidate compounds by connectivity mapping. With a single input, AnalyticsDSL removes the need to use R scripts and QUADrATiC directly while providing useful data output files.
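
As an illustration of how metadata recorded for each workflow step might be turned into draft materials and methods text, the following Python sketch renders one sentence per step. The `StepRecord` structure, the `methods_text` function and the parameter names are hypothetical and are not taken from AnalyticsDSL; only the tool names come from the workflow described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StepRecord:
    """Hypothetical metadata captured for a single workflow step."""
    action: str
    tool: str
    params: Dict[str, str] = field(default_factory=dict)


def methods_text(records: List[StepRecord]) -> str:
    """Render one methods sentence per step, in workflow order."""
    sentences = []
    for record in records:
        sentence = f"{record.action} was performed with {record.tool}"
        # Parameters are sorted by name so the output is reproducible.
        details = ", ".join(f"{k}={v}" for k, v in sorted(record.params.items()))
        if details:
            sentence += f" ({details})"
        sentences.append(sentence + ".")
    return " ".join(sentences)


if __name__ == "__main__":
    workflow = [
        # Parameter names and values below are placeholders, not real settings.
        StepRecord("Differential expression analysis", "limma", {"adjust": "BH"}),
        StepRecord("Connectivity mapping", "QUADrATiC"),
    ]
    print(methods_text(workflow))
```

Keeping the record separate from the rendering step means the same metadata could also be archived in a structured form alongside the generated text.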


Initial tests of the DSL for workflow definition show that it removes the need to use multiple computing paradigms per analysis. The flexibility of ANTLR will also allow the grammar to be expanded to include future tools and pipelines, so this DSL layer can be readily integrated for further use.