CORAL: COde RepresentAtion Learning

Data collection and analysis have always been key components of the scientific process. In recent years, an increasingly large volume of data science code has become available online, yet these artifacts have remained largely un-analyzed at scale because processing them requires domain-specific expertise.

CORAL is a weakly supervised transformer architecture designed to "understand" computer code and answer previously unresolvable questions about scientific analysis.

What does CORAL provide?

Neural Network

Weakly Supervised Transformer for Calculating Code Embeddings

CORAL provides an extendable framework and accompanying Python library for embedding arbitrary code snippets for static analysis.
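To make the idea of "embedding arbitrary code snippets" concrete, here is a minimal sketch of the kind of input CORAL's transformer consumes: a code snippet serialized into a flat sequence of abstract syntax tree (AST) tokens. This uses Python's standard `ast` module; the function name and traversal details are illustrative assumptions, not the actual CORAL library API.

```python
import ast

def ast_token_sequence(code: str) -> list[str]:
    """Serialize a code snippet's AST into a flat token sequence.

    A breadth-first traversal that emits node types plus identifier and
    attribute names -- a simplified stand-in for the AST inputs a
    transformer encoder would embed.
    """
    tree = ast.parse(code)
    tokens = []
    for node in ast.walk(tree):
        tokens.append(type(node).__name__)  # e.g. "Assign", "Call"
        if isinstance(node, ast.Name):
            tokens.append(node.id)          # variable names like "df"
        elif isinstance(node, ast.Attribute):
            tokens.append(node.attr)        # method names like "read_csv"
    return tokens

# A typical data-science snippet becomes a token sequence containing
# "Module", "Assign", "df", "read_csv", ...
tokens = ast_token_sequence("df = pd.read_csv('data.csv')")
```

In practice a sequence like this (possibly paired with the natural language text of nearby markdown cells) would be fed to the transformer to produce a fixed-length embedding.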


Benchmark Evaluation Task

We present a corpus of 100 expert-annotated notebooks and an accompanying task for labeling computational notebook cells as stages in the data analysis process.
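The labeling task assigns each notebook cell one of the five analysis stages named in the abstract: data import, wrangling, exploration, modeling, and evaluation. As a rough illustration of the task (and of the kind of keyword heuristic a model must outperform), here is a toy baseline labeler; the keyword rules are hypothetical and are not the expert-supplied heuristics evaluated in the paper.

```python
# Toy keyword heuristic for tagging notebook cells with an analysis stage.
# The rules below are illustrative assumptions, not the paper's heuristics.
STAGE_KEYWORDS = {
    "import":   ("read_csv", "load", "open("),
    "wrangle":  ("dropna", "merge", "fillna", "groupby"),
    "explore":  ("describe", "plot", "hist", "head("),
    "model":    ("fit(", "train", "LinearRegression"),
    "evaluate": ("score", "accuracy", "confusion_matrix"),
}

def label_cell(source: str) -> str:
    """Return the first stage whose keywords appear in the cell source."""
    for stage, keywords in STAGE_KEYWORDS.items():
        if any(keyword in source for keyword in keywords):
            return stage
    return "unknown"
```

For example, `label_cell("df = pd.read_csv('x.csv')")` returns `"import"`, while a cell that matches no rule falls through to `"unknown"`; brittleness like this is why the paper's weakly supervised model improves substantially over heuristic labeling.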



Abstract: Large scale analysis of computational notebooks holds the promise of better understanding the data science process, quantifying differences between scientific domains, and providing insights to the builders of scientific toolkits. However, such notebooks have remained largely unanalyzed at scale, as labels are absent and require expert domain knowledge to generate. We present a new classification task for labeling computational notebook cells as stages in the data analysis process (i.e., data import, wrangling, exploration, modeling, and evaluation). For this task, we propose a novel weakly supervised transformer architecture for computing joint representations of data science code from both abstract syntax trees and natural language annotations. We show that our model, leveraging only easily-available weak supervision, achieves a 35% increase in accuracy over expert-supplied heuristics. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. In the largest analysis of scientific code to date, we relate our public dataset of notebooks to a large corpus of academic articles, finding that notebook characteristics significantly correlate with the citation count of corresponding papers.

Data and Code


  • Expert-Annotated Jupyter Notebooks: Available on GitHub.
  • GORC Corpus: Available here on AI2's GitHub repository. Additional code for pre-processing it in the context of CORAL can be found in our project's repo.
  • UCSD Jupyter Notebook Corpus: Available from the original researchers here.

All code used in this publication is available on GitHub.


This work was supported by the National Science Foundation (NSF-IIS Large #1901386): Analysis Engineering for Robust End-to-End Data Science.