Data collection and analysis have always been key components of the scientific process. Although an increasingly large volume of data science code has become available online in recent years, these artifacts have remained largely unanalyzed at scale, as processing them requires domain-specific expertise.
CORAL is a weakly supervised transformer architecture designed to "understand" computer code and answer previously unanswerable questions about scientific analysis.
CORAL provides an extendable framework and accompanying Python library for embedding arbitrary code snippets for static analysis.
We present a corpus of 100 expert-annotated notebooks and an accompanying task for labeling computational notebook cells as stages in the data analysis process.
Abstract: Large-scale analysis of computational notebooks holds the promise of better understanding the data science process, quantifying differences between scientific domains, and providing insights to the builders of scientific toolkits. However, such notebooks have remained largely unanalyzed at scale, as labels are absent and require expert domain knowledge to generate. We present a new classification task for labeling computational notebook cells as stages in the data analysis process (i.e., data import, wrangling, exploration, modeling, and evaluation). For this task, we propose a novel weakly supervised transformer architecture for computing joint representations of data science code from both abstract syntax trees and natural language annotations. We show that our model, leveraging only easily available weak supervision, achieves a 35% increase in accuracy over expert-supplied heuristics. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. In the largest analysis of scientific code to date, we relate our public dataset of notebooks to a large corpus of academic articles, finding that notebook characteristics significantly correlate with the citation count of corresponding papers.
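To make the two input streams concrete, the sketch below shows one plausible way to split a notebook cell into the kinds of features the abstract describes: a sequence of abstract-syntax-tree node types on one side and natural-language tokens (here, drawn from inline comments) on the other. This is an illustrative approximation using Python's standard `ast` module, not CORAL's actual preprocessing pipeline; the function name `cell_features` is hypothetical.

```python
import ast
import re

def cell_features(source: str):
    """Split a code cell into (1) AST node-type tokens and (2)
    natural-language tokens from comments -- the two modalities a
    joint code/NL model consumes. Illustrative sketch only."""
    # AST side: a linear sequence of node-type names from a tree walk.
    tree = ast.parse(source)
    ast_tokens = [type(node).__name__ for node in ast.walk(tree)]
    # NL side: lowercase words extracted from inline comments.
    comments = re.findall(r"#\s*(.*)", source)
    nl_tokens = [w.lower() for c in comments
                 for w in re.findall(r"[A-Za-z]+", c)]
    return ast_tokens, nl_tokens

# Example: a typical "data import" cell.
cell = """
# load the survey data
import pandas as pd
df = pd.read_csv("survey.csv")
"""
ast_toks, nl_toks = cell_features(cell)
```

For the example cell, `ast_toks` contains structural tokens such as `Import`, `Assign`, and `Call`, while `nl_toks` contains the comment words ("load", "the", "survey", "data") that hint the cell belongs to the data-import stage.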
All code used in this publication is available on GitHub.
This work was supported by the National Science Foundation (NSF-IIS Large #1901386): Analysis Engineering for Robust End-to-End Data Science.