Pipeline Description

This notebook reads the pipeline diagram, which was created in DrawIO and exported to PDF, and displays it.

The raw DrawIO file is available here. The resulting PDF is rendered below for illustration.

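from pdf2image import convert_from_path

# Rasterize the exported PDF at 500 DPI; returns one PIL image per page.
pages = convert_from_path('plot_out/updated-pipeline-diagram.pdf', dpi=500)

# The diagram is on the first page; evaluating it displays the image inline.
pipeline_description = pages[0]
pipeline_description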

Reproduces Figure 1 of the paper:

Description:

The figure is a flow chart of the data processing pipeline presented in Sections 3.1, 3.2, and 3.3 of the paper. It shows the three stages of the pipeline as three large, vertically aligned boxes, color coded by stage.

The first stage starts by choosing 100K notebooks at random from the JetBrains dataset and running nbformat over them to check their validity. The 99441 notebooks that pass the validity check are filtered for Python, resulting in 92050 Python notebooks.

These notebooks proceed to the next stage of the pipeline, indicated by a yellow box. Here, the code cells from the source code are selected, as are the notebooks that programmatically generate images. The 39540 selected notebooks yield 342722 images, which result in 34 output types and are classified into 28 categories at later stages of the pipeline. The source code from these notebooks is also analyzed further.

The final data enrichment stage uses the 100K notebooks from the initial stage and uses nbconvert to export them to HTML. Applying six themes yields 589746 HTML files, over which the aXe and HTMLCS accessibility engines are run, resulting in a total of 238675580 errors, warnings, and notices. The same HTML files are used to extract heading- and table-related information.
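The counts in the description can be cross-checked with a little arithmetic. The sketch below is not part of the original notebook. It copies the numbers from the description, treats "100K" as exactly 100000, and uses hypothetical names (`share`, `summary`) chosen for illustration. The per-theme figure assumes the HTML files are spread evenly across the six themes, which the description does not state.

```python
# Counts copied from the figure description.
SAMPLED = 100_000        # "100K" notebooks; assumed to mean exactly 100000
VALID = 99_441           # notebooks passing the nbformat validity check
PYTHON = 92_050          # valid notebooks that are Python notebooks
IMAGE_NOTEBOOKS = 39_540 # notebooks from which images are extracted
IMAGES = 342_722         # extracted images
THEMES = 6               # themes applied during HTML export
HTML_FILES = 589_746     # exported HTML files
ISSUES = 238_675_580     # errors, warnings, and notices from aXe and HTMLCS


def share(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


summary = {
    # Retention at each filtering step of the first stage.
    "valid_pct_of_sampled": share(VALID, SAMPLED),
    "python_pct_of_valid": share(PYTHON, VALID),
    # Second stage: notebooks with extracted images and images per notebook.
    "image_notebooks_pct_of_python": share(IMAGE_NOTEBOOKS, PYTHON),
    "images_per_image_notebook": round(IMAGES / IMAGE_NOTEBOOKS, 2),
    # Third stage: assumes an even split of HTML files across themes.
    "html_files_per_theme": HTML_FILES / THEMES,
    "issues_per_html_file": round(ISSUES / HTML_FILES, 2),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Under the even-split assumption, the HTML file count divides cleanly by the number of themes, which is a quick consistency check on the reported totals.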