A large-scale study about quality and reproducibility of Jupyter notebooks

Joao Felipe Pimentel, Leonardo Murta, Vanessa Braganholo, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and all sorts of rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior and encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we studied 1.4 million notebooks from GitHub. We present a detailed analysis of their characteristics that impact reproducibility. We also propose a set of best practices that can improve the rate of reproducibility and discuss open challenges that require further research and development.

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019
Publisher: IEEE Computer Society
Pages: 507-517
Number of pages: 11
ISBN (Electronic): 9781728134123
DOI: https://doi.org/10.1109/MSR.2019.00077
State: Published - May 1 2019
Event: 16th IEEE/ACM International Conference on Mining Software Repositories, MSR 2019 - Montreal, Canada
Duration: May 26 2019 to May 27 2019

Publication series

Name: IEEE International Working Conference on Mining Software Repositories
Volume: 2019-May
ISSN (Print): 2160-1852
ISSN (Electronic): 2160-1860

Conference

Conference: 16th IEEE/ACM International Conference on Mining Software Repositories, MSR 2019
Country: Canada
City: Montreal
Period: 5/26/19 to 5/27/19

Fingerprint

Visualization
Industry

Keywords

  • GitHub
  • Jupyter notebook
  • Reproducibility

ASJC Scopus subject areas

  • Computer Science Applications
  • Software

Cite this

Pimentel, J. F., Murta, L., Braganholo, V., & Freire, J. (2019). A large-scale study about quality and reproducibility of Jupyter notebooks. In Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019 (pp. 507-517). [8816763] (IEEE International Working Conference on Mining Software Repositories; Vol. 2019-May). IEEE Computer Society. https://doi.org/10.1109/MSR.2019.00077

@inproceedings{87638b4995514ca398cd87e2ac111fcf,
title = "A large-scale study about quality and reproducibility of Jupyter notebooks",
abstract = "Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and all sorts of rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior and encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we studied 1.4 million notebooks from GitHub. We present a detailed analysis of their characteristics that impact reproducibility. We also propose a set of best practices that can improve the rate of reproducibility and discuss open challenges that require further research and development.",
keywords = "GitHub, Jupyter notebook, Reproducibility",
author = "Pimentel, {Joao Felipe} and Leonardo Murta and Vanessa Braganholo and Juliana Freire",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/MSR.2019.00077",
language = "English (US)",
series = "IEEE International Working Conference on Mining Software Repositories",
publisher = "IEEE Computer Society",
pages = "507--517",
booktitle = "Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019",

}

TY  - GEN
T1  - A large-scale study about quality and reproducibility of Jupyter notebooks
AU  - Pimentel, Joao Felipe
AU  - Murta, Leonardo
AU  - Braganholo, Vanessa
AU  - Freire, Juliana
PY  - 2019/5/1
Y1  - 2019/5/1
AB  - Jupyter Notebooks have been widely adopted by many different communities, both in science and industry. They support the creation of literate programming documents that combine code, text, and execution results with visualizations and all sorts of rich media. The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior and encourages poor coding practices, and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we studied 1.4 million notebooks from GitHub. We present a detailed analysis of their characteristics that impact reproducibility. We also propose a set of best practices that can improve the rate of reproducibility and discuss open challenges that require further research and development.
KW  - GitHub
KW  - Jupyter notebook
KW  - Reproducibility
UR  - http://www.scopus.com/inward/record.url?scp=85072330312&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85072330312&partnerID=8YFLogxK
U2  - 10.1109/MSR.2019.00077
DO  - 10.1109/MSR.2019.00077
M3  - Conference contribution
AN  - SCOPUS:85072330312
T3  - IEEE International Working Conference on Mining Software Repositories
SP  - 507
EP  - 517
BT  - Proceedings - 2019 IEEE/ACM 16th International Conference on Mining Software Repositories, MSR 2019
PB  - IEEE Computer Society
ER  -