ML2R Coding Nuggets: Reproducible Machine Learning Experiments


ABSTRACT
The scientific areas of artificial intelligence and machine learning are rapidly evolving, and their discoveries drive scientific progress in fields ranging from physics and chemistry to the life sciences and the humanities. But machine learning is facing a reproducibility crisis that clashes with the core principles of the scientific method: with the growing complexity of methods, it is becoming increasingly difficult to independently reproduce and verify published results and to compare methods fairly. One possible remedy is maximal transparency with regard to the design and execution of experiments. For this purpose, this Coding Nugget summarizes best practices for handling machine learning experiments. In addition, meticulous-ml [17], a convenient and simple library for tracking experimental results, is introduced in the final hands-on section.

1 REPRODUCIBLE MACHINE LEARNING

The release of data and code has been declared a necessary condition for scientific publication [15]. Unfortunately, in machine learning research, unpublished code and sensitivity to exact training conditions make many claims hard to verify [8]. Thus, most computational research results presented today at conferences and in publications cannot be verified [3]. Reproducing ML experiments can be necessary because of biases, scientific diligence, fraud [4], or as a means of comparison. There is also the suspicion that the widespread pressure for high publication rates encourages misleading discoveries [20]. It seems that addressing reproducibility issues requires a transformation of publication culture [12]. Accordingly, reproducibility in machine learning and data science has gained importance at top conferences over the last decade [1, 8, 18, 21].

There is no universally agreed terminology for reproducibility. Some distinguish reproducibility from replicability [12]: reproducibility focuses on recreating results using the original code, while replication generates results independently of the original data [21]. Others differentiate reproducibility of results from reproducibility of findings [1]. Reproducibility of results refers to the replication of the generated numbers, whereas reproducibility of findings concerns the validity of the experimental conclusions. Code sharing is an important practice for improving reproducibility of results, but it is insufficient for obtaining reproducibility of findings. Some researchers argue that appropriate experimental design is key to achieving adequate overall model performance and high reproducibility of findings [1, 13, 14]. In this paper, reproducibility of results is the main topic. For this purpose, common best practices for code sharing, result tracking and archiving, and proper experimental design are summarized. Furthermore, a solution in the form of the Python library meticulous-ml, created by Ashwin Paranjape, is presented.

2 BEST PRACTICES

To ensure reproducibility of results in machine learning experiments, we suggest the following best practices.

Version Control of Source Code. All source code written to execute the experiments should be under version control. The de-facto standard tool for version control is Git [22], although alternatives exist. Code should only be run from clean Git repositories, i.e. repositories with no uncommitted changes. This makes it possible to associate the exact state of the code with an experiment run via the unique commit identifier provided by the version control software. Consequently, to replicate an experimental result, we can revert the code to the exact version used. A minimal sketch of such a check is given below.
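The following sketch illustrates the clean-repository check and commit capture described above. It is not part of any particular library; it only assumes that the script is started inside a Git working copy and that the git command line tool is available.

    import subprocess

    def current_commit():
        """Return the current commit hash, refusing to run on uncommitted changes."""
        status = subprocess.run(["git", "status", "--porcelain"],
                                capture_output=True, text=True, check=True).stdout.strip()
        if status:
            raise RuntimeError("Refusing to run: the repository has uncommitted changes.")
        return subprocess.run(["git", "rev-parse", "HEAD"],
                              capture_output=True, text=True, check=True).stdout.strip()

    if __name__ == "__main__":
        # Store this identifier together with all outputs of the experiment run.
        print("Running experiment at commit", current_commit())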
Version Control of all Dependencies. Most machine learning experiments rely heavily on software libraries such as Tensorflow or Scikit-learn. These libraries evolve rapidly and new versions are released frequently, so it is important to keep track of which library versions were used in an experiment. It is generally recommended to use one environment per project, using tools like Venv or Anaconda for Python environments or Docker [2] on the operating system level, and to specify which libraries should be installed in the respective config files. However, we should not rely on these config files alone for ensuring reproducibility: they often allow vague version constraints (e.g. "newer than v1.2"), and a user can update a library without also editing the environment config file. Hence we should also capture the software versions at runtime.

Version Control of Data. There are fewer established software solutions for version control of data, although research data management is becoming increasingly important in science and more protocols are being established in institutions [24]. For instance, open science data is often published and associated with a unique digital object identifier (DOI). Locally, all scripts that access, filter, or preprocess the data should also be under version control. Whenever possible, we should include a script for obtaining the original data; alternatively, we can maintain a description of how and where to obtain it. Intermediate data files, e.g. preprocessed data, should carry metadata that contains, among other things, a timestamp, a reference to the Git commit of the preprocessing script that was used, and references to all input files used to generate the output. Many file formats support attaching metadata: in .json or .hdf5 files a dedicated metadata field can be introduced, and .csv files can begin with a comment section. If this is not possible, a metadata file should be stored next to the data file and copied around together with the data.

Tracking of all Hyperparameters. Machine learning methods, particularly deep learning approaches, have many hyperparameters, and much time is spent tuning them to maximize performance, either manually or automatically. Thus, we need to track all these hyperparameters to replicate the experiment later. When we are interested in reproducing the results of randomized algorithms, it is important to also track the seeds used for the random number generator.

Tracking of all Results. We want to archive all outcomes of an experiment. That includes every metric we evaluate to judge the quality of a machine learning model, including runtime, but also other outputs such as the model itself or any other result files. It is important that each result or output can be associated with the corresponding experiment. It is often also useful to capture all console output and error messages in text files. A minimal sketch that combines these tracking steps by hand is given below.

Reproducibility of Findings. For maximum reproducibility of findings, experiments should be carried out multiple times with different initializations and in different environments. These practices support claims with sufficient statistical significance [7]. For training, unbiased data in large quantities should be used. Negative outcomes of an experimental setup should also be published [19]. These measures potentially highlight the pros and cons of a model and improve the understanding of how and when it performs better.
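The following sketch combines the dependency, hyperparameter, and result tracking described above in plain Python, without any tracking library. The parameter values and the accuracy entry are placeholders, and the package names passed to importlib.metadata are assumptions chosen for illustration.

    import json
    import random
    import time
    from importlib.metadata import version  # Python 3.8+

    import numpy as np

    # Hyperparameters of a hypothetical experiment, including the random seed.
    params = {"n_estimators": 100, "max_depth": 8, "seed": 42}
    random.seed(params["seed"])
    np.random.seed(params["seed"])

    # Capture the library versions actually installed at runtime,
    # instead of trusting the environment config file.
    environment = {pkg: version(pkg) for pkg in ("numpy", "scikit-learn")}

    # ... run the experiment here; the metric below is a placeholder ...
    results = {"accuracy": 0.0}

    # Archive hyperparameters, environment, and results with a timestamp.
    record = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
              "params": params,
              "environment": environment,
              "results": results}
    with open(f"run_{int(time.time())}.json", "w") as f:
        json.dump(record, f, indent=2)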
Use Experiment Tracking Software. All of the above information needs to be tracked, linked, and archived. For this purpose, software solutions such as Sacred [6], MLflow [10], Tensorboard, Wandb [23], Theano [11], or Gym [16] exist and should be used. They provide convenient solutions for reproducibility that do not require changing large amounts of code. When we run machine learning experiments on cloud platforms, these often provide their own tools for reproducibility [9]. In the next section, we present another software solution, meticulous-ml, in greater detail.

3 THE meticulous-ml LIBRARY FOR PYTHON

In this section, we present the Python library meticulous-ml [17], originally written by Ashwin Paranjape, which supports machine learning researchers by handling many of the requirements for reproducible research established above while requiring only minimal changes to existing experiment scripts. Perhaps most importantly, it is, as Hady Elsahar puts it, "suitable for the messy, clueless nature of research" [5]. Listing 1 shows an example of tracking a simple machine learning experiment, training and evaluating a random forest classifier; we then discuss the highlighted, crucial changes needed to incorporate meticulous. First, to install meticulous-ml, we recommend using pip:
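Assuming the library is published on PyPI under the same name, the installation step would be:

    pip install meticulous-ml

The original Listing 1 is not reproduced in this excerpt. As a rough, hedged sketch of what such a tracked random forest experiment could look like, the snippet below wires meticulous-ml into an argparse-based script; the calls Experiment.add_argument_group, Experiment.from_parser, and summary are recalled from the project's documentation and should be verified against it before use.

    import argparse

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    from meticulous import Experiment  # assumed import path, check the docs

    parser = argparse.ArgumentParser(description="Random forest experiment")
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--seed", type=int, default=42)
    Experiment.add_argument_group(parser)  # assumed API: adds meticulous options
    args = parser.parse_args()

    # Assumed API: creating the experiment records the Git state,
    # the command line arguments, and the Python environment.
    experiment = Experiment.from_parser(parser)

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=args.seed)
    clf = RandomForestClassifier(n_estimators=args.n_estimators, random_state=args.seed)
    clf.fit(X_train, y_train)

    # Assumed API: store the final metric with the experiment record.
    experiment.summary({"accuracy": clf.score(X_test, y_test)})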