All you need to know about reproducibility in data science as a newbie

2021-03-06 00:00:00 +0000

If you are a wet-lab practitioner in analytical chemistry, have you ever found yourself unable to regenerate the same results you got last week, right in the middle of writing the lab report for your next stakeholder meeting? You might want to find out whether the inconsistency comes from a reagent being changed without notice, or from simple randomness.

Alternatively, you and a colleague are analyzing the results of your latest experiment. You are working in parallel, but after running your R script, your colleague is unfortunately unable to reproduce what you reported in your conclusion. The conversation can quickly go stale if neither of you can reproduce the other's results.

Whichever of the scenarios above you land in, the underlying issue is reproducibility in science. You might think of it as others reproducing your lab results on their end, or as yourself reproducing them after some time has passed. Those lessons come from the so-called wet labs of microbiology and chemistry. What about computational fields, say data science or machine learning, those buzzwords of today? Would it be much easier to run a process reproducibly there than in a wet-lab setting? How much effort should we put into improving reproducibility? I'm glad you asked, and this beginner's guide to reproducibility has you covered.

Interestingly, the more computational or data-intensive an analysis gets, the harder reproducibility becomes. Manipulating ones and zeros may sound easier to redo than reconducting experiments in a chemistry lab, but even though everything is encoded digitally, that does not turn out to be the case; several published accounts have described exactly these issues.

Irreproducibility, if not handled well, can cause research scandals, and it can become a very serious issue for the entire enterprise of science. What does this mean for data science in the context of public research, specifically for a beginner? It still plays a paramount role, even if it is not yet always recognized as such. I'd like to break this down as follows:

  • Internal stakeholders: whether they are your research supervisor or your experiment colleagues, they will require some degree of reproducibility in a timely manner. You will need it to defend your claims at any decision-making stage, and you are responsible for being able to rerun the whole process soundly. In the end, you want people to follow your work easily and agree with your conclusions, right? Internal communication also covers work handover, for example when you move on to another job and step out of the project entirely. Sound, reproducible work makes that transition much easier.
  • External stakeholders: since business decisions are increasingly based on the results of a series of well-designed experiments, those experiments have to meet the same standards of validity as basic research does. The whole project must therefore satisfy the compliance requirements and audits imposed by external stakeholders.

At this stage, you should be convinced of how critical reproducibility is in data science, and how irreproducibility can cause trouble, or even a crisis, in the scientific knowledge-building process. Luckily, there are numerous resources for making work reproducible. For now, I'd like to offer a few tips as a beginner's cookbook of good practices for a new data scientist.

  • Virtual environment: it is always good practice to write a proper environment file that pins your dependencies to specific versions. If you are not familiar with these, think of dependencies as tools that come in different versions and allow you to build a project, and of an environment as a separate bag holding one set of those tools. If you need more details on this, stay tuned for my next blog!
  • Automation: copy-pasting by hand can feel convenient, but it is also a top source of errors and discrepancies, and manual modification of records can cause hidden irreproducibility. To tackle this, implement automation in your project. A Makefile is a useful tool that lets you chain all your commands together and rebuilds targets only when their dependencies change. In the end, one simple command can save you time, and potentially computing cost.
  • Documentation, documentation, and documentation: the importance of good documentation can never be emphasized too much. Whether it is for the person working alongside you, someone you hand the project to, or, most likely, your future self, well-maintained documentation of your code snippets, workflow, and so on will save lots of headaches! It can cover everything, but it should crucially include concise, precise descriptions of how the functions in your scripts work and how you or anyone else can run the project with ease. You should also avoid over-documentation, which I will talk more about in future blogs. (Stay tuned!)
  • Storytelling: ultimately, the purpose of a data science project, or any project, is to convey a conclusion or rationale to an audience. Reproducibility is also about ensuring that the story drawn from the analysis is replicable and persists in the minds of stakeholders. A compelling story achieves this, and people are far more likely to act on such projects and build on the knowledge. This leans more towards interpretability, turning complex mathematics into an actionable, rich experience; I may dive into it more in future posts.
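To make the virtual-environment tip concrete, here is a minimal sketch of a conda-style environment file with pinned versions. The project name, packages, and version numbers are purely illustrative; pick the ones your own project actually uses.

```yaml
# environment.yml -- illustrative example, not a recommendation
name: my-analysis        # hypothetical project name
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas=1.5.3
  - scikit-learn=1.2.2
```

Anyone with this file can then recreate the same "bag of tools" with `conda env create -f environment.yml`, instead of guessing which versions you had installed.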
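For the automation tip, a minimal Makefile for a hypothetical two-step R pipeline might look like the sketch below. The script and file names (`clean.R`, `raw.csv`, `report.Rmd`) are made up for illustration.

```make
# Makefile -- illustrative pipeline; script and file names are hypothetical
all: report.html

# clean.csv is rebuilt only when raw.csv or clean.R changes
clean.csv: raw.csv clean.R
	Rscript clean.R raw.csv clean.csv

# the report is re-rendered only when its inputs change
report.html: clean.csv report.Rmd
	Rscript -e "rmarkdown::render('report.Rmd')"

.PHONY: all
```

Typing the single command `make` then reruns only the steps whose dependencies changed, rather than the whole pipeline by hand.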
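And for the documentation tip, even a small helper function benefits from a docstring stating what goes in and what comes out. A sketch in Python, with a hypothetical function for screening lab measurements:

```python
def filter_in_range(values, lower, upper):
    """Keep only the measurements that fall within [lower, upper].

    Parameters
    ----------
    values : list of float
        Raw measurements, e.g. instrument readings.
    lower, upper : float
        Inclusive bounds for a plausible reading.

    Returns
    -------
    list of float
        The measurements with out-of-range values dropped.
    """
    return [v for v in values if lower <= v <= upper]

print(filter_in_range([0.2, 4.9, 7.5, 12.0], lower=1.0, upper=10.0))
# prints [4.9, 7.5]
```

A colleague (or your future self) can now call `help(filter_in_range)` and rerun the analysis without reverse-engineering the code.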

Thanks for reading this far. If this blog gave you sore eyes (I hope not :p), the take-home message is: reproducibility is paramount in data science, not only so that others can run your code and get similar results, but so that they can reach the same conclusions, and so the work persists in both computer and human memory.
