Do you want to do data analysis reproducibly? Come to The Hacker Within!
The Hacker Within is an informal peer learning group for sharing skills and best practices for research computing and data science. The group meets monthly and invites talks from its members. If you would like to join the mailing list, please email bearsoftware@contacts.bham.ac.uk .
By guest author Flaviu Gostin
(a regular attendee at THW!)
I have been working in research for 12 years. For the last 11 years, I experienced various levels of frustration with doing and managing data processing and analysis. My frustration peaked during my most recent project, in which I had to work with far more data and do far more analysis than in any previous project. Having almost no programming experience, I was totally dependent on mostly proprietary GUI-based (graphical user interface) software such as Excel and other, more specialized programs. No single program could do everything I needed, so my analysis was fragmented across more than five different programs, which made it very difficult to keep track of and reproduce my workflow, not to mention grossly inefficient. I naturally turned to programming to automate some steps. At first I was reluctant to start learning, because of lack of time and the preconceived idea that one needs years to be able to program anything useful. However, I chose an easy language called Python, and to my surprise I was writing my first useful scripts after just a few weeks. That eased the pain a bit, but I was still far from a pain-free experience.
In fact, over the last year I gradually discovered that the computing part of my data analysis can be not only pain-free, but enjoyable! It all started with the first Hacker Within talk, given by Matthew Brett in March 2018. (By the way, "hacker" does not mean "cracker", which is what Hollywood films show.) That talk had a huge impact on me. I was introduced to a different world, one in which data analysis computing is as it should be: simple, reliable, reproducible. Note that simple is not synonymous with easy. GUI-based software like Excel makes things very easy to do, but it also makes them complicated and makes mistakes very easy to commit. Catastrophic mistakes: see Mike Croucher’s talk (linked to below).
The command line and a language like Python can be hard at the beginning, but they keep things simple and clear. In time, they also become easy to work with and extremely efficient. So, after this first talk I switched from Windows to GNU/Linux, tried to use the command line as much as possible and continued to learn Python when I had the time. In subsequent Hacker Within talks I learned about version control (I love version control with git), testing, open data practices etc. I was discovering useful, simple tools and methods, borrowed from the software development industry, that dramatically increase computational reproducibility, automation and efficiency.
It wasn’t until I built my first actual reproducible workflow that I truly felt the full power of these practices: https://github.com/craicrai/xrd_analysis_workflow . Absolutely all data analysis is done using Python commands saved in several script files. Anyone can review every single step. Also, since this workflow is saved digitally (not dispersed in a lab book or, worse, just in a researcher’s memory), it can easily be repeated with simple commands, and it can be modified and built on in future projects. It’s done, it’s there, I can go back to it anytime and I can get credit for it after sharing it in open data repositories like Zenodo.
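To give a flavour of what "analysis saved in a script" means, here is a minimal sketch. It is not taken from the repository linked above: the function names, the `sample` and `value` column names and the example numbers are all hypothetical, chosen only for illustration. The point is that every step from raw data to result is written down and can be rerun by anyone.

```python
import csv
import io
import statistics


def load_measurements(csv_text):
    """Parse CSV text with (hypothetical) 'sample' and 'value' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["sample"], float(row["value"])) for row in reader]


def summarise(measurements):
    """Compute a few summary numbers from (sample, value) pairs."""
    values = [value for _, value in measurements]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def run_workflow(csv_text):
    """Run every step in order: load the raw data, then summarise it."""
    return summarise(load_measurements(csv_text))


if __name__ == "__main__":
    # Made-up example data; a real workflow would read a raw data file instead.
    example = "sample,value\nA,1.0\nB,2.0\nC,6.0\n"
    print(run_workflow(example))
```

Saved in a file, a script like this can be rerun with a single command, put under version control with git, and read line by line by a reviewer, which is exactly what a sequence of mouse clicks in a GUI cannot offer.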
I can definitely say that the THW talks have had a strong impact on the way I think about and do the computing side of data analysis. I am very much looking forward to the next ones, and I strongly recommend them to anyone who works with data on a regular basis.
By the way, aside from the THW presentations, the two most useful resources for me were this paper and the Software Carpentry website.