Why you should blog if you are a data scientist

Author

Lucas A. Meyer

Published

February 13, 2019

He also followed it up with an excellent blog post aimed at aspiring data scientists. I think that the most important idea from that post is actually from a presentation he gave. Here’s the key idea, as reported by Amelia McNamara:

In summary, things that you keep to yourself have very little value, and things that you share with the world have a lot of value. I think that’s a very cool idea for the economics of information. For an extreme case, imagine you have a trade secret that allows you to solve a specific class of problems faster than others. In this case, it’s understandable that you want to keep the trade secret to yourself. But it still makes sense to advertise to the world that you can solve some kinds of problems faster than others. That makes your trade secret more valuable.

The discipline of writing about what you’re doing

When I was a wee little kid and had just entered college, one of my first classes was “Physics Lab”, and the first session of that class was to measure “gravity”, or more specifically, the gravitational acceleration. Of course, mathy Computer Science students that we were, we all knew that the gravitational acceleration would be $$g \approx 9.8\ \mathrm{m/s^2}$$.

Measuring it, however, is not very easy. Remember that this was the early 90s in Brazil, so digital cameras were very rare. Part of the problem is that an object dropped from rest falls about 4.9 meters (roughly 16 feet) in its first second, so if we wanted our experiment to take around 1s, we would need a big ladder. Or, as it happened, we’d need to run an experiment that took less than a second for each run. In a class with 25 students, that was the preferred route.

We set up a vertical track attached to a device that would spark every 1/60 of a second, and we attached grid paper to the track. This is called a Behr free fall apparatus.

We would release the device from the top of the track, and the sparks would mark the grid paper. We would then manually measure the distances between the sparks, and the differences between those distances would tell us the acceleration. In theory. Again, remember this was the early 90s, with no Windows 95 or Excel easily available, so several steps were prone to error.
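For readers who want to see the arithmetic, here is a minimal sketch of that calculation in Python. It is not what we did in the lab (we did it by hand). The function name `estimate_g` and the sample marks are made up for illustration, and the marks are synthetic values generated from $$s = \frac{1}{2} g t^2$$, not real measurements. The idea is that, under constant acceleration, each gap between consecutive sparks grows by $$g \, \Delta t^2$$ compared with the previous gap.

```python
# Spark interval from the text: one spark every 1/60 of a second.
DT = 1.0 / 60.0


def estimate_g(positions_m, dt=DT):
    """Estimate a constant acceleration from equally spaced spark marks.

    positions_m: distance of each mark from the first mark, in meters.
    (Hypothetical helper, written for this post only.)
    """
    if len(positions_m) < 3:
        raise ValueError("need at least three marks to see a change in spacing")

    # First differences: distance covered during each spark interval.
    gaps = [b - a for a, b in zip(positions_m, positions_m[1:])]

    # Second differences: how much each gap grew relative to the previous one.
    # For constant acceleration a, each of these equals a * dt**2.
    growth = [b - a for a, b in zip(gaps, gaps[1:])]

    mean_growth = sum(growth) / len(growth)
    return mean_growth / dt**2


# Synthetic marks (an assumption, not lab data): ideal free fall with g = 9.8.
marks = [0.5 * 9.8 * (i * DT) ** 2 for i in range(20)]
print(estimate_g(marks))
```

With ideal marks the estimate comes back very close to 9.8. With hand-measured marks it will not, because dividing by $$\Delta t^2$$ multiplies any ruler error by 3600. Averaging the second differences is only one simple choice; fitting a curve to all the marks would probably be more forgiving.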

The desired outcome was not only to calculate the acceleration due to gravity, but also to generate a lab report.

Writing the lab report

It was a simple experiment, but there was a twist. Another class of 25 students would have to replicate the experiment following the lab reports from the first 25. Oh boy. We quickly found out that the best lab reports were the ones in which the experimenter documented each step while carrying it out. Another thing that worked was going through the experiment more than once. What did not work was to perform the experiment and then go to where the computers were to type up a report from memory.

I thought that was very insightful. Shortly afterwards, we would learn about Donald Knuth’s proposed paradigm of literate programming, which lives on in Jupyter Notebooks and R Markdown, both very popular in data science today. Behind it is the same concept: explanations interspersed with technical work, to produce something that is reproducible and easy to understand.