It’s easy to become strongly emotionally invested in the programming languages we use. We have spent countless hours developing proficiency, and they let us express complex ideas. Like human languages, we may find it more comfortable or cumbersome to express critical ideas. Even though I am a native French speaker, discussing science in any precise way comes naturally to me in English. This is not a shortcoming of any of the languages! It is merely a reflection of our own affinities with the limited tools with which we express our thoughts.
And this is why, when collaborating on data projects, the precise programming language that any group member uses is entirely irrelevant. I had this discussion on several occasions these past few months, and so I wanted to clarify some points.
In a data project, what we really care about is the data. And although this sounds obvious, we rarely reflect this on our work. Instead, it is often easier to emphasize code and to see data as a by-product of it. This causes potentially high friction, as it makes collaborations in groups where not everyone uses the same language (or the same dialect within a language…) extremely difficult.
But this should not be! Whether a file was produced using Python or Julia or R makes zero difference to its content, as long as the code is correct. It similarly makes zero difference for the next step, which can be in any language we like. Write a plotting system in brainfuck, I don’t care.
Reaching this point is in itself an exciting challenge because it demands to establish trust between group members, and it places the social dimension of collaborative research front and center. This requires that we trust our colleagues to do their thing well and that they similarly trust us. Absent this reciprocal trust, accepting that we will develop work based on a superficial knowledge of some of the parts is unwise.
Working this way has one significant advantage: everyone needs to communicate clearly about data. I am a big fan of the idea that clarifying the data structure for a project should be done as one of the first steps because it ensures that everyone is clear about what information is required, how it is organized, and how the different pieces of data fit together. Developing a data format is actually the first exercise in my machine learning for biology class. When we reach a consensus on the shape that data will have, it becomes easier to develop the pieces of code to assemble the data independently.
Personally, I am excited to see the development of projects where multiple languages are blended because we get all of the advantages of using the right tool to express the right idea. We can similarly increase the number of contributors to a project. This is a fantastic opportunity and one that I am looking forward to exploring more into my own teaching.