by John Zhang
For many scientists who code, whether for data crunching or prototyping, reliability, efficiency and readability are often not a priority. After all, the goal is to produce results for publication. It’s tempting to cut corners on code quality when a paper deadline looms, but increasingly, this is changing. There are growing demands for the source code behind publications to be made available for verification. Releasing software may also lead to more citations, as it breaks down a major barrier to reuse, not to mention the inherent benefits of maintaining good code, such as improved reusability and reliability (for yourself and others) and easier debugging.
I invite you to consider the following suggestions from the world of software engineering. Following them may prevent mistakes and make your life easier when collaborators wish to build upon your results or when you revisit old experiments while preparing your thesis. If you are a PhD student weighing your options after graduation, these suggestions may also be helpful should you consider a career in the software industry. Many companies are eager to hire developers with PhDs to help them pursue a hybrid approach to research, where the distinction between engineer and research scientist is blurred. Being at least aware of good coding habits can help you avoid the reputation academics have as sloppy coders.
Back Up Your Code with Version Control
Version control systems such as SVN and Git (the flavor du jour) can help scientists back up their work. Think of them as Dropbox for programmers (although version control came first). They maintain different versions of your work, force you to document your changes as you save (“commit”) them and allow for easy sharing (“cloning”) and creation of multiple versions (“forking”). Because coding may produce automatically generated files on the side, or changes may happen in waves, using these systems is usually considerably easier and more efficient than creating manual backups or simply saving code in Dropbox. The software itself is usually free (both Git and SVN are), and some free services are available for hosting your code online (e.g., GitHub or BitBucket). Note that there are usually limits to the usage of these hosting services, so check the terms when signing up. For instance, GitHub’s free services are usually preferred for free and open-source software.
Maintain Good Coding Style
Maintaining good coding style is usually an investment that pays large dividends in the long run: for instance, it makes it easier for other people to reuse your code, and for you to revisit and modify it prior to publication. While consistent naming of variables and functions, documentation, and spacing of text may not seem important in the moment, they often play a significant role later on in helping readers of the code. When possible, a group of collaborators should follow a style guide. The source of the guide (whether it is publicly established or team-specific) is less important than consistency. Style guides for C++, Java and Python are publicly available. Following good style conventions can also help catch mistakes and mitigate errors. Recently, a high-profile security vulnerability was exposed in Apple’s OS X operating system. While this bug should have been caught via proper testing, its dangerous effects could have been neutralized via proper coding style. Speaking of proper testing…
Test Your Code
Systematically testing and verifying your code is another time-consuming but worthwhile endeavor. It is a practice in software engineering that can’t be remotely summarized in a short paragraph. Writing automated tests is good practice in general, as it allows you to quickly retest frequently changing code. These are usually black-box tests, in which the code itself is not examined; instead, the output for a given input is matched against the expected output. Another good general tip is to write code that is as modular as possible, so that each module can be tested independently in a practice called unit testing. A number of tools are available for this, such as CppUnit (for C++) and JUnit (for Java). Unit testing is useful even for small software projects. If you are participating in a large software project, a closer examination of testing methodologies is strongly recommended.
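To give a flavor of what a unit test looks like, here is a minimal sketch. The tools named above target C++ and Java; this sketch instead assumes Python and its standard unittest module, which plays a similar role. The function mean and the test class MeanTest are hypothetical names invented for the illustration.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    if len(values) == 0:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    """Black-box tests: each one compares an output with an expected value."""

    def test_typical_input(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_single_value(self):
        self.assertAlmostEqual(mean([5.0]), 5.0)

    def test_empty_input_is_rejected(self):
        # An empty sequence has no mean, so the function should refuse it.
        with self.assertRaises(ValueError):
            mean([])
```

If this were saved in a file called test_mean.py (a hypothetical filename), running python -m unittest test_mean from the same directory would execute all three tests. Rerunning them after every change is what makes it cheap to catch a regression early.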
Hone Your Knowledge of Algorithms and Data Structures
Depending on your field of study, you may or may not want to devote extra time to understanding the algorithms and data structures used in computer science. The payoff is often a significant saving of time, if not the success of the experiment itself. As an example, scientists who write simulations often need to simulate random sampling without replacement*. A naive approach would be to simply pick a random number, check it against a list of already drawn numbers, and repeat until a sufficient number of samples has been drawn. While this works for small sample sizes, it quickly becomes prohibitively slow as the sample size grows (the study of algorithm run times is a branch of computer science called complexity analysis). A much faster approach is to enumerate all possible values and then shuffle them, a task for which a much faster algorithm exists (intuitively, think of shuffling a deck of cards versus drawing cards at random with replacement until all cards have been drawn at least once). For those already familiar with programming, many resources are available online to help develop and hone programming skills. Some examples include TopCoder tutorials and the UVa Online Judge, which offer tutorials, practice problems and even online judges that evaluate solutions to those problems. An added benefit: companies such as TopCoder, Google and Facebook often hold programming contests with large cash prizes, and top performers in these contests can gain significant advantages during hiring.
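The two sampling approaches described above can be sketched as follows. This is an illustration in Python under my own assumptions, not the code from the anecdote: the function names are invented, the population is taken to be the integers from 0 up to population_size - 1, and rng is assumed to be an instance of the standard library’s random.Random.

```python
import random


def sample_naive(population_size, sample_size, rng):
    """Redraw random numbers until enough distinct ones have been seen."""
    if sample_size > population_size:
        raise ValueError("cannot draw more distinct items than exist")
    drawn = []
    while len(drawn) < sample_size:
        candidate = rng.randrange(population_size)
        # Checking membership scans the whole list, and repeats of already
        # drawn numbers become more common as the list fills up.
        if candidate not in drawn:
            drawn.append(candidate)
    return drawn


def sample_by_shuffling(population_size, sample_size, rng):
    """Enumerate every item, shuffle once, and keep the first sample_size."""
    if sample_size > population_size:
        raise ValueError("cannot draw more distinct items than exist")
    items = list(range(population_size))
    rng.shuffle(items)  # a single pass over the list
    return items[:sample_size]
```

Calling sample_by_shuffling(52, 52, random.Random()) corresponds to shuffling the deck of cards, while sample_naive(52, 52, random.Random()) corresponds to drawing with replacement until every card has turned up. In everyday Python work, the standard library’s random.sample(range(n), k) already provides sampling without replacement, so neither function needs to be written by hand.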
Where to Begin
So, you’ve got some messy code lying around. The best way to improve is to introduce these practices little by little into your daily routine. Start by committing the code for the project you’re currently working on (here’s a nice step-by-step guide). Now, the next time you’re programming, give your variables more intuitive names. For parts of the code that are not very simple, include plenty of comments; you will thank yourself profusely when you’re looking at the same code some weeks later. Commit the code again once significant changes have been introduced. With these little improvements, you are already better off than 80% of scientists out there.
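To make the advice on naming and commenting concrete, compare two versions of the same small calculation. This is a hypothetical Python example: the function names, the unit conversion and the scenario are all invented for illustration.

```python
# Before: cryptic names, no documentation, and an unexplained constant.
def f(a, b):
    return [x * 1000.0 / b for x in a]


# After: the same calculation with descriptive names, a named constant,
# a docstring stating the units, and a check on the input.
MILLIGRAMS_PER_GRAM = 1000.0


def concentrations_mg_per_litre(masses_g, volume_litres):
    """Convert solute masses in grams to concentrations in mg/L.

    masses_g: iterable of solute masses, in grams.
    volume_litres: volume of the solution, in litres (must be positive).
    """
    if volume_litres <= 0:
        raise ValueError("volume_litres must be positive")
    return [mass * MILLIGRAMS_PER_GRAM / volume_litres for mass in masses_g]
```

Both functions return the same numbers for a valid input, but only the second one tells a reader some weeks later what the inputs mean and in which units the result is expressed.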
For a more serious programmer, various recommended reading lists are available online. It can also be helpful to participate in online forums, e.g., Hacker News or Reddit’s r/programming. Perhaps most helpful of all would be to learn by doing, such as taking on personal projects or interning at a tech company with a strong engineering culture.
The list of suggestions given here is of course far from complete. A large number of general software engineering practices would be hugely beneficial to anyone who codes, particularly scientists. Whether the investment of time is worth the potential benefits is a decision that needs to be made on a case-by-case basis. However, keeping these practices in mind may save a lot of time and headaches, and may give an advantage to those considering software engineering as a possible career after graduation.
John Zhang is a 2008 fellow of the Fulbright Science & Technology Award Program, from Canada, and a Software Engineer at Google, working on Maps.
The March edition of TGS magazine was guest edited by Puripant Ruchikachorn, a 2010 fellow of the Fulbright Science & Technology Award, from Thailand, and a PhD Candidate in Computer Science at Stony Brook University.
*This example is drawn from personal experience: the author helped a chemist optimize his code, resulting in significant speed gains.