Open Source Software used in research

In scientific research especially, using and developing open source software (OSS) has become practically the norm. There are a number of reasons for this beyond those that explain the general acceptance of OSS by, for example, consumers, industry, or government. Among these reasons are:

  • Increasingly, algorithms implemented in analysis software form an integral part of the methods described in scholarly publications. As such, it is completely at odds with rigorous peer review if these algorithm implementations are closed to outsiders.
  • Scientific collaboration more often than not spans multiple institutions and distributed research networks, where secrecy and command hierarchy are not maintained in the way that is ‘necessary’ for closed source development.
  • Many computational analyses are run in virtualized environments (such as institutional, national, or international ‘cloud’ infrastructures) and hosted on multi-user servers. Closed-source, commercial software often disallows such usage.
  • Open source software development often relies on volunteers. In a time of budgetary constraints for scientific research, this is a clear advantage.

For these and other reasons, open source tools are very commonly used in scientific research. This includes fields where many researchers are amateur developers themselves and rely on tools such as R for statistical analysis and scripting. Over the last decade, R has almost completely displaced commercial statistical software such as SPSS or JMP in many fields. In fields such as bioinformatics, which involve a lot of file handling of the outputs of DNA sequencing platforms, general-purpose scripting languages such as Python, and commonly used libraries built on top of them (such as Biopython), have become a vital part of many researchers’ toolkits.

Tools such as R and Python are essentially software for writing software. Although programming is increasingly common among researchers, not every scientist does it. One step away from programming is chaining together the inputs and outputs of various analysis tools into longer workflows. As an example from genomics, a very common workflow starts with high-throughput sequencing reads and then i) performs basic quality control checks; ii) maps the reads against a reference genome; and iii) identifies the points where the new data differ from the reference. These steps are routinely executed as a workflow in which a different open source executable is run in a Linux command-line environment for each of the three steps. Although this is arguably not quite open source software development, it does involve the usage and production of open source artifacts (such as Linux shell scripts) to which the principles that we discuss in this module apply.
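The three-step pattern described above can be sketched in Python as a small runner that executes one command per step and stops at the first failure. This is only an illustration: the function name, the step names, and the commands are placeholders invented here (each command just calls the Python interpreter to print a message). In a real genomics workflow, each entry would call whichever quality-control, mapping, and variant-calling executables you have chosen.

```python
import subprocess
import sys


def run_workflow(steps):
    """Run each (name, command) step in order; stop at the first failure.

    Returns the list of step names that completed successfully.
    """
    completed = []
    for name, command in steps:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed step never silently feeds bad output to the next one.
        subprocess.run(command, check=True)
        completed.append(name)
    return completed


# Placeholder commands: each step only prints a message via the Python
# interpreter. Swap in the real command-line tools for your own analysis.
EXAMPLE_STEPS = [
    ("quality_control", [sys.executable, "-c", "print('QC checks done')"]),
    ("map_reads", [sys.executable, "-c", "print('reads mapped')"]),
    ("call_variants", [sys.executable, "-c", "print('variants called')"]),
]

if __name__ == "__main__":
    print(run_workflow(EXAMPLE_STEPS))
```

Keeping the steps in a plain list makes the workflow itself a shareable artifact: the order of the tools and their arguments are recorded in one place rather than retyped at the command line.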

Lastly, open source software is also used in scientific research for reasons that more closely mirror those that drive the adoption of OSS in wider society, namely that it is cheap. For example, individuals or organizations might decide to switch from Microsoft Office to LibreOffice for manuscript writing or spreadsheet processing because the latter is free (both as in ‘free beer’ and ‘free speech’). Likewise, the choice to switch from ArcGIS to QGIS for the analysis of geographic information might be prompted simply by cost considerations.

Getting Started with OSS - FAQ

I’m using X (e.g. MATLAB, Stata, Excel) and I want to transition to something more open. What are the next steps?

Even if you are using proprietary software, you can usually still share your source code/documents etc. The best first step is sharing whatever you can.

Great! I can put them in my new GitHub repo.

If that’s enough for you for now, great! If not, there are Open Source equivalents for most pieces of proprietary software. Have a go with one and see what you think.

Closed        Open
MATLAB        Python, Julia
Stata/SPSS    R
MS Office     LibreOffice
Mathematica   JupyterLab

Test out your new pull request (PR) skills by adding your own example here.

Cool! But if I make the switch, will I be stuck taking ages to learn a new tool, without support, or with buggy software?

Good question! The answer is that it depends. The best thing to do is find someone who has made the switch before and learn from their experience. Or just do a web search! Some OSS is much better than its closed counterparts and some is not, so it’s worth choosing carefully.

Making good software for re-use

The person most likely to want to re-use your software in the future is…you! So while sharing is always better than not sharing, you can make your own life, and that of others, much easier through appropriate documentation. Documentation can include several things, such as helpful comments and annotations in the code that explain why a particular action was performed, rather than merely restating what the code does.
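As a small illustration of the difference between a ‘what’ comment and a ‘why’ comment, consider the sketch below. The function name, the threshold of 20, and the reasoning given in the comment are all hypothetical, chosen only to show the style.

```python
def filter_reads(quality_scores, min_quality=20):
    """Return only the scores at or above min_quality."""
    # A "what" comment would just restate the code: "loop over the scores".
    # A "why" comment records the decision behind it:
    # scores below the threshold are dropped because, in this hypothetical
    # analysis, they were judged too unreliable for downstream steps.
    return [score for score in quality_scores if score >= min_quality]
```

A future reader (including you) can see what the code does by reading it; the comment is the only place the reasoning behind the threshold survives.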

One of the most critical aspects of this is an informative README file. Almost every OSS project has one, and sometimes more than one. It can be good practice to include a README in every directory, containing a list of files, a table of contents, and a statement of the directory’s purpose. The README file is typically just plain text or Markdown (again, like all of the ones for the MOOC!), and can include critical information on how to install and run the software, its dependencies and requirements, as well as tutorials or examples.
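A per-directory README of the kind described above can be started automatically. The following is a minimal sketch, assuming a plain-text README that states the directory’s purpose and lists its files; the function name and the layout are our own choices, not a standard.

```python
from pathlib import Path


def build_readme(directory, purpose):
    """Return plain-text README content describing one directory."""
    directory = Path(directory).resolve()
    # List regular files only, sorted so the output is stable between runs.
    files = sorted(p.name for p in directory.iterdir() if p.is_file())
    lines = [
        f"README for {directory.name}",
        "",
        f"Purpose: {purpose}",
        "",
        "Files:",
    ]
    lines += [f"  - {name}" for name in files]
    return "\n".join(lines) + "\n"
```

For example, `print(build_readme(".", "Scripts for the analysis"))` shows a skeleton for the current directory. The generated text is only a starting point: installation steps, requirements, and usage examples still need to be written by hand.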

The purpose here is to provide sufficient information to maximise the re-use and reproducibility of the computational environment, such that someone with no experience with the project can easily access and re-use the software (Sandve et al., 2013). By lowering the barriers to entry, you increase the chances of others being able to re-use your work, which is one of the ultimate goals of OSS (Ince et al., 2012).

An extension of this that can make future re-use even easier is container technology. Containers are like an ecosystem frozen in time: the code, the data, and any other dependencies are all packaged and saved in their present, functioning versions so that in the future anyone can come in and run the analyses again. As such, they are generally good for re-use, but this can come at the expense of modification or understanding by others, as a lot of detail can be hidden within the source code and its dependencies. Common examples of container implementations in research include Rocker (Docker containers for the R language), Binder, and Code Ocean.

Sustainable software is good software.


10 simple rules for reproducible computational research

The 10 simple rules for making computational research more reproducible, based on Sandve et al. (2013), are:

  1. For every result, keep track of how it was produced.
  2. Avoid manual data manipulation steps.
  3. Archive the exact versions of all external programs used.
  4. Version control all custom scripts.
  5. Record all intermediate results, when possible in standardised formats.
  6. For analyses that include randomness, note underlying random seeds.
  7. Always store raw data behind plots.
  8. Generate hierarchical analysis output, allowing layers of increasing detail to be inspected.
  9. Connect textual statements to underlying results.
  10. Provide public access to scripts, runs, and results.
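Several of these rules can be supported with very little code. The sketch below illustrates rules 1 and 6 only: it runs a toy analysis with an explicit random seed and stores the result together with the details needed to reproduce it. The function names and the fields in the record are assumptions made for this example, not a prescribed format.

```python
import json
import platform
import random
import sys


def run_analysis(seed):
    """Toy analysis: draw five pseudo-random numbers from a seeded generator."""
    rng = random.Random(seed)  # a local generator avoids hidden global state
    return [rng.random() for _ in range(5)]


def provenance_record(seed, result):
    """Bundle a result with the details needed to reproduce it."""
    return {
        "seed": seed,                                  # rule 6
        "python_version": platform.python_version(),   # part of rule 3
        "command": sys.argv,                           # rule 1
        "result": result,
    }


if __name__ == "__main__":
    seed = 42
    record = provenance_record(seed, run_analysis(seed))
    # JSON is a standard, plain-text format that is easy to archive (rule 5).
    print(json.dumps(record, indent=2))
```

Because the seed is stored next to the result, re-running `run_analysis` with the same seed on the same Python version gives the same numbers.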

Infographic adapted from Sandve et al. (2013). Feel free to download this to keep handy during your research!


If you follow these steps, along with the processes in Task 1 and Task 2, you should be fine!


Open Source licensing

An Open Source license is a type of license designed specifically for software and code that makes explicit the legal conditions for sharing and re-use. As mentioned above, the addition of a suitable license is what differentiates publicly shared software from OSS. For example, the widely used MATLAB is proprietary software, whereas GNU Octave is an openly licensed alternative.

There are currently more than 1,400 unique Open Source licenses, a proliferation that makes it difficult to understand how the legal implications differ from one license to another.

Some of the more common licenses include:

There are two ways in which contributions to a project become licensed:

  1. Explicitly, whereby the individual contribution has a clearly indicated license independent of the main project; or
  2. Implicitly, whereby the contribution falls under the original license of the main project.

Thankfully, selecting an Open Source license is relatively straightforward with user-friendly tools such as Choose A License. Open Source licenses allow other users to use, copy, distribute, and build upon your work, often while ensuring that the creators are appropriately recognised for it. The key is selecting an appropriate license for your work, depending on what you want, or do not want, others to do with it.


Software citation

Citations provide one of the most important interactions in scholarly research, forming the basis of our referencing and metrics systems. Typically, citation relies on a permanent unique identifier such as a Digital Object Identifier (DOI). A DOI is a persistent identifier, implemented in the Handle System, that follows a common standard and is widely used to identify academic information. Such identification is critical for tracking the genealogy and provenance of research, for reproducibility, and for giving appropriate credit to those who have created the software. Importantly, software should be considered a legitimate output of scholarly research, and citation is becoming an increasingly common way to indicate that.

Smith et al. (2016) set out the principles of software citation in a research paper written as part of the FORCE11 Software Citation Working Group. Just as citing the software you have used is part of good research practice, it is important to make your own software easily citable too. When citing any software used for your own research, you should include at minimum:

  • The author name(s),
  • Software title,
  • Version number, and
  • The unique identifier/locator (DOI or URL).
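The four minimum elements listed above map naturally onto a small record type. The sketch below is illustrative only: the class name, the output layout, and the example values (including the `example.org` address) are made up, and real citation styles vary by journal.

```python
from dataclasses import dataclass


@dataclass
class SoftwareCitation:
    authors: list      # author name(s)
    title: str         # software title
    version: str       # version number
    identifier: str    # DOI or URL

    def format(self):
        """Return a one-line citation; the layout is illustrative only."""
        names = ", ".join(self.authors)
        return f"{names}. {self.title} (version {self.version}). {self.identifier}"


# Entirely made-up example values.
example = SoftwareCitation(
    authors=["J. Doe", "R. Roe"],
    title="ExampleTool",
    version="1.2.0",
    identifier="https://example.org/exampletool",
)
print(example.format())
# prints: J. Doe, R. Roe. ExampleTool (version 1.2.0). https://example.org/exampletool
```

Making every field required means a citation cannot be created without a version number or an identifier, which are the elements most often left out.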

The six principles of software citation by Smith et al. (2016) are provided here:

  • Importance: Software should be considered a legitimate and citable product of research. Software citations should be accorded the same importance in the scholarly record as citations of other research products, such as publications and data; they should be included in the metadata of the citing work, for example in the reference list of a journal article, and should not be omitted or separated. Software should be cited on the same basis as any other research product such as a paper or a book, that is, authors should cite the appropriate set of software products just as they cite the appropriate set of papers.

  • Credit and attribution: Software citations should facilitate giving scholarly credit and normative, legal attribution to all contributors to the software, recognizing that a single style or mechanism of attribution may not be applicable to all software.

  • Unique identification: A software citation should include a method for identification that is machine actionable, globally unique, interoperable, and recognized by at least a community of the corresponding domain experts, and preferably by general public researchers.

  • Persistence: Unique identifiers and metadata describing the software and its disposition should persist - even beyond the lifespan of the software they describe.

  • Accessibility: Software citations should facilitate access to the software itself and to its associated metadata, documentation, data, and other materials necessary for both humans and machines to make informed use of the referenced software.

  • Specificity: Software citations should facilitate identification of, and access to, the specific version of software that was used. Software identification should be as specific as necessary, such as using version numbers, revision numbers, or variants such as platforms.

Note: For instructions on ‘how to make your software citable’ see the section Task 2: Linking GitHub and Zenodo.


Using GitHub and Zenodo

GitHub is a popular tool for project management, content storage, and version control. Note that GitHub itself is not OSS, although Git, the tool on which it is based, is. Git is designed to help manage the source code files, and the updates to them, for a software-related project. It can also be used for non-software projects; for example, this MOOC!

Getting research onto GitHub is just the first step. It is equally important to make it persistent and re-usable, which is why having a Digital Object Identifier (DOI) associated with it can be useful. The simplest way to do this is through a service called Zenodo, a free and open source multi-disciplinary repository created by OpenAIRE and CERN, which can be used to assign a DOI to individual GitHub repositories. There is a GitHub Guide that explains the details, which involve linking GitHub repositories directly to Zenodo so that when developers create formal releases of their software, Zenodo creates and archives a copy of that version. There’s nothing special about using Zenodo for creating DOIs, other than that it’s free of cost; other general repositories can also be used, such as DataCite DOI Fabrica, or your own institutional repository, such as Caltech’s.
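When a GitHub repository is linked to Zenodo, the metadata attached to each archived release can be supplied in a `.zenodo.json` file at the root of the repository. The sketch below generates such a file. The field names (`title`, `description`, `upload_type`, `creators`, `license`) follow Zenodo’s deposit metadata as we understand it and should be checked against the current Zenodo documentation; the helper function names are our own, and any values you pass in are up to you.

```python
import json


def zenodo_metadata(title, description, creators, license_id):
    """Build a metadata dictionary for a software release.

    The keys are assumed to match Zenodo's deposit metadata fields.
    """
    return {
        "title": title,
        "description": description,
        "upload_type": "software",
        "creators": [{"name": name} for name in creators],
        "license": license_id,
    }


def write_zenodo_file(metadata, path=".zenodo.json"):
    """Write the metadata as JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
```

Keeping this file under version control means the author list and license recorded with each DOI are reviewed alongside the code, rather than edited by hand after every release.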

One of the more popular and useful functions of GitHub is the issue tracker, which is used to organise OSS development.

A lot of researchers might typically be afraid of sharing code which is incomplete, buggy, or imperfect. However, in the OSS community, such a practice of sharing ‘raw’ code is fairly commonplace. Sharing code openly enables others to re-use and improve it, as well as to engage in a deeper way with any research associated with it. This is one of the fundamental aspects of peer-collaboration, perhaps best exemplified by the traditional process of research manuscript peer review.

