Ask a bunch of scientists whether they use spreadsheets in their work and you’re bound to touch a nerve. Many have sworn off spreadsheets, others swear by them and some swear profusely when they’re forced to use them.
What makes this epitome of corporate monotony so polarizing? Spreadsheets are broadly accessible, but can cause headaches for the unwary. It’s easy to accidentally overwrite contents in a data cell, for instance, or to select the wrong range for a calculation and end up with incorrect values. Sometimes the software itself can get in the way: for instance, Microsoft Excel is notorious in genetics circles for automatically converting gene names such as OCT4 into dates.
Six tips for better spreadsheets
But that doesn’t mean they’re a lost cause. “There is kind of this belief that a spreadsheet is not a good tool for doing science,” says Yanina Bellini Saibene, “and that is not correct.” Bellini Saibene is a data scientist in Santa Rosa, Argentina, and a community manager for rOpenSci, a non-profit initiative in Berkeley, California, that provides open-source software tools for users of the programming language R. Spreadsheets have a place in researchers’ toolboxes, she says, but, like any tool, their effectiveness depends on how they are used.
Unfortunately, a lack of standardized training has led some researchers to develop bad data-management habits that ultimately undermine the spreadsheet’s ability to be useful, says Karl Broman, a statistician at the University of Wisconsin–Madison, who has written about spreadsheet data management (K. W. Broman & K. H. Woo Am. Stat. 72, 2–10; 2018). “People are creative in the ways in which they will abuse spreadsheets,” he says with a laugh.
The next time you think about using a spreadsheet for research, here are six questions you can ask yourself to make them more effective.
Where are my raw data?
Before doing anything with a data file, make sure you have saved an untouched, clearly labelled version of the raw data — ideally as a read-only file in a separate folder.
“Every time you interact with Excel is an opportunity to accidentally type something extra somewhere and not realize it,” says Broman. When even a single keystroke can surreptitiously overwrite or delete data, he urges researchers to lock down their main data file and not mess around with it at all. Instead, make a copy of the raw data and use the duplicate as your working version.
What do I want to accomplish?
Marla Hertz, a research data management librarian at the University of Alabama at Birmingham, says that many spreadsheet-related problems stem from trying to do all the steps of an analysis in a single file. Each stage of the data’s life cycle — from raw and processed data to analysis and the final figure — serves a distinct purpose and should be kept separate from the other stages.
Obose Eselebor, a research software technician at the Leeds Institute for Data Analytics at the University of Leeds, UK, says that her strategy for data cleaning and analysis depends on her goal, whether that is a quick visualization or a publication-worthy figure. Having a clear objective helps Eselebor to stay focused while cleaning up large, messy data sets. Her suggestion to avoid getting overwhelmed is to tackle issues one at a time, such as standardizing all the date formats or tracking down all the empty cells, before moving on.

Data scientist Yanina Bellini Saibene (left), research data management librarian Marla Hertz (centre) and research software technician Obose Eselebor.Credit: L-R: EMBL Photolab; Lexi Coon/The Board of Trustees of The University of Alabama for The University of Alabama at Birmingham; HDR UK Black Internship Programme
Am I using the right tool?
If the goals are data entry and collection, blank spreadsheets are adequate. But there are better tools that can help to avoid later problems, says Crystal Lewis, a freelance data-management consultant in St Louis, Missouri. Entering data using digital forms, such as Google Forms, Microsoft Forms, REDCap or Qualtrics, can reduce data-input mistakes by giving users a defined list of options, restricting the types of value that can be fed in so data are not put in the wrong place.
For pure data analysis, however, spreadsheets should not be your first choice, advises data scientist Heidi Seibold in Munich, Germany, who is co-executive director of the Digital Research Academy, a network that provides training to improve research quality. Unlike a script written in a programming language such as Python or R, which documents every step of the process and can be saved, versioned and rerun, an analysis that happens inside a spreadsheet using pointing and clicking is hard to follow and even harder to replicate.
Autocorrect errors in Excel still creating genomics headache
Spreadsheets should also not be used if you have a particularly large data set, says Eselebor. She recalls a time when what should have been a quick peek at a spreadsheet with 100,000 rows kept freezing her laptop. “Usually, I switch to either Python or SQL to work with huge files.” (SQL, or Structured Query Language, is a programming language used to manage information in databases.)
Broman agrees that databases can be helpful when things are “getting really big or really complicated, or when you need to have much faster access to the data”, but the process usually calls for some programming skills. If the project is complex enough to require a database, it’s probably a good idea to get programmers involved anyway, he says.
How should I format my sheet?
One of the most important things a researcher can do is to keep their spreadsheet machine-readable. This means formatting it in such a way that a computer program can move through the data logically and process the information in each cell accurately.
“The computer is going to want a rectangle of data, where the rows are individuals or subjects and the columns are measurements,” Broman says. He advises that all the values in a column should contain the same type of data with the same formatting.
Cells that are empty, merged or contain spaces or special characters can also hamper machine readability, as can mixing data types in a cell (such as including both a value and its unit). Either hyphens or underscores can be used to separate words in lieu of spaces, says Ming Tommy Tang, director of bioinformatics at the drug firm AstraZeneca in Waltham, Massachusetts. But he urges users not to use them interchangeably: “Just be consistent.”
How do I explore data responsibly?
If you do any sort of analysis or investigation in Excel, keep each data set in its own worksheet and restrict calculations and visualizations to dedicated tabs, advises Bellini Saibene.