The PDB wasn’t built in a day: lessons in data sharing

History lessons from the PDB

It has taken nearly 40 years for the Protein Data Bank to evolve into its current form.

‘Climategate’ is raising all sorts of questions about data sharing in science – and more murky suggestions of ‘hiding’ information. It looks pretty bad, but it’s important not to panic and to learn the right lessons from this.

The wrong lesson to learn is ‘don’t use email or other traceable forms of communication to discuss your data’. (Take a look at the Chilcot Inquiry into the Iraq war: the surprise was not that the lawyers gave conflicting advice on the legality of the war, but that the advice was actually recorded.)

The right lesson is that scientists need to share data better. But this isn’t as simple as it sounds. I think a quick history of the Protein Data Bank (PDB), a highly regarded archive of three-dimensional structure data for biological macromolecules, might be helpful. We need to realise that it took 18 years from the PDB’s founding to get to the point that journals and funding bodies required all protein-structure coordinates to be deposited.

Very soon after the first protein structures were solved, just over 50 years ago, crystallographers realised that the structures contained useful information that could be used time and time again, and that they needed a way to store it. During the late 1960s and early 1970s, a group of crystallographers got together to work out how best to store these data, and in October 1971 the PDB was announced. This was 10 years before sequence databases such as EMBL-Bank and GenBank appeared.

Scientists at the PDB initially wrote to the authors of every structure paper published, which was easy enough as there were few structures available at that point. They asked for the coordinates of each structure and, if the authors obliged, received a set of punch cards in return. As each atom was on a separate card, we’re talking about hundreds or thousands of cards per structure. Later, the data were submitted on magnetic tape, and now they arrive through a web form.

Early submission to the PDB was hit and miss: some crystallographers passionately believed in sharing data, while others were ambivalent or even hostile. That hostility was understandable, considering that protein structures took years, sometimes decades, to produce. By 1974, the PDB had a grand total of thirteen structures, with four pending.

Now if we fast-forward to the late 1980s, we can see parallels with today’s open science movement and the move to make more data freely available. The PDB had a powerful advocate in Fred Richards, the scientist who produced the third protein structure ever and the first to be solved in the United States. He formed a committee to lobby journal editors to make deposition of coordinates in the PDB a requirement for publication and persuaded the National Institutes of Health to make it a requirement for further funding.

In 1989, Richards’ work paid off and most journals began insisting that coordinates be deposited in the PDB before publication, although a hold of up to a year was still possible.

By 1 January 2010, the PDB contained 62,388 sets of coordinates, which flags up another important factor in its success: it began at the beginning. It started when very few structures were available and it was able to develop in a systematic and scalable way.

It is also worth noting that most of the raw data is not stored within the PDB; it contains coordinates, structure factors and some methodological information, although some people choose to include additional information. This means that although the curators at the PDB are able to spot irregularities, like those in the ‘fraudulent’ structures deposited by Krishna Murthy, they do not have access to the diffraction data and electron density to work out what has gone on. More information on these structures and their flaws can be found on the excellent P212121 blog.

Is this an encouraging tale for those at the centre of ‘Climategate’? Probably not. Some of the elements are there, such as pressure from outside agencies to publish the data, but they’ll need a champion from within who sees the need to make the information public. They will also have to work out how to manage the data now that they have so many years of records: I don’t know how they organise it between themselves now – it may not be as simple as ‘just publish it’.

And finally, a new issue has arisen today from a BBC interview with Professor Phil Jones, the climate-change researcher at the centre of the issue, who says that his management of the raw data was not as good as it could be. It looks as though, in this field, the raw data will also need to be made available.

Many disciplines are struggling with how to store and organise data, but we can all learn from the slow and steady progress of the PDB.

