Posted by: woodforthetrees | February 19, 2010

Why would I use the Structural Genomics Knowledgebase and not the PDB?

Structural Genomics Knowledgebase

The Structural Genomics Knowledgebase is not a rival to the PDB.

I’m asked this question a lot, so I’ll try to explain it here. It’s worth noting at the outset that the Structural Genomics Knowledgebase and the RCSB PDB have the same director, so the two are designed to complement rather than rival each other.

The PDB, or Protein Data Bank, is primarily an archive that holds information about the three-dimensional structures of biological molecules. It initially held only protein structures, but now it has nucleic acid structures too. It also contains some basic information on the methods used to solve the structure and links to related publications. If you’re asking a specific structural question, then start with the PDB.

The Structural Genomics Knowledgebase is a web portal that brings together information from multiple databases and sources in one place. Instead of having to enter your protein sequence into lots of databases, you enter it just once. If you’re asking a general question about a protein, then this is the right place to start.

A search of the Structural Genomics Knowledgebase gives you information on:

  • Any structures in the PDB
  • Links to Proteopedia and many other databases
  • Any existing protein models (mainly from SWISS-MODEL)
  • Whether one of the structural genomics groups is working on the structure
  • How much progress has been made on the structure
  • Detailed cloning, expression and purification protocols
  • Where to get hold of clones containing the gene to express the protein
  • Recent methods that have been developed
  • Information on all the technology developed by the Protein Structure Initiative
  • Structures that have been highlighted by Nature Publishing Group
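The aggregation pattern behind such a portal is simple to sketch: fan a single query out to several sources and merge the answers. The source functions below are hypothetical stand-ins, not the Knowledgebase’s real back end:

```python
# Fan-out/merge sketch of a one-stop portal query.
# The three source functions are invented stand-ins for real databases.

def search_structures(sequence: str) -> dict:
    return {"pdb_ids": []}          # pretend PDB lookup

def search_models(sequence: str) -> dict:
    return {"models": []}           # pretend model-repository lookup

def search_progress(sequence: str) -> dict:
    return {"in_progress": False}   # pretend target-tracking lookup

def portal_search(sequence: str) -> dict:
    """Enter the sequence once; merge the answers from every source."""
    results = {}
    for source in (search_structures, search_models, search_progress):
        results.update(source(sequence))
    return results

print(portal_search("MKTAYIAKQRQISFVKSHFSRQ"))
```

The design choice is the point: the user pays the cost of one query, and the portal absorbs the complexity of talking to many services.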

It’s one of the new generation of data services that I think Don Tapscott and Anthony Williams refer to in their book Wikinomics:

Scientists are using Web services to revolutionize the way they extract and interpret data from different sources, and to create entirely new data services…. Imagine you had the power to weave together all the latest data on [a] species from all the world’s biological databases with just one click. It’s not far-fetched. That power is here, today.

To my mind, the challenges in creating a useful data service are twofold: first, the technical issue of how to share the data; second, the question of how best to present so much data without overwhelming the user. Web portals and web services are evolving all the time as we learn how best to present and share data.

Posted by: woodforthetrees | February 17, 2010

Explaining structural genomics to the man in the pub

Is structural genomics like collecting butterflies?


It seems remarkable now to think that the first draft of the human genome was announced by US President Bill Clinton and UK Prime Minister Tony Blair. That these two popular politicians – in 2000, Clinton was coming to the end of his term in office but had approval ratings his successors can only envy, and Blair was still riding high in the polls – wanted to associate themselves with the project is a testament to the skill of those who communicated its aims and achievements.

What will happen at the end of structural genomics? I doubt very much that President Barack Obama or potential Prime Minister David Cameron will even notice, though an interesting opinion piece in Structure by Aled Edwards, the head of the Structural Genomics Consortium, envisaged a press conference in 2030 announcing that the problems of climate change and the lack of drinkable water had been conquered thanks largely to the efforts of the US structural genomics programme, the Protein Structure Initiative.

Edwards’ optimism for a glorious ending isn’t widely shared. But why was the Human Genome Project so popular and structural genomics much less so? There are several reasons, and the use of metaphor to explain genome sequencing was one of them.

I asked my family (all non-biologists) why the Human Genome Project was important. “That’s easy,” they said, “it’s about us. It’s about what makes us who we are.” They didn’t say ‘blueprint’ – the metaphor used by Clinton when he announced the first draft – but full marks to the Human Genome Project for communicating its work.

OK then, so what is structural genomics all about? “Er, structures of genomes?” they suggest. What type of structures? “Genome ones”. No, you haven’t got it.

I need a metaphor. I need an explanation that’s going to work for the man in the pub.

Gregory Petsko tried the Rosetta Stone. The stone bears three inscriptions – one in hieroglyphs, one in Demotic and one in ancient Greek – and it was the key to solving the puzzle of hieroglyphics. Each inscription was useful on its own, but all three were needed to explain what it meant. In the same way, sequencing the human genome produced lots of valuable information, but it will only make sense when we put it together with all the other pieces, such as structural genomics.

PSI boats are ready

All the structural genomics 'boats' are built and ready to explore the protein structure world. Photo from Navy of Brazil.

A less flattering metaphor came from Thomas Steitz who compared structural genomics to butterfly collecting. While the collection can tell you something about the size and shape of butterflies, it can’t tell you how they fly.

A metaphor that is beginning to surface from within the field is the ‘dark matter of protein space’. What this means is that, now we have solved lots and lots of protein structures, we can see that certain shapes, or ‘folds’, are common, but for some reason not all the theoretically possible folds are used. As we’re only looking at structures from organisms on Earth (as if that weren’t enough!), we can’t exclude the possibility that these folds exist elsewhere, and so, in an analogy with physics, Willie Taylor, a scientist at NIMR, said, “The universe is very big and, like dark matter, the bulk [of folds] might exist elsewhere.”

This is not going to go down well at The Black Swan.

How do I see the structural genomics researchers? I think they are like the explorers of the Age of Discovery. They spent the first five years building their boats and undertaking a few short journeys to make sure their vessels were seaworthy, and then they set sail for far-off lands. These explorers cannot visit every land, every sea and every mountain, but that isn’t important; the important thing is to produce the outline, the map of the world.

No doubt people in the fifteenth and sixteenth centuries thought these journeys utterly pointless. But the discoveries kick-started the modern era, had a huge economic impact through new trading routes, and marked the start of science as we know it.

I think the structural genomics people are in the process of mapping Australia and New Zealand right now.

Posted by: woodforthetrees | February 14, 2010

The PDB wasn’t built in a day: lessons in data sharing

History lessons from the PDB

It has taken nearly 40 years for the Protein Data Bank to evolve into its current form.

‘Climategate’ is raising all sorts of questions about data sharing in science – and more murky suggestions of ‘hiding’ information. It looks pretty bad, but it’s important not to panic and to learn the right lessons from this.

The wrong lesson to learn is ‘don’t use email or other traceable forms of communication to discuss your data’. (Take a look at the Chilcot Inquiry into the Iraq war: the surprise was not the conflicting advice from the lawyers on the legality of the war but that it was actually recorded.)

The right lesson is that scientists need to share data better. But this isn’t as simple as it sounds. I think a quick history of the Protein Data Bank (PDB), a highly regarded archive for three-dimensional structure data for biological macromolecules, might be helpful. We need to realise that it took 18 years to get to the point where journals and funding bodies required all protein-structure coordinates to be deposited.

Very soon after the first protein structures were solved, just over 50 years ago, crystallographers realised that they contained useful information that could be used time and time again, and that they needed a way to store it. During the late 1960s and early 1970s, a group of crystallographers got together to work out how best to store this data, and in October 1971 the PDB was announced. This was 10 years before sequence databases such as EMBL-Bank and GenBank.

Scientists at the PDB initially wrote to the authors of every structure paper published, which was easy enough as there were few structures available at that point. They asked for the coordinates of each structure and if the authors obliged, they sent the PDB a set of punch cards. As each atom was on a separate card, we’re talking about hundreds or thousands of cards. Later, the data were submitted on magnetic tape and now by a web form.
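That punch-card legacy survives in the PDB file format: an ATOM record is still a fixed-width, 80-column line, one atom per line. A minimal sketch of reading one (the column positions follow the published PDB format description; the example line is illustrative, not taken from a real deposition):

```python
# Parse one fixed-width ATOM record from a PDB file.
# Column slices (0-indexed) follow the PDB format description:
# atom name in columns 13-16, residue name in 18-20, x/y/z in 31-54.

def parse_atom_record(line: str) -> dict:
    """Extract the atom name, residue name and coordinates from one ATOM line."""
    return {
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

# An illustrative ATOM line (made up for this example):
record = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00 10.00           C"
print(parse_atom_record(record))
```

One line per atom is exactly one card per atom; the 80-column width is the width of a punch card.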

Early submission to the PDB was hit and miss: some crystallographers passionately believed in sharing data; others were ambivalent or even hostile. And any hostility was understandable considering that protein structures took years, sometimes decades, to produce. By 1974, the PDB had a grand total of thirteen structures, with four pending.

Now if we fast-forward to the late 1980s, we can see parallels with today’s open science movement and the move to make more data freely available. The PDB had a powerful advocate in Fred Richards, the scientist who produced the third protein structure ever and the first to be solved in the United States. He formed a committee to lobby journal editors to make deposition of coordinates in the PDB a requirement for publication and persuaded the National Institutes of Health to make it a requirement for further funding.

In 1989, Richards’ work paid off and most journals began insisting that coordinates were placed in the PDB before publication, although a hold of up to a year was still possible.

By 1 January 2010, the PDB contained 62,388 sets of coordinates, which flags up another important factor in its success: it began at the beginning. It started when very few structures were available and it was able to develop in a systematic and scalable way.

It is also worth noting that most of the raw data is not stored within the PDB; it contains coordinates, structure factors and some methodological information, although some people choose to include additional information. This means that although the curators at the PDB are able to spot irregularities, like those in the ‘fraudulent’ structures deposited by Krishna Murthy, they do not have access to the diffraction data and electron density to work out what has gone on. More information on these structures and their flaws can be found on the excellent P212121 blog.

Is this an encouraging tale for those at the centre of ‘Climategate’? Probably not. Some of the elements are there, such as pressure from outside agencies to publish the data, but they’ll need a champion from within who sees the need to make the information public. They will also have to work out how to manage the data now that they have so many years of records: I don’t know how they organise it between themselves now – it may not be as simple as ‘just publish it’.

And finally, a new issue arose today from a BBC interview with Professor Phil Jones, the climate-change researcher at the centre of the affair, who says that his management of the raw data was not as good as it could have been. It looks like, for this field, the raw data will also need to be made available.

Many disciplines are struggling with how to store and organize data, but we can all learn from the slow and steady progress of the PDB.

Posted by: woodforthetrees | February 11, 2010

Who will be science’s Max Clifford?

Max Clifford might have helped Phil Jones

With Max Clifford's help, climate-change scientist Phil Jones might have had better publicity. From wikimedia commons.

I have a confession to make: I admire Max Clifford, the British PR guru to the stars, famous for defending unpopular clients and for selling ‘kiss and tell’ stories to the tabloid papers. He is a very smooth operator and can paint anyone in a good light. Luckily, as I solemnly pledged in 1998 never to sleep with David Beckham after his petulance in the World Cup, I have had no need of Clifford’s services.

But I know someone who could do with his help: Professor Phil Jones from the University of East Anglia, the man at the centre of the hacked emails about climate change. I’m a bit behind the curve on ‘Climategate’, but it seems to me that a large part of the damage to the perception of climate-change research could have been mitigated by a vigorous charm offensive in the media as soon as the story broke.

With the help of Clifford, Jones could have toured the studios, given newspaper interviews and participated in photoshoots. Jones would have explained the pressure he was under, put his case forward and said that he was sorry. Then Clifford would have stepped in and persuaded Sun readers that his client only wanted to save the world.

I’m a bit less confident that Clifford could explain the pros and cons of peer review or convey the value of consensus in science. And I’d advise him not to drag Karl Popper into all this – his philosophy of science is probably a red herring here and we’re in enough trouble already. But I think Clifford, or any good PR adviser, could have stopped a bad situation turning into a debacle.

How much does Max Clifford cost? Should we all chip in? You never know who might need him next.

Posted by: woodforthetrees | February 9, 2010

What is structural genomics?

Structural genomics has solved lots of structures

Thousands of protein structures have been solved by structural genomics centres. From wikimedia commons.

According to Wikipedia:

Structural genomics seeks to describe the three-dimensional structure of every protein encoded by a given genome.

But what does this mean?

The first time I heard about structural genomics it had a much better name, and a more limited scope: it was being called the ‘human protein structure project’. It was 1996; the Human Genome Project was well underway and had really caught the public’s imagination. Looking at the proteins encoded by these genes seemed the obvious next step.

I was a PhD student attending a summer school course and we were celebrating the end of the week with a formal meal. David Blow, a renowned scientist who was in the same lab in Cambridge as James Watson and Francis Crick when they published the structure of DNA, was the guest speaker. His view was that it was an extraordinarily exciting time to be in science, and that to really understand human biology we needed a human protein structure project.

So the initial idea was to establish the molecular shape, the three-dimensional structure, of every protein in the body. This was a big ask, especially considering that at that point we didn’t know how many protein-coding genes there were in the human genome; the working estimate at the time was 100,000. (We still don’t really know the answer, but it looks to be between 20,000 and 40,000.) And if that wasn’t a big enough problem, it didn’t take into account how difficult it can be to solve a protein structure. Years can be lost in the lab trying to produce protein that is suitable to work with.

Around 1997 various pilot projects started, notably in Japan. Along the way the aim of the project grew and grew: why limit ourselves to one genome when there are so many others?

On a practical level this makes sense. You can spend years struggling with a human protein only to find that a virtually identical one from a pig works like magic. Besides, the human genome was practically the beginning of the field of genomics; thousands of genomes have now been sequenced. And if we’re hoping to use this information to treat disease, then we’ll need to know what the protein structures of micro-organisms look like and, importantly, which proteins these bacteria have that we don’t, so that we can make drugs that disable only the bacterial ones and not our own.

Solving all the structures of every genome is a Herculean task, partly because of its sheer scale and partly because it is never-ending: we keep discovering new organisms and thus new genomes.

In practice, the idea is to solve enough structures – which means thousands of structures – to know what’s out there. In an ideal world, we’d reach the point where, for each new genome that is sequenced, we could say: ‘Ah, that new gene is very similar to one we’ve seen before in budding yeast, so we can confidently predict that the protein structure will look like this…’
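That ideal-world lookup can be sketched in code. This is a toy illustration, not a real prediction method: the sequences, fold names and similarity cut-off are all invented, and Python’s difflib similarity ratio stands in for a proper sequence alignment:

```python
from difflib import SequenceMatcher

# Toy library mapping known sequences to solved folds.
# The sequences and fold assignments are invented for illustration.
KNOWN_FOLDS = {
    "MKTAYIAKQRQISFVKSHFSRQ": "TIM barrel",
    "GSHMRGSAALQWIQDN":       "beta propeller",
}

def predict_fold(new_seq: str, threshold: float = 0.3):
    """Return the fold of the most similar known sequence,
    if its similarity beats the threshold; otherwise None."""
    best_fold, best_score = None, threshold
    for known_seq, fold in KNOWN_FOLDS.items():
        score = SequenceMatcher(None, new_seq, known_seq).ratio()
        if score > best_score:
            best_fold, best_score = fold, score
    return best_fold

# A sequence one letter away from the first library entry:
print(predict_fold("MKTAYIAKQRQISFVKSHFSRK"))
```

The more structures the library holds, the more often a new gene lands close to something already solved; that is the whole bet of structural genomics.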

Posted by: woodforthetrees | February 7, 2010

Let’s help the CIA

On Christmas Day 2009, towards the evening, news began to trickle through of an attempt to blow up a plane bound for Detroit. Over the coming days more information surfaced, first of the quick thinking and bravery of the Dutch film-maker Jasper Schuringa, who managed to extinguish the flames apparently coming from the lap of a young Nigerian man named Umar Farouk Abdulmutallab.

But more astonishingly, we found out that the father of the alleged bomber had contacted the US embassy in Nigeria to report his son’s erratic and suspicious behaviour. As a parent, this struck me: how agonising it must have been to inform on the son you love, and how sure you’d have to be that lives were at stake before giving this information to a government that does not have an exemplary record for the fair treatment of suspected terrorists.

And yet this key information was lost or simply not deemed important: Umar Farouk Abdulmutallab was not added to the ‘No fly’ list and he duly boarded the plane from Amsterdam to Detroit.

I heard the news, considered it for a few minutes, then shrugged my shoulders and just got on with normal life.

Two weeks later I flew to Denver for a structural biology meeting. I waited patiently in the immigration queue with all the other non-US citizens and dutifully allowed the US government to take copies of my fingerprints and a photograph of me for their files. And it struck me: they just can’t see the wood for the trees.

The millions and millions of fingerprints, photos and emails from innocent people on file are helping to obscure the important data. So the unprecedented access the US government has – and the UK government is hot on its heels – to all sorts of private information is hindering, not helping.

Now ‘big data’ isn’t new to scientists. The human genome project was the watershed, and worldwide more than 1000 genomes have been sequenced. Structural genomics groups have solved the structures of thousands of proteins. Proteomics groups are churning out huge datasets too. As scientists, we’re certainly good at producing data. But how good are we at analysing and using the data?

The idea of this blog is to enlist your help in making sense of all this information. I’m the editor of the Structural Genomics Knowledgebase and also of the Signaling Gateway. I want to explain a bit about these projects, and I’d like your help to understand datasets I can’t get my head around. But first, a question for you.

Imagine you work for the CIA and are responsible for its database of suspected terrorists. You’ve just been passed a note saying that a Nigerian man has rung in to express concern that his son might be involved in terrorist activity. How would you make sure that this information doesn’t get lost among everything else? My guess is that you’d set up your database to weight the information. What would you do?
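For what it’s worth, here is one naive answer sketched in Python: score each tip by the credibility of its source and by how specific it is, so that a parent walking into an embassy outranks the millionth routine fingerprint scan. All the categories and weights here are invented for the example:

```python
# Naive tip-scoring sketch: weight information by source and specificity.
# Every category and weight below is invented for illustration.

SOURCE_WEIGHT = {
    "family_member_in_person": 10,  # a parent informing on a child: a rare, costly signal
    "intelligence_partner": 6,
    "anonymous_hotline": 2,
    "bulk_collection": 1,           # fingerprints, emails, travel records
}

def score_tip(source: str, names_a_person: bool, names_a_target: bool) -> int:
    """Higher scores should reach a human analyst sooner."""
    score = SOURCE_WEIGHT.get(source, 1)
    if names_a_person:
        score *= 3   # a specific individual, not a vague pattern
    if names_a_target:
        score *= 2   # a specific method or destination
    return score

tips = [
    ("bulk_collection", False, False),
    ("anonymous_hotline", True, False),
    ("family_member_in_person", True, False),
]
for tip in sorted(tips, key=lambda t: score_tip(*t), reverse=True):
    print(score_tip(*tip), tip[0])
```

A scheme like this surfaces the father’s phone call above the bulk data, which is exactly what failed to happen. But I’d love to hear better answers.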
