Posted by: woodforthetrees | March 30, 2010

How to build a bad biological database

Storing data is a simple task, isn’t it? Memory is relatively cheap and, after all, data wants to be free, doesn’t it? How hard can it be? Here are my ten tips for building a terrible database.

TIP ONE: Make submission difficult
Scientists are smart people so there’s no need to bother wasting time and money on usability issues. Eventually they’ll figure out how to get it right and at least some of the important information will be submitted. Who cares if everyone submits to a rival database, just because it’s easier to use?

TIP TWO: Have a support service that is available 9-5, Mon to Fri, GMT
After all, scientists are renowned for working 9-5 and the only science that matters is in Europe… isn’t it?

TIP THREE: Don’t let your file formats interconvert
Under no circumstances should data produced by one piece of equipment in a specific file format be converted into a common, searchable one, or even be readable without proprietary software. In particular, ignore the pioneering work of the Open Microscopy Environment: format standardisation is for wimps.

TIP FOUR: Keep your database independent
Stand out from the crowd by ensuring your data do not link to other databases. Who wants their data to be found via a sequence search on GenBank or through links from UniProt? Data wants to be free but it doesn’t necessarily want to be found.

TIP FIVE: Totally trust your automated systems
Books can be ordered on Amazon without any manual intervention, so why would it be needed for a database? Most of the well-known biological databases have curators who check submissions, ensuring that they are as complete and accurate as possible. What a waste of money – nobody minds incomplete datasets, missing experimental conditions and the like.

TIP SIX: Do not provide a permanent, unique identifier
The PDB uses identifiers (e.g. 1ubq) and the Gene Expression Omnibus uses accession numbers – as do many other databases – but this looks like another hassle you don’t need. We all need a good place to bury bad data.

TIP SEVEN: Make sure reviewers can’t see raw data
Don’t devise a simple way for journal reviewers to check data that is part of a paper going through peer review. Reviewers LOVE to receive emails with thousands of huge images attached.

TIP EIGHT: Include a 44-page getting started guide
Scientists have lots of spare time and are very keen to read through a 44-page quick-start guide to your database, because you’ve followed tip one and ensured that the database is very difficult to use. Even better, provide at least a 50-page guide for reviewers. The only people less busy than your submitters are the reviewers. It’s a well-known fact.

TIP NINE: If you include a search option, make sure it only works in UK English
Or in US English, but certainly not in both. People foolish enough to search for crystallization and not crystallisation don’t deserve to find anything in a database.

TIP TEN: Do not develop good visualisation tools
Scientists love data. Pages and pages and pages of it. Making it simple to see connections between different datasets would spoil the fun. Scientists love a challenge.

Posted by: woodforthetrees | March 24, 2010

Structural data falls under a woman’s influence

Helen Berman is my nomination for an outstanding female scientist for Ada Lovelace Day. Picture from Wikipedia.

It’s Ada Lovelace Day and I’m going to try something brave. I want to write about a woman in science whom I admire, but as I work with her on one of her projects it’s a bit daunting. I very much respect her and I would be delighted to achieve just a fraction of what she has. But she has a reputation for being, how shall I say this, exacting. She’s Helen M Berman and she’s the director of the RCSB PDB.

Her story is intimately connected with the history of the PDB. She was there, aged only 28, at the Cold Spring Harbor Symposium in 1971 when the PDB was born, and she has worked with the database ever since, becoming director of the RCSB PDB in 1998.

Before taking up the directorship, she worked primarily on nucleic acids and protein-nucleic acid complexes, but Helen increasingly thought that if the data from structural biology could be stored in a systematic way, then all sorts of data-mining opportunities would open up. This was more than 20 years ago: way, way before such topics became fashionable.

Helen’s organised approach and very high standards have made the PDB the highly regarded database it is today. I don’t know if she can program computers like Ada Lovelace, but she can certainly organise programmers. And she has combined these high standards and achievements with having a child and, for the US (life’s a bit different here in Europe), she is unusually sympathetic to the demands of juggling childcare and science.

One of the remarkable things about Helen is that her life has been devoted to service within science rather than, as some might call it, doing real science. By concentrating on the infrastructure her contribution, I believe, is much greater than if she had just run her own lab, even a very successful one.

A woman’s touch has made the PDB probably the most famous science database in the world.

Posted by: woodforthetrees | March 22, 2010

Making a PML map
I’ve been to a few conferences recently and I’ve witnessed a divide opening up between the scientists who use high-throughput methods and everybody else. This, I think, is partly because, although large datasets look impressive, we’re just not sure what they all mean yet. Some researchers have even said to me that the interest of some of the top journals in publishing large datasets is simply because they lead to good citations, and help the impact factor. A recent paper on the ‘PML interactome’, which I describe below, is a nice example of how assembling the data in one place gives a very good overview of the situation and provides some functional clues too.

The protein PML (shown in red) clusters in dots, known as PML nuclear bodies.

Mysterious bodies

On the left is a picture of a slice through a human cell with a protein called PML coloured in red. The notable thing is that PML clusters in dots, known as PML nuclear bodies. We’ve known about these nuclear bodies for 50 years but, to be honest, we’re still not totally sure what the point of them is, although we’ve got some good theories.

These nuclear bodies contain many other proteins too. For example, in the picture below the protein SUMO-1 is coloured in green and PML is in red. When the green of SUMO-1 and the red of PML overlap, they produce a yellow colour, indicating that they are in the same place in the cell.

The same cell as above has now got SUMO-1 protein coloured in green and PML in red. Where they overlap, they produce a yellow colour.
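This overlap test is straightforward to sketch numerically. Below is a toy version, using two tiny made-up intensity arrays in place of the red (PML) and green (SUMO-1) channels; a real colocalisation analysis would run proper statistics (Pearson or Manders coefficients) over the actual microscope images:

```python
import numpy as np

# Tiny made-up intensity maps standing in for the two channels;
# real data would be read from the microscope image files.
red = np.array([[0, 0, 9, 8],
                [0, 0, 7, 0],
                [5, 0, 0, 0],
                [0, 0, 0, 0]], dtype=float)    # PML channel
green = np.array([[0, 0, 9, 0],
                  [0, 0, 6, 0],
                  [0, 4, 0, 0],
                  [0, 0, 0, 0]], dtype=float)  # SUMO-1 channel

threshold = 1.0                     # pixels above this count as stained
red_mask = red > threshold
green_mask = green > threshold
overlap = red_mask & green_mask     # pixels that would appear yellow

# Fraction of PML-positive pixels that are also SUMO-1-positive
coloc_fraction = overlap.sum() / red_mask.sum()
print(f"{coloc_fraction:.2f}")      # 0.50
```

The threshold choice matters a great deal in practice, which is why automated thresholding methods are usually preferred over a hand-picked cut-off.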

During my PhD, I photographed many proteins overlapping with PML, and in the time that has elapsed since I finished my PhD, hundreds of papers have been published on these nuclear bodies.

Building a network map

Oddly enough, no one seems to have sat down and made a list of all the proteins known to be in these nuclear bodies – until now. A team from Belgium has produced the first ‘interactome’ – a map of the 166 proteins known to interact with PML in these nuclear bodies.

They used information from protein interaction databases and they carried out a large literature search. Using the software Cytoscape they produced a network map:

A map showing all the proteins in the PML nuclear body. From Int. J. Biol. Sci.
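For the curious, assembling such a map programmatically is simple to sketch. The fragment below uses the networkx library rather than Cytoscape itself, and the handful of interaction pairs is just an illustrative sample (the published network has 166 partners):

```python
import networkx as nx

# A small illustrative sample of PML interaction records;
# the published map contains 166 interaction partners.
interactions = [
    ("PML", "SUMO1"), ("PML", "SP100"), ("PML", "DAXX"),
    ("PML", "UBC9"), ("DAXX", "SUMO1"),
]

G = nx.Graph()
G.add_edges_from(interactions)

# PML is the hub: every listed partner connects to it
print(G.degree("PML"))               # 4
print(sorted(G.neighbors("PML")))    # ['DAXX', 'SP100', 'SUMO1', 'UBC9']
```

From a graph object like this, Cytoscape-compatible exports (e.g. GraphML) and layout algorithms are each a single function call away.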

But what can we actually learn from this network?

  • It highlights the 70 interactions that are available in the literature but have not been included in standard protein interaction databases.
  • All the proteins required to add SUMO to a protein, and to remove it again, are present in these bodies.
  • 38% of all the PML interaction partners have been reported to be modified by SUMO, and a database search by the authors suggests that this will rise to at least 56%.
  • It allows the proteins to be sorted on the basis of their function (based on UniProt function keywords).

Overwhelmingly, the function that dominates is transcriptional regulation. Other functions are also prominent, including apoptosis (cell death), viral infection and post-translational modification. These are not necessarily mutually exclusive functions, but it does give some interesting pointers to follow up.

Functions of the proteins in the PML nuclear body. From Int. J. Biol. Sci.

More than anything else, this network indicates the complexity of these bodies. At the moment, this network does not even capture the dynamic nature of these bodies: proteins move in and out of them all the time. The only one that is always there is PML. (And yet, odder still, mice that don’t have any PML still seem to be perfectly normal, so this doesn’t really give us a clue to their function.)

And what if you think there is a protein missing? The authors invite you to email them at ellen.vandamme‹at› and they’ll add the information as soon as possible. The network itself can be downloaded.

I think this is a really interesting and potentially very useful way of visualising these mysterious PML nuclear bodies.

Van Damme E, Laukens K, Dang TH & Van Ostade X (2010). A manually curated network of the PML nuclear body interactome reveals an important role for PML-NBs in SUMOylation dynamics. International Journal of Biological Sciences 6(1): 51-67. PMID: 20087442

Note: Intro added 23/3/10

Posted by: woodforthetrees | March 20, 2010

DNA night with the Guides

To celebrate National Science and Engineering Week, I ran a DNA night for 10th Harpenden Guides. (Girl Guides are known as Girl Scouts in other countries). I thought explaining DNA base-pairing to thirty 10-15 year-olds on a Friday evening might be hard work, but it wasn’t: the Guides thought it was a brilliant night – and so did I.

Most of the activities were inspired by Duncan Hull’s O’Really? blog, describing the European Bioinformatics Institute’s activities at the recent Cambridge Science Festival.

We started with a warm-up activity in which the girls drew portraits of themselves, their mum and their dad. From this we could talk about inheritance – that if both you and your mum have blue eyes, but your dad doesn’t, chances are you got your blue eyes from your mum.

Then we moved onto DNA origami. The best tip I can give you is to make sure that you do the lines as sharply as you can, otherwise it makes the folding difficult at the end. The result is pretty impressive and really shows the helical shape.

Next we did some DNA sequencing using beads. The forward direction for the sequence was quite simple; the reverse sequence was harder to explain, but it was worth persevering with because it made the next activity easier.

Midget gem DNA was the highlight of the evening. This was a variation on yummy gummy DNA owing to a lack of gummy bears! Midget gems worked out cheaper anyway. I was very impressed. We kept this until last because there is nothing like sweets to motivate young people, plus the two earlier activities laid the groundwork for this: the girls already understood how to pair the bases (or pair different colour gems in this case) and they knew the shape they were trying to twist it into at the end.

The whole thing lasted an hour and a half, and all the girls were really absorbed in the tasks.

Now… I need your protein structure activities for next year!

Posted by: woodforthetrees | March 16, 2010

Taming biology databases using widgets

Interrogating a biological database can be a bewildering and frustrating experience. There are so many of them, all using different search terms, interfaces and serving up different data types. If you want to correlate a protein’s structure with its interaction profile and with its cellular location, for example, you’re going to need to open a lot of windows on your computer.

Google Chrome might be able to deal with this browser window nightmare, but is it really the ideal solution? A team from the RCSB Protein Data Bank (PDB) has an alternative – and I think better – solution. They advocate the use of widgets.

What is a widget? The PDB guys describe it as a piece of computer code that can be embedded into a webpage to provide some of the function of the originating site. They are a bit like applets, but simpler. I’m pleased to say that the first example that the paper gives is from the Structural Genomics Knowledgebase and the widget code is here.

Structural Genomics Knowledgebase widget

We really developed it as a way of advertising and linking to the site from the other protein structure initiative (PSI) websites but it can be embedded in any site. The nice thing is that it automatically detects new articles, new structures and new features and takes you straight to them.

The PDB widget is more useful. Using it, you can compare two protein structures or two protein sequences from the PDB without going to the website.
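The paper doesn’t describe the widget’s comparison machinery in detail, but as a rough illustration of the simplest possible sequence comparison, here is a toy percent-identity function. Real tools align the sequences first, and the peptide fragments below are made up:

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Naive identity over aligned positions (no gaps).

    Assumes the sequences are already aligned to equal length;
    real comparison tools perform the alignment first.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

print(percent_identity("MQIFVKTL", "MQIFAKTL"))  # 87.5
```

Proper alignment (handling insertions and deletions) is what makes real sequence comparison hard, and is exactly the kind of work a widget lets someone else’s server do for you.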

So why do I like these widgets so much? I can see lots of advantages:

  • The current version is always used (so no out-of-date software to deal with)
  • The website’s content becomes accessible to a wider range of users
  • As a user, there is no need to maintain your own applications or databases – let someone else do all the work!
  • They are simple to use (no need to navigate a complicated website)
  • Eventually you’ll be able to develop your own desktop of widgets you routinely use, creating order out of chaos

There are some downsides, but they aren’t insurmountable:

  • Remote users might not count as visitors to the site – a disadvantage if you’re applying for grants or trying to attract advertising/sponsorship
  • I’m demonstrating one here – if you can’t edit your webpage, either because you haven’t got access to the code or because you don’t know how, then you can’t embed the widget (I think I can’t edit code for this wordpress-hosted site, but I may well be demonstrating the latter point)
  • Missing out on new features from the originating website that might have been helpful. By not exploring the original website, serendipity is lost

This isn’t “new” technology – we’re used to apps, widgets and gadgets all over the place – but these look to be the first examples of their use in biology. I think they offer great potential for developing our own personal workbench or desktop, bringing together all the applications and databases you routinely use in one place. Now, what I need is a very simple way to drag and drop them into one place…

Reference: Bourne PE, Beran B, Bi C, Bluhm W, Dunbrack R, et al. (2010) Will Widgets and Semantic Tagging Change Computational Biology? PLoS Comput Biol 6(2): e1000673. doi:10.1371/journal.pcbi.1000673

Posted by: woodforthetrees | March 9, 2010

Robert the Bruce and the membrane proteins

Membrane proteins are difficult

Like a spider spinning a web, obtaining membrane protein structures requires patience. Picture from Wikimedia.

Every Scottish schoolgirl and boy knows the story of Robert the Bruce, King Robert I of Scotland (1274–1329), and his many battles to free Scotland from the English. The most famous story, which even featured in our English reading books, is of Bruce hiding in a cave after being defeated in battle. He stayed there for several months feeling truly miserable. He thought about giving up and leaving Scotland.

While there, he saw a spider spinning a web at the entrance of the cave. The spider kept falling down, but climbed right back up again, time and time again, and eventually finished its web. Inspired by the spider, Bruce decided to fight again, telling his men: “If at first you don’t succeed, try, try and try again”. Eventually he defeated the English and was crowned King of Scotland.

What has this to do with membrane proteins? At a recent meeting two stars of the protein structure world — Ray Stevens and Wayne Hendrickson — talked about their experiences working with these proteins. As membrane proteins account for about 30% of the proteins in a cell and are major drug targets, it is very important to understand them properly. Neither talked about spiders and webs, but it was the image of Robert the Bruce that came straight to mind during their talks.

Hendrickson leads a structural genomics group, the New York Consortium on Membrane Protein Structure, and this consortium has put eight membrane proteins in the Protein Data Bank, four of which are unique. Not much, you might think, for a supposedly high-throughput centre with a budget of $15 million over 5 years.

But then he listed the real numbers. They have cloned 8630 targets, expressed 2669 proteins, solubilised 277, purified 603 proteins and solved the crystal structures of four unique targets. Quite a long way to go, then, to reach number 9343 on their target list.
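Those figures are easier to appreciate as percentages. A quick sketch, using the numbers as quoted in the talk:

```python
# Attrition down the structural genomics pipeline, using the
# figures quoted in Hendrickson's talk.
pipeline = {
    "cloned": 8630,
    "expressed": 2669,
    "solubilised": 277,
    "purified": 603,
    "unique structures solved": 4,
}

cloned = pipeline["cloned"]
for stage, n in pipeline.items():
    print(f"{stage:>24}: {n:5d}  ({100 * n / cloned:5.2f}% of cloned targets)")
```

The last line is the striking one: well under a tenth of a percent of cloned targets make it all the way to a unique structure.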

And despite the staggering numbers involved, a reviewer apparently (I haven’t seen the reports) had the audacity to suggest that the structures they have solved must be low-hanging fruit — in other words, that it must have been easy — simply because a structural genomics consortium was involved. In my view, considering the sheer numbers involved, this comment was completely unfair.

Hendrickson threw in a useful tip, though: the number one detergent for success in their hands was… beta-octylglucoside. This tidbit was received with the kind of attention you get by whispering “chocolate” to a nursery school class: the whole audience sat up and practically begged for more information.

Next up was Stevens, who is part of several high-throughput projects that aim to improve the crystallization of membrane proteins. He works on human G-protein coupled receptors and he’s one of the elite scientists in this field. He reeled off the names of receptors whose structures his lab had solved and mentioned a couple they are finishing off. Stevens is beginning to make obtaining membrane protein structures look easy. But then he added the killer line: he’d been working in this area for 22 years.

If there is one thing that I learnt at that meeting, it’s that membrane protein structures are darned difficult. It’s a case of try, try and try again. And then again. For 8630 times or 22 years, whichever comes first.

Posted by: woodforthetrees | March 4, 2010

UK science: the people are our future

Attracting the best of the world to the UK

The UK needs to attract the very best scientists from around the world.

First the good news: the United Kingdom is second only to the United States for research productivity. We produce 13% of the most cited papers and 8% of the world’s publications. We lead the world in biological and social science and are very good at biomedical research.

Now the bad news: the cost of one chemist or one engineer in the US — and presumably in the UK — is the same as five chemists in China or 11 engineers in India. This week alone, 1,200 redundancies in research and development have been announced at the pharmaceutical company AstraZeneca’s Leicestershire site.

In the UK we are facing pressure to cut budgets, yet China and India are investing strongly in science and engineering. How on earth are we going to compete — or even survive — in this brave new world?

The Council for Science and Technology presents a stark choice in its report A Vision for UK Research (pdf): ‘managed or neglected decline’, or investment in excellence and better development of the products of research.

The report looks at the strength of our research base (it’s very good, but there’s still room for improvement) and our ability to convert this into economic and societal benefits (the results are patchy, but can probably be summed up as not terribly good). In the new terminology of the report, our research base is upstream research and the commercialisation is downstream research.

There are some very important points in this report. The first is that translational, or downstream, research should not come at the cost of a strong research base: it’s just impossible to predict where the next drug, green technology or engineering advance will come from and we limit our options too much by second-guessing the future.

Two examples come to mind here — one is of probably the most famous anti-cancer drug, Herceptin, which targets a protein kinase. Protein kinases were studied in labs for 25 years before anyone showed any commercial interest in them.

The other is the Human Genome Project. Many people argued at the outset that there was no point sequencing the non-coding DNA (sometimes known as junk DNA) and that everyone should concentrate only on the DNA that codes for protein. This would have been a mistake: we now know that non-coding DNA contains important regulatory elements, which can be indicative of cancer predisposition, yet we could have missed this if we had concentrated on what was thought to be useful at the time.

The second important point from the report is the emphasis on people rather than specific research areas. Although government should have a high-level role in maintaining a broad base, the studies that are funded should be decided on the strength of the scientist, not so much the project.

This means attracting the best PhD students and the best researchers. We need to encourage outstanding scientists to establish themselves in the UK, and we need to build connections with emerging countries, in the way that Germany and France have begun to do.

And universities will have to think about how attractive they are to the brightest foreign students: there’s a tendency to think of foreign students only in terms of their cash value — they pay higher fees than home students — meaning that exceptional students from poorer backgrounds are attracted to the US or other countries where a range of scholarships exist. Instead, the CST proposes a national scholarship scheme.

More than anything else, it’s people who are key to taking advantage of our upstream research base. There needs to be more interaction between industry and researchers in universities and institutes, which can be achieved partly by encouraging the best to move freely between these structures, without the lack of a publication record preventing people from returning to academia from industry. What works is when industry is in partnership with academia.

But the most important factor is the role of the government in creating a stable environment that attracts business investment. The report suggests it can do this by setting long-term objectives, by investing in the national infrastructure and by being a lead user. Particularly, the report refers to the expected transition to a sustainable low carbon economy and the government’s role in assisting the development of the technologies we will need.

We need to up our game if we are to capitalise on our exceptional research base, and it will rely on a partnership between academia, industry and the government, brokered by exceptional people. It’s exciting to see such a vision, and to note that by looking outwards the UK economy will benefit. For British postgraduate students and researchers it might make less comforting reading: the economy will thrive by attracting the best overseas researchers here, but our own home-grown researchers will only survive if they too are the best in the world.

Posted by: woodforthetrees | February 25, 2010

Breaking Nature’s NMR barrier

NMR 800 MHz

This is a fantastic time to be an NMR spectroscopist. Image via Wikimedia.

I’m glad to see a letter in today’s Nature commenting on the ‘overly pessimistic’ news feature on 3 February about the first gigahertz nuclear magnetic resonance spectrometer. What should have been a cause for celebration was ruined by a misunderstanding, perhaps even a misrepresentation, of the power and versatility of NMR.

Far from ‘attracting new life to NMR spectroscopy’ — suggesting that the field was on its knees — this investment shows how vibrant this area of research is, at a point when it is poised to make a major contribution to systems biology.

The article describes the world’s most powerful NMR machine, which is now up and running at the European Centre for High Field NMR (CRMN) in Lyon, France. At 1 GHz, its magnet is 50 MHz stronger than its nearest rival, and it’s true to say that this is a small increase relative to the history of NMR spectrometers, but that misses the point. The question that should have been asked is how powerful are the other machines within that region of France, even within Europe, and what breakthroughs can we expect at this strength?

The Nature news feature mentions that CRMN already has a 900 MHz instrument. But that doesn’t diminish the need for a stronger magnet. We have barely scratched the surface with regard to structures of membrane proteins, whether they are obtained by crystallography or NMR, and yet these membrane proteins are extremely important for drug research and potential therapy. It’s highly likely that this new machine will begin to make inroads here. But the machine itself can’t solve structures: there are groups around the globe developing techniques that will help produce the structures of larger proteins – look at the paper Science just published.

The worst aspect of this feature is its narrow focus. Yes, NMR works well for small-ish proteins, but the important point is that it offers us opportunities that other techniques don’t. First, it can often produce structures where crystallography has failed. Second, the structure is obtained in solution, allowing us to see a dynamic picture. Third, it’s the only technique that allows us to detect weak, transient interactions, which will be vital for building up a picture of protein-protein and protein-ligand interactions for systems biology and beyond. And fourth, the first in-cell NMR structures were reported about 6 months ago – this will revolutionise our understanding of structures within the cell.

NMR protein spectroscopy is well and truly alive and kicking. So smile, Lyndon Emsley – for goodness sake! You’ve just installed the world’s most powerful NMR machine, worth $16.3 million. In the news feature, you look like you’ve just lost the winning Euro-millions lottery ticket.

Posted by: woodforthetrees | February 24, 2010

Web 2.0 meets membrane proteins


The NIH Roadmap Meeting on membrane proteins promises high-risk interactive IT

The third Annual NIH Roadmap Meeting on membrane protein technologies looks very interesting. The past two have attracted an all-star line-up of scientists, just look at the 2009 one, and I would expect the same big names to be there this time.

Solving the structure of membrane proteins is incredibly difficult and the field is very competitive, with most structures going straight into Science or Nature. If a collaborative, sharing environment can be nurtured in this arena, then there is hope for every field.

Two attractive developments for this year caught my eye. The first is that there will be a hands-on workshop for a technique called LCP (lipidic cubic phase), which produces a membrane-like environment in which some proteins will crystallise. This method can be a devil to perform, with researchers comparing the texture of the starting mixture to toothpaste – try dispensing that with precision through a syringe! Better still, if you take along a protein sample, you might be able to persuade someone else to do it for you.

The other aspect that intrigues me is this notice:

In the spirit of high risk experiments, we will be trying to combine additional IT technologies into the meeting to help create a highly dynamic and interactive setting. As a reminder, the goal of this meeting series is for a very open and shared environment where unpublished and creative technology ideas should be communicated by all attending participants.

I’d love to know more. Sounds like a date for the diary.

Posted by: woodforthetrees | February 22, 2010

A tale of two wikis

The NMR spectroscopy community is near to the tipping point for Web 2.0. Image via Wikimedia.

A common lament among bloggers and other enthusiastic adopters of Web 2.0 technology is the lack of mainstream uptake of these tools by active scientists. A recent report from the University of California, Berkeley confirmed this reluctance to embrace new forms of sharing information.

Yet two wikis focused on NMR spectroscopy have, between them, registered 5% of the magnetic resonance community as users – and probably many more are casual viewers. It is my belief that Web 2.0 thrives where journals don’t, and that the NMR community might be the first to reach the tipping point, where your career is harmed by not contributing.

The first, the NESG wiki, became publicly available a month or so ago; I’ve written a short news piece here and the wiki itself is here. It was developed by the Northeast Structural Genomics Consortium (NESG), a centre working on high-throughput methods for solving protein structures. They provide details of their protocols, right down to buffer conditions and how to get your protein sample into the NMR tubes without losing it. It isn’t aimed at complete beginners, and certainly not at the public, but the information they have is clear and well written; even though I have limited hands-on NMR experience, I can follow it.

NESG wiki

This wiki came into existence because it met a need. NESG is composed of several groups, and the NMR side of things is made up of the labs of Gaetano Montelione, Cheryl Arrowsmith, Mark Girvin, Michael Kennedy, John Markley, Robert Powers, James Prestegard and Thomas Szyperski. These labs are spread across the United States and Canada and so a way of pooling all the information was required. For most of its life, this was a private working wiki just for these groups. Now that it is open to the public, it will be fascinating to see how this project evolves and deals with ‘outsiders’. Currently it has around 50 registered users.

The second wiki is more established: the NMR wiki. It has a wider reach and covers the whole of magnetic resonance, not just NMR. It’s an excellent place for inspiration if you are teaching NMR to postgraduates or undergraduates: it has slides, lectures, worksheets and quizzes. In addition, you are encouraged to upload your PhD or Masters thesis and share your pulse sequences and software. It also advertises jobs and conferences, and it has an active question and answer section. Actually, I like what they do with this Q&A section so much that I’ll write about it a bit more another day; for now I’ll just say that one of the nice things about the way they have developed it is that you get a feel for how many people are looking at the section, and you can grade the answers.

NMR wiki

The NMR wiki is a real treasure-trove of magnetic resonance science, and if I went back into the lab tomorrow, it would be my life-line. It has 430 registered users, which the Director of the BioMolecular Spectroscopy Facility at the University of California Irvine, Evgeny Fadeev, estimates to be 4% of the magnetic resonance community.

There is some overlap between the two sites, but it’s worth looking at both because they provide different information: one focuses on high-throughput; the other on more traditional lab environments. But the significant thing isn’t the medium so much as the content: most of the information being freely shared is not publishable by academic journals. Though Nature Protocols, and to a lesser extent CSH Protocols, have an interest in this general area, the important nitty-gritty details of how to perform NMR spectroscopy don’t make a compelling story. So these wikis didn’t grow because of publishers but specifically because of their absence. Equally, academic journals do not take much interest in facilitating the sharing of teaching material for this field.

Another factor worth considering is that the magnetic resonance community is relatively small, meaning that many researchers will know each other and there will be strong interconnections between groups. With some of the leading names in NMR spectroscopy now using wikis, how close are magnetic resonance researchers to the tipping point?
