On Christmas Day 2009, towards the evening, news began to trickle through of an attempt to blow up a plane bound for Detroit. Over the coming days more information surfaced; firstly of the quick thinking and bravery of the Dutch film-maker Jasper Schuringa, who managed to extinguish the flames apparently coming from the lap of a young Nigerian man named Umar Farouk Abdulmutallab.
But more astonishingly, we found out that the father of the alleged bomber had contacted the US embassy in Nigeria to report his son’s erratic and suspicious behaviour. As a parent this struck me: how agonising it must have been to inform on the son you love, how sure you’d have to be that lives were at stake to give this information to a government that does not have an exemplary record for fair treatment of suspected terrorists.
And yet this key information was lost or simply not deemed important: Umar Farouk Abdulmutallab was not added to the ‘No fly’ list and he duly boarded the plane from Amsterdam to Detroit.
I heard the news, considered it for a few minutes, then shrugged my shoulders and just got on with normal life.
Two weeks later I flew to Denver for a structural biology meeting. I waited patiently in the immigration queue with all the other non-US citizens and I dutifully allowed the US government to take copies of my fingerprints and a photograph of me for their files. And it struck me: they just can’t see the wood for the trees.
The millions and millions of fingerprints, photos and emails from innocent people on file are helping to obscure important data. So the unprecedented access the US government has (and the UK government is hot on its heels) to all sorts of private information is hindering, not helping.
Now ‘big data’ isn’t new to scientists. The human genome project was the watershed, and worldwide more than 1000 genomes have been sequenced. Structural genomics groups have solved the structures of thousands of proteins. Proteomics groups are churning out huge datasets too. As scientists, we’re certainly good at producing data. But how good are we at analysing and using the data?
The idea of this blog is to enlist your help in making sense of all this information. I’m the editor of the Structural Genomics Knowledgebase and also of the Signaling Gateway. I want to explain a bit about these projects, and I’d like your help to understand datasets I can’t get my head around. But first, a question for you.
Imagine you work for the CIA and are responsible for their database about suspected terrorists. You’ve just been passed a note saying that a Nigerian man has rung in expressing his concerns that his son might be involved in terrorist activity. How would you make sure that this note doesn’t get lost amongst everything else on file? My guess is, you’d set up your database to weight your information. What would you do?
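To make my guess a little more concrete, here is a minimal sketch in Python of what "weighting your information" might look like. Everything in it is an assumption made up for illustration: the `Record` fields, the source categories and the numbers in `SOURCE_WEIGHT` are hypothetical and do not describe any real system. The only point is that a specific report from someone close to the subject should rise above routinely collected records rather than sit alongside them.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only for illustration.
SOURCE_WEIGHT = {
    "close_relative": 5.0,       # a report from a family member
    "official_report": 3.0,      # a report filed by an official
    "routine_collection": 0.1,   # fingerprints, photos and the like
}


@dataclass
class Record:
    subject: str
    source: str     # key into SOURCE_WEIGHT
    specific: bool  # does the record raise a concrete concern?


def weight(record: Record) -> float:
    """Score a record: source weight, doubled if the concern is specific."""
    base = SOURCE_WEIGHT.get(record.source, 1.0)  # unknown sources get 1.0
    return base * (2.0 if record.specific else 1.0)


def rank(records):
    """Return the records sorted from highest weight to lowest."""
    return sorted(records, key=weight, reverse=True)


records = [
    Record("traveller A", "routine_collection", False),
    Record("traveller B", "routine_collection", False),
    Record("subject C", "close_relative", True),
]

top = rank(records)[0]
print(top.subject, weight(top))
```

With these made-up numbers the relative's report comes out on top, however many routine records are piled in around it. Choosing the weights sensibly is of course the hard part, and that is exactly where I'd like your ideas.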