Monthly Archives: January 2010

Cracking the Genomics Code

This week I was in San Diego attending the International Plant and Animal Genome Conference. PAG is a conference that brings together academic and commercial researchers and product vendors, with a particular emphasis on agricultural applications. (One of the first signs I saw when entering the lobby directed attendees to a “Sheep and Cattle” workshop. I wondered who would be cleaning the hotel carpets.)

In an earlier post, Please pass the dot plots, I described how Greg Edvenson at Pico Computing used an FPGA cluster and C-to-FPGA methods to demonstrate acceleration of a DNA sequence comparison algorithm. The quick success of that project was reason enough for us to attend PAG and learn more about the computing problems in genomics. Where are acceleration solutions needed?
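For readers who have not met a dot plot before, here is a minimal Python sketch of the idea. It is not Pico's implementation; the function names and the `window` parameter are illustrative assumptions. It marks every position where a short window of one sequence exactly matches a window of the other, so similar regions show up as diagonal runs.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return the set of (i, j) positions where the window starting at
    seq_a[i] exactly matches the window starting at seq_b[j]."""
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the hits as text: one row per position in seq_a."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in hits else "."
                            for j in range(len(seq_b))))
    return "\n".join(rows)


hits = dot_plot("GATTACA", "GATTACA", window=3)
print(render("GATTACA", "GATTACA", hits))
```

Every cell of the plot is computed independently of the others, so the two nested loops can be spread across parallel hardware.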

It’s clear there are problems aplenty to be solved. As one researcher said to us, “The amount of raw data being generated by DNA sequencers each month is outpacing Moore’s Law by a wide margin.” He went on to describe how his group routinely undocks and hand-carries their hard drives down the hall because the time required to move the generated sequencing data across their network is too long. Solutions are needed for accelerating data storage throughput, and for the actual computations to do such things as assemble whole genomes from the small chunks of scrambled DNA that currently emerge from sequencing machines.

Why all the data? The human genome is about 2.91 billion base pairs in length*, and it’s not the longest genome out there, not even close. We have more base pairs than a pufferfish (365 million base pairs) but far fewer than a lungfish (130 billion base pairs).

Evolution is a curious crucible.

Sequencing technologies have advanced quickly. Machines and software offered by Illumina, Life Technologies, Roche and others can generate enormous amounts of genetic data. The bottleneck at present is in assembling all that data – like a billion-part jigsaw puzzle thrown to the floor – into a meaningful, searchable DNA sequence. The methods of doing this assembly, using assemblers such as ABySS and Velvet, may require parallelizing the problem across many CPUs, and using large amounts of intermediate memory – potentially terabytes of it.
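ABySS and Velvet are both de Bruijn graph assemblers. The toy Python sketch below shows the core idea only, under heavy simplifying assumptions: error-free reads, a single strand, and no repeated regions. The function names and the example reads are made up for illustration and are not taken from either tool.

```python
from collections import defaultdict


def kmers(read, k):
    """All overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:          # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping reads taken from the made-up sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(walk_unambiguous(graph, "ATG"))  # prints ATGGCGTGCA
```

The graph holds an entry for every distinct k-mer seen in the reads, which hints at why memory use climbs so steeply for genomes with billions of base pairs.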

If you are a researcher trying to figure out, for example, how to increase crop yields in sub-Saharan Africa, then you might be very interested in knowing how to breed a more pest-resistant and productive variety of barley (5 billion base pairs) or wheat (over 16 billion base pairs).

And if you’re DuPont or Monsanto, you may want to actually create and patent such a grain to have a competitive advantage.

To figure out such things, you may want to perform sequence comparisons of other species that appear to have the characteristics you are interested in, and find the relevant genetic variations. You won’t have a chance of doing this unless you can sequence many varieties and perform detailed analysis of what you see. This takes lots of computing time and bags of money.

And so the genomics industry looks for faster solutions for cracking the codes of life. The solutions involve cluster and cloud computing, GPUs and FPGAs, and perhaps exotic hybrid computing platforms to come.


*A “base pair” is two complementary nucleotides in a DNA strand, connected by hydrogen bonds. There are four kinds of nucleotides that make up these base pairs: adenine, thymine, guanine and cytosine. In the human genome only a small fraction of these base pairs actually represent genes. It seems our bodies are mostly “junk DNA”, perhaps proving that we are what we eat.


Filed under Reconshmiguration

Time to throw away our GSM phones?

The mobile phone industry is in full PR battle mode this week with the news that a computer scientist has successfully cracked the A5/1 encryption code that secures GSM mobile phone calls. In theory this means that anyone having access to appropriate snooping hardware and software, estimated by the researcher to cost under $30,000, can listen in on GSM phone calls by intercepting and decoding radio signals.

Last week at the Chaos Communication Congress in Berlin, Dr. Karsten Nohl announced that his team, a group of hackers working collaboratively to create a distributed computing cluster, had cracked the encryption code by creating an enormous, 2-terabyte “rainbow table” of hash values. In simple terms, the rainbow table provides a cracking program with a reverse-lookup scheme that can quickly decrypt the wireless voice data.
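To make the reverse-lookup idea concrete, here is a toy rainbow table in Python. It is a sketch under loud assumptions: SHA-256 stands in for the one-way function (this is not A5/1), the key space is just the integers 0 to 9999, and every name and constant is invented for illustration. Nothing here reflects the actual code or table layout used by Nohl's project.

```python
import hashlib

KEY_SPACE = 10_000    # toy key space: the integers 0..9999 (assumption)
CHAIN_LENGTH = 50     # hash/reduce steps per chain (assumption)


def toy_hash(key):
    """Stand-in one-way function (SHA-256); this is NOT the A5/1 cipher."""
    return hashlib.sha256(str(key).encode()).digest()


def reduce_to_key(digest, column):
    """Map a digest back into the key space. Varying the mapping by column
    is what distinguishes a rainbow table from a plain hash chain."""
    return (int.from_bytes(digest[:8], "big") + column) % KEY_SPACE


def chain_end(key, from_column=0):
    """Run the hash/reduce steps from from_column to the end of the chain."""
    for column in range(from_column, CHAIN_LENGTH):
        key = reduce_to_key(toy_hash(key), column)
    return key


def build_table(start_keys):
    """Store only each chain's end point and the start point(s) reaching it."""
    table = {}
    for start in start_keys:
        table.setdefault(chain_end(start), []).append(start)
    return table


def lookup(table, target):
    """Return a key whose toy_hash equals target, or None if not covered."""
    for column in range(CHAIN_LENGTH - 1, -1, -1):
        # Guess that target sits at this column and finish the chain from there.
        end = chain_end(reduce_to_key(target, column), column + 1)
        for start in table.get(end, []):
            # Replay the stored chain from its start to find the matching key.
            key = start
            for c in range(CHAIN_LENGTH):
                digest = toy_hash(key)
                if digest == target:
                    return key
                key = reduce_to_key(digest, c)
    return None


table = build_table(range(0, KEY_SPACE, 25))  # 400 chains of 50 steps each
print(lookup(table, toy_hash(0)))
```

Building the table is the expensive one-time job; each lookup then costs only a modest number of hash steps. The table stores just the chain ends and starts, trading storage for recomputation, and a key that falls outside every chain is simply not found.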

I’ll leave aside any prediction of who might want to use this kind of cracking technology, and where they might want to do it. In the United States GSM is used for only a fraction of communications, most notably by AT&T and T-Mobile.

GSM dominates worldwide, however, carrying the overwhelming majority of phone calls. (And if you are an iPhone user like I am, you should know that AT&T most probably sends your voice via the 2G GSM standard using A5/1 encryption, even though you are paying for presumably more secure 3G service. And if you think your iPhone data is secure… read this.)

From a computing perspective, what’s interesting about this project is that it required two types of computational acceleration. The first computing problem was the creation of the rainbow tables. This only needed to be done one time, but represented a massive computing problem. Nohl estimated that generating these tables on a single traditional PC or server would have taken many years. To make this problem practical, Nohl and his collaborators set up a distributed computing system similar to the SETI@home project, in which the spare computing cycles from many different computers on the Internet were harnessed to calculate the needed tables. In some of the computers GPUs were also used to accelerate the problem, which was completed in three months of calendar time.

The second computing problem occurs at the point of decryption, in whatever server or laptop PC is being used to snoop and crack the wireless signal. That problem is also computationally intensive, but with ready access to the 2-terabyte rainbow tables the crack can be performed in minutes, or seconds if GPU and/or FPGA accelerators are added into the mix.

During his talk, Nohl stated that a person (or agency?) wanting to eavesdrop on GSM calls would currently need to spend around $100,000 on hardware in order to crack an A5/1 encrypted call in one second or less. And the hardware to use? A cluster of 64 or more FPGAs. For less money and slower cracking times (still under a minute, and under $30,000) a smaller number of FPGAs or GPUs would do the job just fine.

Slides from Nohl’s talk are here.


Filed under Uncategorized