Tag Archives: fpga

Partial reconfiguration – it’s about time

Altera today announced that its next-generation, 28nm devices will support partial reconfiguration. This means that Altera FPGA users will now be able to update small portions of an FPGA device, rather than having to re-synthesize, re-map and reprogram the entire device every time the application changes.

Xilinx has supported this feature for quite a few years, but has never made it a mainstream, supported feature of their ISE Design Suite development tools. However, in a recent email update to its customers, Xilinx announced with almost no fanfare that, in version 12 of the Xilinx tools, this important capability would finally be given its proper status as a supported feature. Perhaps the marketing department at Xilinx heard rumors coming from the other side of San Jose?

Corporate intrigue aside, this is an important piece of news. Why? Because software application developers first investigating FPGAs are surprised – even shocked – to learn how long the iteration times are for programming and debugging FPGA applications. And the larger the FPGA is, the worse the problem becomes.

For most software developers, being faced with the equivalent of a two, three, or even eight hour iteration time for compile-link-test is completely unacceptable, no matter how much potential increase in performance there may be.

And if, at the end of that long iteration time, all the programmer has is a non-working, difficult-to-debug bitstream and a thousand-line report filled with hardware-esque warning messages? Then forget it. They might as well go use GPUs and CUDA.

So, why has it taken so long for this seemingly obvious feature to become mainstream? One can only assume that Xilinx and Altera management, their chip architects, and perhaps even their tools developers are hardware engineers first, and software engineers second. Perhaps they can only think of their devices as a poor man’s ASIC, and of their tools as a poor man’s EDA.

Reliable, vendor-supported partial reconfiguration, including dynamic run-time reconfiguration, has the potential to solve so many problems for software application developers, and to broaden the market for FPGAs.

Consider:

Partial reconfiguration allows developers to iteratively recompile, re-synthesize, re-map and re-test a specific portion of the FPGA in just a few minutes, rather than a few hours. This in itself is a huge benefit. Instead of having to set up elaborate, hardware-oriented simulations to verify correct behavior before hitting that compile button, you can just try it out, again and again, in the same way you would when developing software. Indeed, this method of design was predominant for FPGAs in the 1980s and early 1990s. But as device densities have grown, iterative design-and-test methods have become impractical. That was good for the hardware simulator business, but bad for software developers.

Run-time reconfiguration allows certain parts of the FPGA to remain intact and functioning, for example the I/O processing, while other parts are being updated on the fly. Need to change the filtering of a video signal to respond to a change in resolution? Wham, it’s done, and so fast the viewer doesn’t even see a flicker. Want to change the FPGA’s function without causing a system reboot? How about leaving the PCIe endpoint alone and just changing the core algorithm?

Dynamic reconfiguration can increase the effective size of an FPGA. Loading partial bitstreams at run-time allows the capacity of these devices to grow in the dimension of time. If you don’t need all the hardware capabilities you have programmed into your FPGA all the time, then use dynamic reconfiguration to swap hardware modules in and out as needed (a minimal sketch of such a run-time swap follows this list). Use reconfiguration to do more, with less.

Partial reconfiguration opens up new FPGA markets. Using partial reconfiguration, the vendors of FPGA-based platforms for specific types of applications – for financial transaction processing and automated trading, for example – could provide a “minimally programmable embedded system” in which most of the FPGA logic is pre-designed, pre-optimized and locked down, while a small portion in the middle is left available for end-user customization. This potentially opens up whole new application domains that were previously not available for FPGAs; applications in which the platform vendor and the end user both have unique domain knowledge, and have their own critical IP to protect.
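To make the run-time swap concrete, here is a minimal sketch, in C, of host software loading a partial bitstream into a reconfigurable region while the static logic (the PCIe endpoint, the I/O processing) keeps running. Everything specific in it is an assumption made for illustration: the /dev/fpga_config device node, the file name, and the notion of exposing the configuration port as a simple character device. Real flows go through vendor-specific configuration primitives and drivers.

```c
/*
 * Minimal sketch: swap one partial bitstream into a reconfigurable region
 * at run time while the static logic keeps running. The device node name
 * and the file name are hypothetical; real designs use vendor-specific
 * configuration ports and drivers.
 */
#include <stdio.h>
#include <stdlib.h>

static int load_partial_bitstream(const char *bitfile, const char *config_dev)
{
    FILE *in  = fopen(bitfile, "rb");      /* partial bitstream from the tools */
    FILE *out = fopen(config_dev, "wb");   /* hypothetical config-port device  */
    char buf[4096];
    size_t n;
    int rc = 0;

    if (!in || !out)
        rc = -1;

    /* Stream the partial bitstream into the configuration port. Only the
     * reconfigurable region changes; the static logic is untouched. */
    while (rc == 0 && (n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            rc = -1;
    }

    if (in)  fclose(in);
    if (out) fclose(out);
    return rc;
}

int main(void)
{
    /* Swap the core algorithm (say, a different video filter) on the fly. */
    if (load_partial_bitstream("filter_1080p.partial.bit", "/dev/fpga_config") != 0) {
        fprintf(stderr, "partial reconfiguration failed\n");
        return EXIT_FAILURE;
    }
    puts("new filter loaded; the static I/O logic never stopped running");
    return EXIT_SUCCESS;
}
```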

Again, these capabilities are not really new; the Xilinx partial reconfiguration features have been used successfully, for years, in domains such as software-defined radio. What’s changing is that partial reconfiguration is finally becoming officially supported. With uncertainties about its future and its supportability reduced, software and platform vendors will now begin using these features to build new programming, debugging and operating capabilities into their own products.

It’s about time.

3 Comments

Filed under News Shmews, Reconshmiguration

Time to throw away our GSM phones?

The mobile phone industry is in full PR battle mode this week with the news that a computer scientist has successfully cracked A5/1, the encryption algorithm that secures GSM mobile phone calls. In theory this means that anyone with access to appropriate snooping hardware and software, estimated by the researcher to cost under $30,000, can listen in on GSM phone calls by intercepting and decoding the radio signals.

Last week at the Chaos Communication Congress in Berlin, Dr. Karsten Nohl announced that his team, a group of hackers working collaboratively to create a distributed computing cluster, had cracked the encryption by creating an enormous, 2-terabyte “rainbow table” of hash values. In simple terms, the rainbow table provides a cracking program with a reverse-lookup scheme that can quickly decrypt the wireless voice data.
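For readers curious about why the precomputation matters, here is a toy sketch in C of the reverse-lookup idea. To be clear about what it is not: this is not A5/1, and it is not a real rainbow table (real tables use long hash chains and reduction functions to squeeze an enormous key space into a couple of terabytes). The 16-bit key space and the toy_keystream function are invented purely for illustration. What it does show is the time-memory trade-off: a one-time, expensive precomputation turns key recovery at interception time into a fast table lookup.

```c
/*
 * Toy illustration of a precomputed reverse-lookup attack. Not A5/1 and
 * not a real rainbow table; it only shows how a one-time precomputation
 * turns "try every key" into "look it up" at interception time.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define KEYSPACE 65536u            /* toy 16-bit key space */

struct entry { uint32_t hash, key; };
static struct entry table[KEYSPACE];

/* Stand-in for the keystream a snooper can observe for a given key. */
static uint32_t toy_keystream(uint32_t key)
{
    uint32_t h = key * 2654435761u;
    return h ^ (h >> 13);
}

static int by_hash(const void *a, const void *b)
{
    uint32_t ha = ((const struct entry *)a)->hash;
    uint32_t hb = ((const struct entry *)b)->hash;
    return (ha > hb) - (ha < hb);
}

int main(void)
{
    /* One-time precomputation: months of GPU/cluster time in the real attack. */
    for (uint32_t k = 0; k < KEYSPACE; k++)
        table[k] = (struct entry){ toy_keystream(k), k };
    qsort(table, KEYSPACE, sizeof table[0], by_hash);

    /* Interception time: recover the key from the observed output by lookup. */
    uint32_t observed = toy_keystream(0xBEEF);
    struct entry probe = { observed, 0 };
    struct entry *hit = bsearch(&probe, table, KEYSPACE, sizeof table[0], by_hash);

    if (hit)
        printf("recovered key: 0x%04X\n", hit->key);
    return 0;
}
```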

I’ll leave aside any prediction of who might want to use this kind of cracking technology, and where they might want to do it. In the United States GSM is used for only a fraction of communications, most notably by AT&T and T-Mobile.

GSM dominates worldwide, however, carrying the overwhelming majority of phone calls. (And if you are an iPhone user like I am, you should know that AT&T most probably sends your voice via the 2G GSM standard using A5/1 encryption, even though you are paying for presumably more secure 3G service. And if you think your iPhone data is secure… read this.)

From a computing perspective, what’s interesting about this project is that it required two types of computational acceleration. The first computing problem was the creation of the rainbow tables. This only needed to be done once, but represented a massive computing problem: Nohl estimated that generating these tables on a single traditional PC or server would have required many years. To make the problem practical, Nohl and his collaborators set up a distributed computing system, similar to the SETI@Home project, in which the spare computing cycles of many different computers on the Internet were harnessed to calculate the needed tables. In some of the computers, GPUs were also used to accelerate the work, which was completed in three months of calendar time.

The second computing problem occurs at the point of decryption, in whatever server or laptop PC is being used to snoop and crack the wireless signal. That problem is also computationally intensive, but with ready access to the 2-terabyte rainbow tables the crack can be performed in minutes, or in seconds if GPU and/or FPGA accelerators are added to the mix.

During his talk, Nohl stated that a person (or agency?) wanting to eavesdrop on GSM calls would currently need to spend around $100,000 on hardware in order to crack an A5/1 encrypted call in one second or less. And the hardware to use? A cluster of 64 or more FPGAs. For less money and slower cracking times (still under a minute, and under $30,000) a smaller number of FPGAs or GPUs would do the job just fine.

Slides from Nohl’s talk are here.

1 Comment

Filed under Uncategorized

Please pass the dot plots

In the past year there has been an increased level of skepticism regarding FPGAs as computing devices. Large amounts of ink have been spilled regarding the emergence of GPUs as general-purpose computing platforms. NVIDIA’s Tesla is racking up high benchmark scores in domains that include computational finance, scientific computing, geophysics and many others.

Nonetheless, there are certain domains in which FPGAs are clear winners over GPUs, particularly when power consumption is factored into the results. Two of these domains are cryptanalysis (code cracking) and bioinformatics.

The CHREC group at the University of Florida, and Pico Computing of Seattle, have both recently announced benchmark results for DNA sequencing algorithms. Both groups used FPGA clusters to perform massively parallel computations and to accelerate the comparing and scoring of DNA base pairs by orders of magnitude.

The Florida group, led by Dr. Alan George, implemented the Smith-Waterman sequence alignment algorithm on a cluster of 96 high-capacity Altera FPGA devices, using PCI Express FPGA cards supplied by GiDEL.

The Novo-G cluster used for this project consists of 16 Linux servers, each housing a quad-FPGA accelerator board from GiDEL. According to the CHREC team, Novo-G’s performance was compared with an optimized software implementation executed on a single 64-bit, 2.4GHz AMD Opteron core. A speedup of 40,849X was observed. The implication is that a bioinformatics calculation that would take days to run on a single desktop workstation or server would require just seconds to complete using the Novo-G FPGA cluster.
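For those who haven’t met it, the Smith-Waterman recurrence at the heart of both projects is compact enough to sketch. The version below is the generic textbook form in C, with a linear gap penalty and illustrative match/mismatch/gap scores; it is not the CHREC or Pico implementation, and real tools stream and tile the matrix rather than holding it whole. The property the FPGA designs exploit is that every cell along an anti-diagonal of the scoring matrix can be computed at the same time, so a systolic array of processing elements can sweep the matrix in roughly linear rather than quadratic time.

```c
/*
 * Textbook Smith-Waterman local alignment score with a linear gap penalty.
 * Illustrative scores and toy matrix size; returns the best local score.
 */
#include <stdio.h>
#include <string.h>

#define MATCH     2
#define MISMATCH -1
#define GAP      -1
#define MAXLEN  127

static int max4(int a, int b, int c, int d)
{
    int m = a;
    if (b > m) m = b;
    if (c > m) m = c;
    if (d > m) m = d;
    return m;
}

static int smith_waterman(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    static int H[MAXLEN + 1][MAXLEN + 1];   /* H[i][j]: best alignment ending at (i, j) */
    int best = 0;

    if (la > MAXLEN || lb > MAXLEN)
        return -1;

    memset(H, 0, sizeof H);
    for (int i = 1; i <= la; i++) {
        for (int j = 1; j <= lb; j++) {
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            H[i][j] = max4(0,
                           H[i - 1][j - 1] + s,    /* align a[i-1] with b[j-1] */
                           H[i - 1][j] + GAP,      /* gap in b */
                           H[i][j - 1] + GAP);     /* gap in a */
            if (H[i][j] > best)
                best = H[i][j];
        }
    }
    return best;
}

int main(void)
{
    printf("best local score: %d\n", smith_waterman("GATTACA", "GCATGCA"));
    return 0;
}
```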

In Seattle, Pico Computing implemented a similar algorithm that performs sequence analysis and scoring to create a 2-dimensional figure called a dot plot. The Pico team reported that they had achieved greater than 5000X acceleration of their algorithm, using a cluster of 112 Xilinx Spartan-3 FPGA devices. The Pico cluster consumed less than 300 Watts of power, with all FPGAs fitting comfortably into a single 4U server chassis.

Perhaps more interesting than the raw numbers is how the Pico Computing project was developed. Greg Edvenson of Pico used a single FPGA device during initial algorithm development. The FPGA was encapsulated in a Pico Computing E-17 card attached directly to Edvenson’s laptop computer via an ExpressCard interface. After the algorithm was tested and working as a single hardware process, Edvenson scaled up and replicated the algorithm for deployment on the FPGA cluster. He used C-to-FPGA tools provided by Impulse Accelerated Technologies during development, reducing the need to write low-level HDL code.

In summary, bioinformatics is one application domain in which FPGA acceleration offers clear and compelling benefits. And there are multiple FPGA cluster approaches that can be taken to meet the needs of the application and the constraints of budget and power consumption.

1 Comment

Filed under News Shmews

Dirty words and wardrobe malfunctions

Saving the world, one $#&^! FPGA at a time…

School system to use Algolith Profanity Cleaner and Delay System for live broadcasts

Okay, so maybe it was a slow news day when this got picked up. But digging deeper, what Algolith is offering is a reconfigurable, FPGA-based delay and filtering application for audio and video.

In their literature, Algolith is promoting the concept of One Card, One Price, More Choices. This is fundamentally what reconfigurable hardware is all about, and it makes sense for both the customer and for Algolith.

Why? Because video filtering and other broadcast video applications require hardware solutions, but don’t have static requirements. What’s needed for broadcasting, say, the Super Bowl, with 50 or more HD cameras, chaotic real-time action and thousands of opportunities for verbal and visual naughtiness, is probably quite different from what’s needed for broadcasting a healthcare town meeting with a few unruly seniors.

How does reconfigurable hardware come in? A vendor such as Algolith can design and produce a hardware-based solution, such as an HD-compatible video card, that has at its core one or more FPGAs. Some of the logic in these FPGAs is fixed and rarely updated, handling those parts of video processing that don’t change: interfacing with video and network I/O devices, for example. Other parts of the video processing, however, are reconfigurable, allowing new types of clever delay filtering and other video gymnastics to be performed by the customer. Want to fuzz out someone’s unfortunate wardrobe problems with one mouse click, and track the offending [insert noun here] even as it moves around the scene? How about hiding all those annoying non-sponsor brand logos? Hey, somebody’s got to keep the viewers safe.

I don’t know if Algolith offers such high-zoot features in its video products, but these sorts of capabilities are certainly possible to implement in FPGAs today.

From a marketing perspective, the really attractive thing about reconfigurable hardware is the ability for a company like Algolith to evolve from being a hardware vendor – which is not a particularly scalable business – into a vendor of IP and services that use their reconfigurable hardware product as a platform for value-added options and reconfigurable firmware upgrades. And that’s a way $&@%*%! better business to be in.

Leave a comment

Filed under News Shmews, Reconshmiguration

FPGA-accelerated financial analytics get real

Automated trading and near-real-time financial analytics have been hot topics for some years now. Large organizations such as Bank of America deploy massive compute clusters to do such things as calculate the present value of options, or to model credit derivatives, in a virtual arms race to make trades with ever-higher levels of accuracy and ever-lower latencies. The banks and hedge funds that win this race each day have the potential to make millions or billions of dollars in extra profits. Vast amounts of power are consumed to drive the world’s most advanced supercomputers in a constant quest to produce, well, nothing at all… Just information used to move wealth from one global pants-pocket to another.

And all in the pursuit of market efficiency, of course. Hopefully all this money-shuffling is good for my meager retirement portfolio.

Editorializing aside, there has been a lot of buzz about the role of accelerators, including FPGAs, in financial applications. XtremeData this week generated some press regarding their new accelerated database for analytics. Their solution is attractive because it combines an FPGA module with an industry-standard HP ProLiant server to accelerate specific algorithms (in this case SQL queries) by 15X over software-only equivalents.

As an industry, we need more turnkey solutions that highlight the benefits of FPGA acceleration. With enough such applications out there, the demand for programming and hardware platform solutions for other, possibly unrelated applications will increase.

Assuming, of course, all this financial alchemy doesn’t once again turn gold into lead.

Leave a comment

Filed under News Shmews, Reconshmiguration

Loring Wirbel on “Loose threads and blank slates”

Loring Wirbel (FPGA Gurus blog) provides good perspective on recent developments and setbacks in reconfigurable architectures, and the risks faced by FPGA startups in the current environment:

Loose Threads and Blank Slates

Leave a comment

Filed under Reconshmiguration

Medical imaging gets an FPGA boost

FPGAs are finding increased use in medical electronics. Frost and Sullivan reported in 2007 that FPGAs in medical imaging, including X-Ray, CT, PET, MRI, and ultrasound, already represented as much as $138M in revenue to FPGA companies, with CT alone representing $10M or more of that amount. Steady growth in these applications was forecast through 2011.

FPGAs assist medical imaging in two areas: detection and image reconstruction. The detection part of medical imaging is an embedded systems application, with real-time performance requirements and significant hardware interface challenges. Image reconstruction, on the other hand, is more like a high-performance computing problem.

Image capture and display in computed tomography involves synchronizing large numbers of detectors arranged in a ring around the patient, in the large doughnut structure that we associate with CT, MRI and PET scanning. These detectors are often implemented using FPGAs, many hundreds of them, and already represent a large and profitable market for programmable logic devices.

While FPGAs are well-established in the detector part of the imaging problem, they can also help solve a significant problem in image reconstruction. They do this by serving as computing engines – as dedicated software/hardware application accelerators.

Tomographic reconstruction is a compute-intensive problem; the process of creating cross-sectional images from data acquired by a scanner requires an enormous number of CPU cycles. The primary computational bottleneck after data capture by the scanner is the back-projection of the acquired data into image space to reconstruct the internal structure of the scanned object.
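For a sense of where those cycles go, here is a minimal sketch in C of the parallel-beam back-projection loop. The geometry, the nearest-neighbour interpolation and the fixed array sizes are simplifications made for illustration; this is not the UW implementation described below. What it does show is the structure: every output pixel accumulates a contribution from every projection angle, independently of every other pixel, which is exactly the kind of regular, data-parallel arithmetic an FPGA (or GPU) can perform for many pixels per clock cycle.

```c
/*
 * Minimal parallel-beam back-projection sketch. The caller supplies a
 * filtered sinogram sino[angle][detector] and a zero-initialized image.
 * Nearest-neighbour interpolation and toy sizes, for illustration only.
 */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N_ANGLES 180   /* projection angles                */
#define N_DET    256   /* detector bins per projection     */
#define IMG      256   /* reconstructed image is IMG x IMG */

void backproject(const float sino[N_ANGLES][N_DET], float image[IMG][IMG])
{
    for (int a = 0; a < N_ANGLES; a++) {
        float theta = (float)a * (float)M_PI / N_ANGLES;
        float c = cosf(theta), s = sinf(theta);

        for (int y = 0; y < IMG; y++) {
            for (int x = 0; x < IMG; x++) {
                /* Project this pixel's centre onto the detector axis
                 * for the current view, then pick the nearest bin. */
                float xc = x - IMG / 2.0f;
                float yc = y - IMG / 2.0f;
                int d = (int)lroundf(xc * c + yc * s + N_DET / 2.0f);

                if (d >= 0 && d < N_DET)
                    image[y][x] += sino[a][d];   /* accumulate this view */
            }
        }
    }
}
```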

University of Washington graduate researchers Nikhil Subramanian and Jimmy Xu, working under the direction of Dr. Scott Hauck, recently completed a project evaluating the use of higher-level programming methods for FPGAs, using back-projection as a benchmark algorithm. Nikhil and Jimmy achieved well over 100X speedup of the algorithm over a software-only equivalent. The target hardware for this evaluation was an XtremeData coprocessor module. This module, the XD1000, is based on Altera FPGA devices and serves as a coprocessor to an AMD Opteron processor running Linux, via a HyperTransport socket interface.

This project, which was funded in part by a $100,000 Research and Technology Development grant from Washington Technology Center, was intended to determine the tradeoffs of using higher-level FPGA programming methods for medical imaging, radar and other applications requiring high throughput image reconstruction.

The key to accelerating back-projection is to exploit parallelism in the computation. Working in cooperation with Dr. Adam Alessio of the UW Department of Radiology, the two researchers converted and refactored an existing back-projection algorithm, using both C-to-FPGA (the Impulse C tools) and Verilog HDL, to evaluate design efficiency and overall performance.

This conversion, which included refactoring the algorithm for parallel execution in both C and Verilog, took about two-thirds as much time in C as it did in Verilog. Perhaps more importantly, the two researchers found that later design revisions and iterations were much faster when working in C, requiring as little as one-seventh the time to make algorithm modifications compared to Verilog.

The quick success of this project showed that even first-time users of C-to-FPGA methods can rival the results achieved by hand-coding in HDL, with surprisingly little performance penalty and faster time to deployment.

The results of this study have been published as Nikhil’s Master’s Thesis, which is available here: A C-to-FPGA Solution for Accelerating Tomographic Reconstruction.

1 Comment

Filed under News Shmews, Reconshmiguration