Monthly Archives: June 2008

GP-GPU and FPGA

Kevin Morris at FPGA Journal has published an article titled A Passel of Processors, describing how the new NVIDIA Tesla GPU poses a direct threat to FPGAs in the domain of high-performance, hardware-accelerated computing.

According to NVIDIA, the Tesla delivers teraflop performance in a single, massively parallel (240-core) device. And it can be programmed in C.

Or at least something resembling C. Because, after all, when you have 240 cores and a massively parallel processing target, your C-language application is not likely to look like your father’s DEC PDP-11 sequential C-code.

I’ll leave the debate about whether GPUs will consistently beat FPGAs in gigaflops per watt, across a wide variety of applications, to someone else.

That’s a race that is not yet over.

What interests me is the common belief that FPGAs are inherently more difficult to program than GPUs. Are they? Go look at this Mersenne Twister example and the sample source code for it, then compare it to this version (coded up using Impulse C). It’s a rather simple example, but it demonstrates pipeline generation (controlled by a C-language pragma) and the use of a streaming programming model. These concepts are very similar to what is required for GP-GPU programming, or for programming multicore and cluster applications using OpenMP or MPI.
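
To make that concrete, here is a rough, plain-C sketch of the streaming style these tools expect. This is not the code from the examples linked above; the stand-in computation is a simple xorshift step rather than a real Mersenne Twister, and the pipeline pragma mentioned in the comments is an Impulse C-style hint, so treat the details as illustrative.

    #include <stdint.h>
    #include <stdio.h>

    #define N 16

    /* Trivial stand-ins for hardware streams: in a real Impulse C design
       these would be stream objects fed by DMA or by another process.   */
    static uint32_t input_fifo[N];
    static uint32_t output_fifo[N];

    /* One streaming "process": read a word, transform it, write a word.
       A pipeline pragma placed in this loop (Impulse C uses a pragma of
       the form "#pragma CO PIPELINE") asks the compiler to overlap
       successive iterations in hardware, so that one result streams out
       per clock once the pipeline fills.                                 */
    static void prng_process(const uint32_t *in, uint32_t *out, int n)
    {
        for (int i = 0; i < n; i++) {
            /* pipeline pragma would go here */
            uint32_t x = in[i];
            x ^= x << 13;   /* a cheap xorshift step, standing in for */
            x ^= x >> 17;   /* one stage of a real generator such as  */
            x ^= x << 5;    /* the Mersenne Twister                   */
            out[i] = x;
        }
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            input_fifo[i] = (uint32_t)(i + 1);

        prng_process(input_fifo, output_fifo, N);

        for (int i = 0; i < N; i++)
            printf("%u -> %u\n", (unsigned)input_fifo[i], (unsigned)output_fifo[i]);
        return 0;
    }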

The CUDA tools and tools like Impulse C provide extensions to the C language that help software programmers deal with multi-process parallelism, in GPUs and FPGAs respectively. Rather than attempting to hide parallelism and multi-process optimization from programmers, these environments embrace parallel programming methods using a base language, C, that programmers already know.
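
For the GPU side, here is a minimal CUDA sketch of what those extensions look like; it is my own toy example, not NVIDIA’s sample code, so the kernel name and sizes are made up. The __global__ qualifier marks a function that runs on the device, and the triple-chevron launch replaces the sequential loop with a grid of threads, one per data element.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* One thread per element: the sequential loop is replaced by a grid
       of threads, each computing the element at its own index.          */
    __global__ void scale_add(float *y, const float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *hx = (float *)malloc(bytes);
        float *hy = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

        float *dx, *dy;
        cudaMalloc((void **)&dx, bytes);
        cudaMalloc((void **)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

        /* The triple-chevron launch is the C extension in question:
           enough 256-thread blocks to cover all n elements.          */
        scale_add<<<(n + 255) / 256, 256>>>(dy, dx, 3.0f, n);

        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f (expect 5.0)\n", hy[0]);

        cudaFree(dx); cudaFree(dy);
        free(hx); free(hy);
        return 0;
    }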

To summarize: programming either FPGAs or GPUs is challenging, but tools are available today that have, for the most part, made these devices usable by software programmers. Some experience is needed to manage parallelism in the application efficiently, but the level of abstraction is not really so different across these two very different types of acceleration platform.

Filed under Reconshmiguration

Picking your problems

Geno Valente of XtremeData has written an excellent article titled To Accelerate or not to accelerate, appearing this week on embedded.com.

A key point made in this article is that the high-performance computing world, meaning such domains as life sciences, financial computing, oil and gas exploration and the like, can learn quite a lot from the world of embedded systems.

Embedded systems designers already face critical issues of power vs. performance and the need to combine a wide variety of diverse processing resources. These resources include embedded processors, DSPs, FPGAs, ASSPs and custom hardware. These devices, when combined, could be thought of as a hybrid processing platform.

In the HPC world, however, the dominant approach has been to create clusters of identical processors connected by high-speed networks. Only recently have alternative accelerators such as FPGA, GPU and Cell been added to the mix, and at this point only by a small minority of HPC users and platform developers.

So… I’m in total agreement; we need to start applying the system-level optimization knowledge that is commonplace in embedded systems design, including power optimization, partitioning, and an understanding of pipelining and data streaming, to the problems facing HPC. And, of course, we need to improve the tools and libraries so that accelerated computing platforms are easier for HPC programmers to pick up and use.

Filed under Embed This!, Reconshmiguration

A reconfigurable tick squeezer?

This headline caught my attention today:

Vhayu Introduces Hardware Compression for Its Tick Database
Combining Vhayu Velocity with an FPGA, Squeezer compresses data by a factor of four with no performance penalty, says the vendor.

My first thoughts on seeing that headline were:

  • Is there really such a large database on ticks that hardware compression is required?
  • Would somebody actually call such a tool, “Squeezer”?
  • Is this April 1st?

On further reading, however, I realized that “tick” means “ticker”, as in market trading data.

Aha!

In fact, feed handling is a domain in which FPGAs are starting to gain significant traction. It’s all about getting the lowest possible latency: data comes directly into the system from a market feed source, and a hardware-based algorithm makes a split-second trading decision based on observed patterns. A sudden downturn in the price of key indicator stocks, for example. Or a spike in oil prices, or a drop in the Brazilian Real, or whatever.

The trading house, hedge fund or bank that sees the pattern and reacts first is the one that wins. And so it’s a latency war out there. FPGAs represent one solution to the latency problem, and have been deployed in numerous trading-related market data appliances.
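
As a software analogy only (this is my own illustrative C, not anything Vhayu or the appliance vendors have published), the kind of logic being pushed into hardware is a small stateful filter over the tick stream: keep a short price history per instrument and raise a flag the instant a threshold is crossed.

    #include <stdio.h>

    #define WINDOW 8             /* ticks of history kept per symbol    */
    #define DROP_THRESHOLD 0.02  /* flag a 2% drop within the window    */

    /* Running state for one instrument. In an FPGA implementation this
       would live in on-chip registers or block RAM next to the feed
       parser, not in main memory.                                       */
    struct tick_state {
        double history[WINDOW];
        int    count;
        int    head;
    };

    /* Push one tick; return 1 if the price has fallen more than
       DROP_THRESHOLD from the oldest price still in the window.         */
    static int sudden_drop(struct tick_state *s, double price)
    {
        int fired = 0;
        if (s->count == WINDOW) {
            double oldest = s->history[s->head];  /* next slot to be overwritten */
            if (price < oldest * (1.0 - DROP_THRESHOLD))
                fired = 1;
        }
        s->history[s->head] = price;
        s->head = (s->head + 1) % WINDOW;
        if (s->count < WINDOW)
            s->count++;
        return fired;
    }

    int main(void)
    {
        /* A made-up tick sequence with an abrupt dip at the end. */
        double feed[] = { 101.0, 101.2, 101.1, 101.3, 101.2,
                          101.4, 101.3, 101.2, 101.1,  98.7 };
        struct tick_state s = { {0}, 0, 0 };

        for (int i = 0; i < (int)(sizeof(feed) / sizeof(feed[0])); i++)
            if (sudden_drop(&s, feed[i]))
                printf("tick %d: drop detected at %.2f\n", i, feed[i]);
        return 0;
    }

In an FPGA appliance the same kind of check sits directly behind the feed parser, so the decision is made within a few clock cycles of the tick arriving, which is the whole point of the latency argument above.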

I like this quote attributed to Jeff Hudson, Vhayu CEO:

“It’s a hybrid software/hardware world we’re entering now, and those companies that embrace it will prosper and those that don’t will fall way behind.”

Indeed.

See also: High Frequency Traders Get Boost From FPGA Acceleration

Filed under News Shmews, Reconshmiguration

The Petaflop Playstation

Supercomputing has reached a new milestone with the announcement that IBM and Los Alamos National Laboratory have cracked the petaflop performance barrier.

For those who don’t speak in mops, bops, flops and no-ops, a petaflop is a measure of performance equal to one thousand trillion (10^15) floating-point calculations per second.

That’s a whole lot of math.

The new supercomputer, called Roadrunner, is built from commodity components, including a reported 7,000 standard, off-the-shelf AMD Opteron processors closely coupled to some 12,000 IBM Cell processors.

That’s less than 20,000 commodity chips. Commodity, as in standard, widely available, non-exotic. You probably don’t have an Opteron in the computer you are using right now, but you almost certainly used one (or more likely a cluster of them) just moments ago when you asked Google, or Microsoft, or Yahoo to do a web search for you.

As for the Cell processor (or Cell Broadband Engine, as it’s more formally known), you probably know some kid who has one of those. It’s the brains inside the Sony PlayStation 3.

Roadrunner uses the 7,000 Opteron processors to do general-purpose grid processing, while the 12,000 Cell processors serve as specialized application accelerators.

That may sound like a lot of chips, but consider this: the previous king of supercomputers, the IBM Blue Gene/L, has approximately half the performance of Roadrunner but uses 212,992 processors, and presumably consumes far more power than Roadrunner.

To summarize: This is exciting news for accelerated computing. Cell Processors, GPUs and FPGAs are all proving their worth in a new, hybrid multiprocessing world. The question, of course, is how do you program these things?

Filed under News Shmews

A new low-power FPGA?

The power efficiency of FPGAs is either very good, or very bad, depending on who you’re talking to. If you are comparing math performance, meaning the number of integer or floating point operations performed per watt of power, then FPGAs look pretty good when placed side-by-side against a traditional processor or DSP. This is because the FPGA has less overhead; more of the transistors can be configured as parallel structures to do the real work of processing data and computing results. FPGAs can do the same work with fewer clock cycles, resulting in lower power consumption.

If, however, you are looking at low-power portable products such as mobile phones and handheld games, FPGAs are not even contenders. They have a reputation as power hogs. This is because FPGAs are dominated by interconnect and have flexible, application-independent structures. It is not possible, at least not for real applications, to make use of all the transistors in an FPGA. There is leakage, and there is static power overhead that comes with the SRAM architecture of the most common devices. And so power is wasted.

Both Xilinx and Altera have low-power FPGA families (Spartan and Cyclone, respectively), and Actel devices are also power misers. But these devices still consume too much power for use in mobile devices.

So the news that startup SiliconBlue Technologies has a new and much lower power FPGA device is notable. Their devices are small (akin to complex PLDs) but quite FPGA-like in their design.

I’m not much of an expert on FPGA process technologies, but if SiliconBlue can actually get traction in mobile devices and scale these devices up to tackle more complex algorithms… then that could be an important step for reconfigurable embedded computing.

Filed under News Shmews