July 21, 2008

An escalator is a beautiful thing

I was traveling this month, and had the unlucky situation of needed to take six different planes (three each way) for a trip to and from Illinois. My fault for booking late, but it was an opportunity to think about latency, throughput, optimization and reconfigurable computing.

Actually I didn’t think about any of those things while strolling though the various airports. I thought mostly about where to get a beer and a burrito between flights. But that’s the same problem.

The design of an airport is, or should be, about getting people and their bags from one place to another at a predictable speed, maximizing the number of individuals who pass through on their way from there to there. And for the most part they are well designed for that purpose.

Note that I said predictable speed, not high speed. The airport managers might not care very much if takes you 5 minutes or 7 minutes to get from one gate to another. What they care about that it takes everyone pretty much the same amount of time, and that the system doesn’t get clogged up and delayed during times of peak capacity. The goal is high throughput, getting the maximum number of people through the system in a given time. This often requires a tradeoff in latency, which is the time taken for any one individual to make the journey. The same concept applies not only to airports but also to highways, to manufacturing and to pretty much every complex system out there that involves long lines of people, animals or objects requiring one or more constrained resources.

In places like Chicago O’Hare or Tokyo Narita you see this kind of throughput planning all over the place. Sometimes it doesn’t work very well, but mostly it does. You see it in the queues for checkin, in the lines for security, in the way passengers are sequenced when getting onto the planes, and in the way bags come down onto the carousels. You see it in the way you order, pay for and pick up your six dollar venti nonfat soy frappe macchiatoccino. If you could go beyond what you can see as a mere passenger, then you would find it on the taxiways and in the skies overhead, as aircraft are sequenced for landing and departure, and handed off from controller to controller as they move from sector to sector.

And what is all this? A whole lot of parallel, pipelined and interconnected systems.

Consider airport escalators for a moment. When you are on one, especially a very long one, it might seem awfully slow. But what an amazing and beautiful thing an escalator is, when you think about throughput.

I have a specific example in mind. If you take the express train from Tokyo to Narita airport, you and hundreds of other people will simultaneously arrive at a platform deep underneath the airport, somewhere around basement level 5. Everyone on that train will have one or two bags, some will have small children, some will be old people who walk slowly, some will be impatient and fast… hence it should be total chaos when the doors of that train open and people try to make their way up to the checkin lines, seven levels up. But the Narita escalators practically eliminate that chaos. They do this by forcing people to get on at a constant rate, one person, or a pair of people, for every two or three steps. At capacity there are many hundreds of people simultaneously moving up from level to level. The escalators provide a smoothing function, delivering people in an orderly manner to the lines at the counters, and at a much higher effective rate, more people per minute than could possibly be provided by elevators or stairs.

How does this relate to reconfigurable computing? Well… a non-technical friend recently asked me to explain the difference between traditional processing, using an x86 processor, for example, and parallel processing in an FPGA. I was tossing around words like “pipelining” and “throughput” without relating those concepts to the real world. Then it dawned on me that traditional, sequential processing is like an elevator, and reconfigurable processing is like an escalator.

The traditional processor is a single elevator that carries a small number of passengers (data and instructions) from one floor to another. For decades, processor vendors have focused entirely on making their elevators run faster, working to get small sequences of data and instructions from the bottom floor to the top floor as quickly as possible. They have also increased the capacity of the elevator by increasing the word size (64 bits, for example) and by adding more complex instructions. But the elevator approach has inherent limitations. At busy times of the day, an elevator becomes a bottleneck no matter how fast it runs.

If we eliminate the elevator and instead build an escalator, or better yet multiple parallel escalators, then we can move a whole lot more passengers in the same amount of time, and probably do it using a whole lot less power. No doors to open and close, not as much shuffling for position in the queue, and no snot-nosed little kid pushing every button and jamming up the system. And if our airport or system is truly reconfigurable, we can deploy exactly the right combination of parallel escalators that are needed at any given time. Shut down or eliminate unused escalators in the middle of the night, and add more of them when we expect a lot of trains or planes to come in at the same time.

And there’s something else to consider: parallel programming is sometimes described as difficult, exotic and unnatural. Something that most software programmers just aren’t ready for. Well… I dunno. Clever people have been designing lean production, package sorting and transportation systems, and shopping malls and airports, for an awfully long time and are getting pretty darn good at it. Maybe we just need to hire some of these people to write and optimize our parallel algorithms?

July 20, 2008

Threads, pipelines and the demise of Moore’s Law

I came across an interview with Donald Knuth from June of this year, in which he throws some cold water on the current trend toward multicore computers. An excerpt:

…I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop…

Strong words, but there’s a little tidbit later in the article suggesting where Knuth’s sympathies really are:

…So why should I be so happy about the future that hardware vendors promise? They think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream. (No—that’s the wrong metaphor! “Pipelines” actually work for me, but threads don’t. Maybe the word I want is “bubble.”)

I can relate to that. Being an old-school, PDP-11 era C programmer I never really grasped the intricacies of synchronization and how to write a decent threaded algorithm. Threading was never intuitive for me. I understood it at an abstract level, but it always felt like thread libraries and APIs were an awful invention, forcing programmers to contort their code in ways that rarely matched the actual application.

I have far less trouble wrapping my head around streaming and pipelining. The model of having multiple processes independently processing streams of data is intuitive because it exists everywhere around us, in the real world (see my next posting on that). Even very complex systems with nested pipelines of varying rates can be understood conceptually by programmers, and by non-programmers as well.

Over on Dobbs Code Talk there’s a blog post from James Reinders titled Pipelines/Streams offer easy parallel programming. In the posting Reinders offers the following concepts:

The “magic” which makes this all so easy for parallel programming comes from three things:

  1. to be parallel you need independent work to run in parallel: if you pipeline your work (streaming data) and you have no interdependencies other than the data streams themselves (no global side-effects) you get exactly that: independent work to run in parallel
  2. the pipeline stages themselves can be broken up to run in parallel by either data parallelism, or possibly a pipeline of their own (so nested parallelism is important)
  3. the very very sticky problem of data placement, which becomes a more and more severe problem in the future, is solved implicitly (the migration of data is very clear and very clean)

The above makes parallel programming using pipelined processes and streaming data seem rather simple and obvious. We do a lot of that kind of programming around here (using Impulse C) and yes, it’s a very good approach when targeting massively parallel architectures such as FPGAs. Maybe it’s the only practical method at the moment. Personally I would not characterize parallel programming and the design of highly pipelined algorithms as “easy”, but tools available today, including ours, make it practical for software programmers to write such programs and target non-traditional computing devices. The analysis and optimization of deeply pipelined, high-performance applications is still a significant challenge, but this challenge can be met with improved tools, and with the more intuitive programming models that streaming and pipelining represent.

July 8, 2008

Impulse and UW win medical imaging grant

Impulse and the University of Washington have won a grant from the Washington Technology Center for investigations into reconfigurable, FPGA-based computing in the domain of positron emission tomography (PET) scanning. This project, which will be directed by Scott Hauck at the University of Washington, will initially apply the Impulse C tools and multiple hardware platforms to the problem of filtered back-projection, then move into other areas of PET image construction. Note that the back-projection algorithm can also be used in applications such as synthetic aperterure radar, or SAR.

Scott Hauck, by the way, is the editor/author of a very good book titled Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing.

June 19, 2008

GP-GPU and FPGA

Kevin Morris at FPGA Journal has published an article titled A Passel of Processors, describing how the new NVIDIA Tesla GPU poses a direct threat to FPGAs in the domain of high-performance, hardware-accelerated computing.

According to NVIDIA, the Tesla provides teraflop performance in a single, massively parallel (240 cores) device. And it can be programmed using C-language.

Or at least something resembling C. Because, after all, when you have 240 cores and a massively parallel processing target, your C-language application is not likely to look like your father’s DEC PDP-11 sequential C-code.

I’ve leave the debate about whether GPUs will consistently beat FPGAs, in terms of gigaflops per watt, in a wide variety of applications, to someone else.

That’s a race that is not yet over.

What interests me is the common belief that FPGAs are inherently more difficult to program than GPUs. Are they? Go look at this Mersenne Twister example and the sample source code for it, then compare to this version (coded up using Impulse C). That’s a rather simple example, but it demonstrates the use of pipeline generation (controlled by a C-language pragma) and the use of a streaming programming model. These are concepts very similar to what is required for GP-GPU programming, or when programming cluster applications using OpenMP, etc.

The CUDA tools and tools like Impulse C provide extensions to the C-language to help software programmers deal with multi-process parallelism, in GPUs and FPGAs respectively. Rather than attempt to hide parallelism and multiple-process optimization methods from programmers, these environments embrace parallel programming methods using a base language, C, that is familiar to programmers.

To summarize: programming for either FPGAs and GPUs is challenging, but tools are available today that have, for the most part, made these devices usable by software programmers. Some amount of experience is needed to efficiently manage parallelism in the application, but the programming level of abstraction is not really so divergent in these very different types of acceleration platforms.

June 12, 2008

Picking your problems

Geno Valente of XtremeData has written an excellent article titled To Accelerate or not to accelerate, appearing this week on embedded.com.

A key point made in this article is that high-performance computing applications, meaning such things as life sciences, financial computing, oil and gas exploration and the like, can learn quite a lot from the world of embedded systems.

Embedded systems designers already face critical issues of power vs. performance and the need to combine a wide variety of diverse processing resources. These resources include embedded processors, DSPs, FPGAs, ASSPs and custom hardware. These devices, when combined, could be thought of as a hybrid processing platform.

In the HPC world, however, the dominant approach has been to create clusters of identical processors connected by high-speed networks. Only recently have alternative accelerators such as FPGA, GPU and Cell been added to the mix, and at this point only by a small minority of HPC users and platform developers.

So… I’m in total agreement; we need to start gently applying the same knowledge and understanding of system optimizations (including power optimizations, partitioning, and an understanding of pipelining and data streaming) that is commonplace in embedded systems design, to the problems facing HPC. And of course improve the tools and libraries to make accelerated computing platforms easier for HPC programmers to pick up and use.

June 10, 2008

A reconfigurable tick squeezer?

This headline caught my attention today:

Vhayu Introduces Hardware Compression for Its Tick Database
Combining Vhayu Velocity with an FPGA, Squeezer compresses data by a factor of four with no performance penalty, says the vendor.

My first thoughts on seeing that headline were:

  • Is there really such a large database on ticks that hardware compression is required?
  • Would somebody actually call such a tool, “Squeezer”?
  • Is this April 1st?

On further reading, however, I realized that “tick” means “ticker”, as in market trading data.

Aha!

In fact, feed handling is a domain in which FPGAs are starting to gain significant traction. It’s all about getting the lowest possible latency: data comes directly into the system from a market feed source, and a hardware-based algorithm makes a split-second trading decision based on observed patterns. A sudden downturn in the price on key indicator stocks, for example. Or a spiking in oil prices, or a drop in the Brazilian Real, or whatever.

The trading house, hedge fund or bank that sees the pattern and reacts first is that one that wins. And so it’s a latency war out there. FPGAs represent one solution to the latency problem, and have been deployed in numerous trading-related market data appliances.

I like this quote attributed to Jeff Hudson, Vhayu CEO:

“It’s a hybrid software/hardware world we’re entering now, and those companies that embrace it will prosper and those that don’t will fall way behind.”

Indeed.

See also: High Frequency Traders Get Boost From FPGA Acceleration

June 9, 2008

The Petaflop Playstation

Supercomputing has reached a new milestone with the announcement that IBM and Los Alamos National Labs have cracked the petaflop performance barrier.

For those who don’t speak in mops, bops, flops and no-ops, a petaflop is a measurement of performance that equates to performing one thousand trillion floating point calculations per second.

That’s a whole lot of math.

The new supercomputer, called Roadrunner, is built from commodity components including a reported 7000 standard, off-the-shelf AMD Opteron processors closely coupled to some 12,000 IBM Cell Processors.

That’s less than 20,000 commodity chips. Commodity, as in standard, widely available, non-exotic. You probably don’t have an Opteron in the computer you are using right now, but you almost certainly used one (or more likely a cluster of them) just moments ago when you asked Google, or Microsoft, or Yahoo to do a web-search for you.

As for the Cell Processor (or Cell Broadband Engine as it’s more formally known) you probably know some kid who has one of those. It’s the brains inside the Sony Playstation.

Roadrunner uses the 7,000 Opteron processors to do general-purpose grid processing, while the 12,000 Cell processors serve as specialized application accelerators.

That may sound like a lot of chips, but consider this: The previous king of supercomputers, the IBM Blue Gene, has approximately half the performance of Roadrunner but uses 212,992 processors, and presumably consumes way more power than Roadrunner.

To summarize: This is exciting news for accelerated computing. Cell Processors, GPUs and FPGAs are all proving their worth in a new, hybrid multiprocessing world. The question, of course, is how do you program these things?

June 3, 2008

A new low-power FPGA?

The power efficiency of FPGAs is either very good, or very bad, depending on who you’re talking to. If you are comparing math performance, meaning the number of integer or floating point operations performed per watt of power, then FPGAs look pretty good when placed side-by-side against a traditional processor or DSP. This is because the FPGA has less overhead; more of the transistors can be configured as parallel structures to do the real work of processing data and computing results. FPGAs can do the same work with fewer clock cycles, resulting in lower power consumption.

If, however, you are looking at low-power portable products such as a mobile phones and games, FPGAs are not even contenders. They have a reputation as power hogs. This is because FPGAs are dominated by interconnect and have flexible, application-independent structures. It is not possible, at least not for real applications, to make use of all the transistors in an FPGA. There is leakage, and there is static power needed because of the SRAM architecture of the most common devices. And so power is wasted.

Both Xilinx and Altera have low-power FPGA families (Spartan and Cyclone, respectively) and Actel devices are also power misers. But these devices still consume too much power for use in mobile devices.

So the news that startup SiliconBlue Technologies has a new and much lower power FPGA device is notable. Their devices are small (akin to complex PLDs) but quite FPGA-like in their design.

I’m not much of an expert on FPGA process technologies, but if SiliconBlue can actually get traction in mobile devices and scale these devices up to tackle more complex algorithms… then that could be an important step for reconfigurable embedded computing.

May 24, 2008

More FPGAs on Mars

The Phoenix Mars lander is scheduled to be on the surface of Mars this weekend, on Sunday. At least one Actel FPGA is on board, handling pressure and temperature data processing.

Note that Actel and Xilinx FPGAs are already rolling around on Mars, in the Spirit and Opportunity rovers.

Details of the Phoenix lander and its gadgets can be found on the JPL Mars Page.

May 22, 2008

FPGA and DSP Co-Design

Our pals at 3L put together a nice demonstration using a Sundance DSP/FPGA development board. Check it out:

DSP/FPGA Co-Processing Demonstrates 20X Acceleration using Software-to-Hardware Design Flow