Ambric shuts down

This was a surprisingly fast end to a company that looked like a rare survivor amidst the scattered carcasses of reconfigurable computing startups. Ambric actually had working systems, customer wins, over $30M in revenue (according to VentureWeb), and academic/research partners that believed in the technology. What they apparently lacked was the confidence and patience of the investment community, which had put in something like $36M to date and was perhaps spooked by the $200M failure of MathStar, a Portland-area reconfigurable computing neighbor, earlier this year.

If there’s any lesson in this, it’s that the funding environment for new reconfigurable computing devices is getting challenging, and the best bets are on companies making use of established FPGA devices, imperfect though FPGAs may be for RC.

Leave a comment

Filed under News Shmews

Stars of the tiny screen

Last month Michael Kreeger and I put together a 4-part YouTube demonstration using a Pico Computing EX-300 card. The demo walks through the basics of Impulse C, shows how the tools are used for software simulation and hardware generation, and demonstrates how to download and run a finished FPGA bitstream on the card.

Pico Computing EX-300 - 16 FPGAs on one PCI Express card

(The YouTube video is not very high resolution, making the screen text difficult to read. A better version with higher-res video and better sound can be found here.)

This demonstration shows how PCI Express FPGA cards are making prototyping of high performance, hardware-accelerated algorithms a whole lot easier than in the past. I predict we’ll be doing more YouTube tutorials in the future, and showing some rather impressive acceleration projects as FPGA-based platforms continue to evolve.

Leave a comment

Filed under Reconshmiguration

Progeniq accelerates animation rendering

There is a press release from Progeniq on the wires today, announcing their new RenderBoost product for higher performance CGI and image effects rendering.

Progeniq is a Singapore-based company focused on FPGA-based reconfigurable computing applications. They previously announced a product called BioBoost oriented toward bioinformatics acceleration. By following BioBoost with a product in a very different target industry, Progeniq appears to be establishing themselves as application-independent reconfigurable computing experts.

But why image rendering? Aren’t there some heavy players in that already, most notably Mental Images and Pixar?

There’s no question that image rendering in the entertainment industry is a big deal. Applications such as Pixar’s Renderman/PRMan are typically run on clusters and perform such things as solid modeling, ray tracing, shading, texturing, hidden object removal and many other tasks. These tasks can be parallelized and scaled up on multiple processors, but nonetheless the rendering of a single frame of a movie such as Ratatouille or The Incredibles can require vast amounts of CPU time. In fact it has been reported that Ratatouille required over 1500 CPU-years to render.

Obviously, anything that can speed image rendering is of great interest to the business. Not just the movie business, but also advertising, as well as animated solid modeling for engineering, science, and medical applications. But what might make this particular industry most attractive to a company like Progeniq is the existence of standards (de facto and formal) and the opportunity to hook into existing tools via APIs and animation workflow tools.

There’s an important lesson here, I think. Any dramatically new technology, even one based on commodity devices such as FPGAs, is much easier to promote if it ties into existing and well-understood tools and workflows. If it can be as simple as plugging in an extra box or a PCI Express card and adding a few simple API calls to an existing application… well that seems like a darned easy sale to make. Good luck to Progeniq!

Leave a comment

Filed under News Shmews

Intel announces they are going to Tukwila*

I’m guessing the marketing folks down in Portland are not so familiar with Seattle-area urban lore…

Intel Stacks Up the Chips for Your Future PC

“The latest chip in Intel’s Itanium family for server and high-performance computing systems, Tukwila, is expected to be delivered late this year to servermakers. Systems using the chip will ship in late 2009.

“Tukwila is a 65-nanometer chip and the first quad-core member of the Itanium product family. The processor has more than 2 billion transistors on it.”

For a Seattle-area inside joke, see also: Urban Dictionary

Leave a comment

Filed under News Shmews, Reconshmiguration

An escalator is a beautiful thing

I was traveling this month, and had the unlucky situation of needing to take six different planes (three each way) for a trip to and from Illinois. My fault for booking late, but it was an opportunity to think about latency, throughput, optimization and reconfigurable computing.

Actually I didn’t think about any of those things while strolling through the various airports. I thought mostly about where to get a beer and a burrito between flights. But that’s the same problem.

The design of an airport is, or should be, about getting people and their bags from one place to another at a predictable speed, maximizing the number of individuals who pass through on their way from there to there. And for the most part they are well designed for that purpose.

Note that I said predictable speed, not high speed. The airport managers might not care very much if it takes you 5 minutes or 7 minutes to get from one gate to another. What they care about is that it takes everyone pretty much the same amount of time, and that the system doesn’t get clogged up and delayed during times of peak capacity. The goal is high throughput, getting the maximum number of people through the system in a given time. This often requires a tradeoff in latency, which is the time taken for any one individual to make the journey. The same concept applies not only to airports but also to highways, to manufacturing and to pretty much every complex system out there that involves long lines of people, animals or objects requiring one or more constrained resources.

In places like Chicago O’Hare or Tokyo Narita you see this kind of throughput planning all over the place. Sometimes it doesn’t work very well, but mostly it does. You see it in the queues for checkin, in the lines for security, in the way passengers are sequenced when getting onto the planes, and in the way bags come down onto the carousels. You see it in the way you order, pay for and pick up your six dollar venti nonfat soy frappe macchiatoccino. If you could go beyond what you can see as a mere passenger, then you would find it on the taxiways and in the skies overhead, as aircraft are sequenced for landing and departure, and handed off from controller to controller as they move from sector to sector.

And what is all this? A whole lot of parallel, pipelined and interconnected systems.

Consider airport escalators for a moment. When you are on one, especially a very long one, it might seem awfully slow. But what an amazing and beautiful thing an escalator is, when you think about throughput.

I have a specific example in mind. If you take the express train from Tokyo to Narita airport, you and hundreds of other people will simultaneously arrive at a platform deep underneath the airport, somewhere around basement level 5. Everyone on that train will have one or two bags, some will have small children, some will be old people who walk slowly, some will be impatient and fast… hence it should be total chaos when the doors of that train open and people try to make their way up to the checkin lines, seven levels up. But the Narita escalators practically eliminate that chaos. They do this by forcing people to get on at a constant rate, one person, or a pair of people, for every two or three steps. At capacity there are many hundreds of people simultaneously moving up from level to level. The escalators provide a smoothing function, delivering people in an orderly manner to the lines at the counters, and at a much higher effective rate, more people per minute than could possibly be provided by elevators or stairs.

How does this relate to reconfigurable computing? Well… a non-technical friend recently asked me to explain the difference between traditional processing, using an x86 processor, for example, and parallel processing in an FPGA. I was tossing around words like “pipelining” and “throughput” without relating those concepts to the real world. Then it dawned on me that traditional, sequential processing is like an elevator, and reconfigurable processing is like an escalator.

The traditional processor is a single elevator that carries a small number of passengers (data and instructions) from one floor to another. For decades, processor vendors have focused entirely on making their elevators run faster, working to get small sequences of data and instructions from the bottom floor to the top floor as quickly as possible. They have also increased the capacity of the elevator by increasing the word size (64 bits, for example) and by adding more complex instructions. But the elevator approach has inherent limitations. At busy times of the day, an elevator becomes a bottleneck no matter how fast it runs.

If we eliminate the elevator and instead build an escalator, or better yet multiple parallel escalators, then we can move a whole lot more passengers in the same amount of time, and probably do it using a whole lot less power. No doors to open and close, not as much shuffling for position in the queue, and no snot-nosed little kid pushing every button and jamming up the system. And if our airport or system is truly reconfigurable, we can deploy exactly the right combination of parallel escalators that are needed at any given time. Shut down or eliminate unused escalators in the middle of the night, and add more of them when we expect a lot of trains or planes to come in at the same time.
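
For the numerically inclined, the throughput argument holds up even with generous assumptions for the elevator. Here’s a back-of-the-envelope comparison in C; every number in it is made up purely for illustration.

    /* Back-of-the-envelope throughput comparison: elevator vs. escalator.
     * All figures are invented for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        /* Elevator: 20 passengers per car, one 60-second round trip
         * (load, travel up, unload, return). */
        double elevator_per_min = 20.0;

        /* Escalator: one passenger steps on every 2 seconds, continuously. */
        double escalator_per_min = 60.0 / 2.0;

        printf("Elevator:  %.0f people/minute\n", elevator_per_min);
        printf("Escalator: %.0f people/minute\n", escalator_per_min);
        printf("Four parallel escalators: %.0f people/minute\n",
               4.0 * escalator_per_min);
        return 0;
    }

The escalator’s latency per passenger is worse (a long, slow ride), but its throughput is better, and it scales simply by adding more escalators.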

And there’s something else to consider: parallel programming is sometimes described as difficult, exotic and unnatural. Something that most software programmers just aren’t ready for. Well… I dunno. Clever people have been designing lean production, package sorting and transportation systems, and shopping malls and airports, for an awfully long time and are getting pretty darn good at it. Maybe we just need to hire some of these people to write and optimize our parallel algorithms?

1 Comment

Filed under Reconshmiguration

Threads, pipelines and the demise of Moore’s Law

I came across an interview with Donald Knuth from June of this year, in which he throws some cold water on the current trend toward multicore computers. An excerpt:

…I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop…

Strong words, but there’s a little tidbit later in the article suggesting where Knuth’s sympathies really are:

…So why should I be so happy about the future that hardware vendors promise? They think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream. (No—that’s the wrong metaphor! “Pipelines” actually work for me, but threads don’t. Maybe the word I want is “bubble.”)

I can relate to that. Being an old-school, PDP-11 era C programmer I never really grasped the intricacies of synchronization and how to write a decent threaded algorithm. Threading was never intuitive for me. I understood it at an abstract level, but it always felt like thread libraries and APIs were an awful invention, forcing programmers to contort their code in ways that rarely matched the actual application.

I have far less trouble wrapping my head around streaming and pipelining. The model of having multiple processes independently processing streams of data is intuitive because it exists everywhere around us, in the real world (see my next posting on that). Even very complex systems with nested pipelines of varying rates can be understood conceptually by programmers, and by non-programmers as well.

Over on Dobbs Code Talk there’s a blog post from James Reinders titled Pipelines/Streams offer easy parallel programming. In the posting Reinders offers the following concepts:

The “magic” which makes this all so easy for parallel programming comes from three things:

  1. to be parallel you need independent work to run in parallel: if you pipeline your work (streaming data) and you have no interdependencies other than the data streams themselves (no global side-effects) you get exactly that: independent work to run in parallel
  2. the pipeline stages themselves can be broken up to run in parallel by either data parallelism, or possibly a pipeline of their own (so nested parallelism is important)
  3. the very very sticky problem of data placement, which becomes a more and more severe problem in the future, is solved implicitly (the migration of data is very clear and very clean)

The above makes parallel programming using pipelined processes and streaming data seem rather simple and obvious. We do a lot of that kind of programming around here (using Impulse C) and yes, it’s a very good approach when targeting massively parallel architectures such as FPGAs. Maybe it’s the only practical method at the moment. Personally I would not characterize parallel programming and the design of highly pipelined algorithms as “easy”, but tools available today, including ours, make it practical for software programmers to write such programs and target non-traditional computing devices. The analysis and optimization of deeply pipelined, high-performance applications is still a significant challenge, but this challenge can be met with improved tools, and with the more intuitive programming models that streaming and pipelining represent.
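
To make the pipeline-and-streams model a little more concrete, here’s a minimal sketch in plain, sequential C (not Impulse C, not CUDA). Each stage touches only its input and output buffers, with no global side effects; that independence is exactly what lets the stages become concurrent hardware processes or threads, with the buffers replaced by streams.

    /* A three-stage pipeline sketched in plain C.  Each stage reads one
     * "stream" (here just an array) and writes another, with no shared
     * state beyond the streams themselves. */
    #include <stdio.h>

    #define N 16

    /* Stage 1: produce a stream of samples. */
    static void produce(int out[N])
    {
        for (int i = 0; i < N; i++)
            out[i] = i;
    }

    /* Stage 2: transform the stream (a trivial scale-and-offset). */
    static void transform(const int in[N], int out[N])
    {
        for (int i = 0; i < N; i++)
            out[i] = 3 * in[i] + 1;
    }

    /* Stage 3: consume the stream. */
    static void consume(const int in[N])
    {
        for (int i = 0; i < N; i++)
            printf("%d\n", in[i]);
    }

    int main(void)
    {
        int a[N], b[N];
        produce(a);
        transform(a, b);
        consume(b);
        return 0;
    }

Run sequentially this is unremarkable; mapped to an FPGA, each stage becomes its own always-running block of logic and data flows through all three stages simultaneously.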

Leave a comment

Filed under Reconshmiguration

Impulse and UW win medical imaging grant

Impulse and the University of Washington have won a grant from the Washington Technology Center for investigations into reconfigurable, FPGA-based computing in the domain of positron emission tomography (PET) scanning. This project, which will be directed by Scott Hauck at the University of Washington, will initially apply the Impulse C tools and multiple hardware platforms to the problem of filtered back-projection, then move into other areas of PET image reconstruction. Note that the back-projection algorithm can also be used in applications such as synthetic aperture radar, or SAR.
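
For readers who haven’t met the algorithm, here is a rough, textbook-style sketch of the back-projection step in plain C (the filtering part, typically a ramp filter applied to each projection beforehand, is omitted). This is illustrative only, not code from the UW project.

    /* Unfiltered back-projection sketch.  sino[a][t] holds the projection
     * measured at angle a and detector bin t; every image pixel accumulates
     * the projection value it maps onto at each angle. */
    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define NANG 180    /* number of projection angles        */
    #define NDET 256    /* detector bins per projection       */
    #define NPIX 256    /* reconstructed image is NPIX x NPIX */

    void backproject(const float sino[NANG][NDET], float image[NPIX][NPIX])
    {
        for (int a = 0; a < NANG; a++) {
            double theta = a * M_PI / NANG;
            double c = cos(theta), s = sin(theta);
            for (int y = 0; y < NPIX; y++) {
                for (int x = 0; x < NPIX; x++) {
                    double xc = x - NPIX / 2.0;   /* center the pixel grid */
                    double yc = y - NPIX / 2.0;
                    int t = (int)lround(xc * c + yc * s + NDET / 2.0);
                    if (t >= 0 && t < NDET)
                        image[y][x] += sino[a][t];
                }
            }
        }
    }

The deep loop nest with a simple accumulate in the middle is exactly the sort of structure that pipelines well in an FPGA, which presumably makes it a good first target for the project.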

Scott Hauck, by the way, is the editor/author of a very good book titled Reconfigurable Computing: The Theory and Practice of FPGA-Based Computing.

Leave a comment

Filed under News Shmews

GP-GPU and FPGA

Kevin Morris at FPGA Journal has published an article titled A Passel of Processors, describing how the new NVIDIA Tesla GPU poses a direct threat to FPGAs in the domain of high-performance, hardware-accelerated computing.

According to NVIDIA, the Tesla provides teraflop performance in a single, massively parallel (240 cores) device. And it can be programmed using C.

Or at least something resembling C. Because, after all, when you have 240 cores and a massively parallel processing target, your C-language application is not likely to look like your father’s DEC PDP-11 sequential C-code.

I’ll leave the debate about whether GPUs will consistently beat FPGAs, in terms of gigaflops per watt, in a wide variety of applications, to someone else.

That’s a race that is not yet over.

What interests me is the common belief that FPGAs are inherently more difficult to program than GPUs. Are they? Go look at this Mersenne Twister example and the sample source code for it, then compare it to this version (coded up using Impulse C). That’s a rather simple example, but it demonstrates the use of pipeline generation (controlled by a C-language pragma) and the use of a streaming programming model. These concepts are very similar to what is required for GP-GPU programming, or when programming shared-memory applications using OpenMP, etc.
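
If you’ve never seen what “pipeline generation controlled by a C-language pragma” looks like in practice, here’s the general shape, written as plain, self-contained C. The stream plumbing below is a generic stand-in I’ve invented for this sketch; it is not the Impulse C API (the linked example uses Impulse C’s own stream functions, with a pipeline pragma inside the loop).

    /* Sketch of a streaming, pipeline-friendly transform in generic C.
     * The istream_t type and stream_get() helper are placeholders made up
     * for this example.  In an Impulse C version the loop would read and
     * write hardware streams, and a pipeline pragma on the loop would ask
     * the compiler to accept a new input every clock cycle. */
    #include <stdio.h>

    typedef struct { const unsigned *data; int n, pos; } istream_t;

    static int stream_get(istream_t *s, unsigned *x)
    {
        if (s->pos >= s->n) return 0;
        *x = s->data[s->pos++];
        return 1;
    }

    /* Mersenne Twister-style "tempering" applied to each word in the stream. */
    static void temper_stream(istream_t *in)
    {
        unsigned x;
        while (stream_get(in, &x)) {
            /* In Impulse C, a PIPELINE pragma would go here. */
            unsigned y = x ^ (x >> 11);
            y ^= (y << 7)  & 0x9D2C5680u;
            y ^= (y << 15) & 0xEFC60000u;
            y ^= (y >> 18);
            printf("%08x\n", y);
        }
    }

    int main(void)
    {
        const unsigned raw[] = { 0x12345678u, 0x9ABCDEF0u, 0x0F1E2D3Cu };
        istream_t in = { raw, 3, 0 };
        temper_stream(&in);
        return 0;
    }

Each loop iteration depends only on its own input word, so the body can be unrolled or pipelined freely, which is the property both the FPGA and the GPU tool flows exploit.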

The CUDA tools and tools like Impulse C provide extensions to the C-language to help software programmers deal with multi-process parallelism, in GPUs and FPGAs respectively. Rather than attempt to hide parallelism and multiple-process optimization methods from programmers, these environments embrace parallel programming methods using a base language, C, that is familiar to programmers.

To summarize: programming for either FPGAs or GPUs is challenging, but tools are available today that have, for the most part, made these devices usable by software programmers. Some amount of experience is needed to efficiently manage parallelism in the application, but the level of programming abstraction is not really so different between these two very different types of acceleration platforms.

Leave a comment

Filed under Reconshmiguration

Picking your problems

Geno Valente of XtremeData has written an excellent article titled To Accelerate or not to accelerate, appearing this week on embedded.com.

A key point made in this article is that high-performance computing applications, meaning such things as life sciences, financial computing, oil and gas exploration and the like, can learn quite a lot from the world of embedded systems.

Embedded systems designers already face critical issues of power vs. performance and the need to combine a wide variety of diverse processing resources. These resources include embedded processors, DSPs, FPGAs, ASSPs and custom hardware. These devices, when combined, could be thought of as a hybrid processing platform.

In the HPC world, however, the dominant approach has been to create clusters of identical processors connected by high-speed networks. Only recently have alternative accelerators such as FPGA, GPU and Cell been added to the mix, and at this point only by a small minority of HPC users and platform developers.

So… I’m in total agreement; we need to start applying the same knowledge of system optimization (including power optimization, partitioning, pipelining and data streaming) that is commonplace in embedded systems design to the problems facing HPC. And of course improve the tools and libraries to make accelerated computing platforms easier for HPC programmers to pick up and use.

Leave a comment

Filed under Embed This!, Reconshmiguration

A reconfigurable tick squeezer?

This headline caught my attention today:

Vhayu Introduces Hardware Compression for Its Tick Database
Combining Vhayu Velocity with an FPGA, Squeezer compresses data by a factor of four with no performance penalty, says the vendor.

My first thoughts on seeing that headline were:

  • Is there really such a large database on ticks that hardware compression is required?
  • Would somebody actually call such a tool, “Squeezer”?
  • Is this April 1st?

On further reading, however, I realized that “tick” means “ticker”, as in market trading data.

Aha!

In fact, feed handling is a domain in which FPGAs are starting to gain significant traction. It’s all about getting the lowest possible latency: data comes directly into the system from a market feed source, and a hardware-based algorithm makes a split-second trading decision based on observed patterns. A sudden downturn in the price of key indicator stocks, for example. Or a spike in oil prices, or a drop in the Brazilian Real, or whatever.
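
To make “observed patterns” concrete, here’s a toy, software-only illustration of the kind of trigger involved: watch a stream of ticks and flag a price that drops more than some threshold below its short moving average. The prices, window and threshold are all invented; a real appliance would do this per symbol, in hardware logic, at wire speed.

    /* Toy feed-handling trigger: flag a tick that falls more than 1% below
     * the moving average of the last WINDOW ticks.  All numbers invented. */
    #include <stdio.h>

    #define WINDOW 8

    int main(void)
    {
        double ticks[] = { 100.0, 100.1, 100.2, 100.1, 100.2,
                           100.3, 100.2, 100.1, 99.0, 98.8 };
        int n = sizeof(ticks) / sizeof(ticks[0]);
        double sum = 0.0;

        for (int i = 0; i < n; i++) {
            sum += ticks[i];
            if (i >= WINDOW)
                sum -= ticks[i - WINDOW];          /* slide the window */
            if (i >= WINDOW - 1) {
                double avg = sum / WINDOW;
                if (ticks[i] < 0.99 * avg)
                    printf("tick %d: %.2f is more than 1%% below avg %.2f\n",
                           i, ticks[i], avg);
            }
        }
        return 0;
    }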

The trading house, hedge fund or bank that sees the pattern and reacts first is the one that wins. And so it’s a latency war out there. FPGAs represent one solution to the latency problem, and have been deployed in numerous trading-related market data appliances.

I like this quote attributed to Jeff Hudson, Vhayu CEO:

“It’s a hybrid software/hardware world we’re entering now, and those companies that embrace it will prosper and those that don’t will fall way behind.”

Indeed.

See also: High Frequency Traders Get Boost From FPGA Acceleration

1 Comment

Filed under News Shmews, Reconshmiguration

The Petaflop Playstation

Supercomputing has reached a new milestone with the announcement that IBM and Los Alamos National Labs have cracked the petaflop performance barrier.

For those who don’t speak in mops, bops, flops and no-ops, a petaflop is a measurement of performance that equates to performing one thousand trillion floating point calculations per second.

That’s a whole lot of math.

The new supercomputer, called Roadrunner, is built from commodity components including a reported 7,000 standard, off-the-shelf AMD Opteron processors closely coupled to some 12,000 IBM Cell Processors.

That’s less than 20,000 commodity chips. Commodity, as in standard, widely available, non-exotic. You probably don’t have an Opteron in the computer you are using right now, but you almost certainly used one (or more likely a cluster of them) just moments ago when you asked Google, or Microsoft, or Yahoo to do a web-search for you.

As for the Cell Processor (or Cell Broadband Engine as it’s more formally known) you probably know some kid who has one of those. It’s the brains inside the Sony Playstation.

Roadrunner uses the 7,000 Opteron processors to do general-purpose grid processing, while the 12,000 Cell processors serve as specialized application accelerators.

That may sound like a lot of chips, but consider this: The previous king of supercomputers, the IBM Blue Gene, has approximately half the performance of Roadrunner but uses 212,992 processors, and presumably consumes way more power than Roadrunner.

To summarize: This is exciting news for accelerated computing. Cell Processors, GPUs and FPGAs are all proving their worth in a new, hybrid multiprocessing world. The question, of course, is how do you program these things?

Leave a comment

Filed under News Shmews

A new low-power FPGA?

The power efficiency of FPGAs is either very good, or very bad, depending on who you’re talking to. If you are comparing math performance, meaning the number of integer or floating point operations performed per watt of power, then FPGAs look pretty good when placed side-by-side against a traditional processor or DSP. This is because the FPGA has less overhead; more of the transistors can be configured as parallel structures to do the real work of processing data and computing results. FPGAs can do the same work with fewer clock cycles, resulting in lower power consumption.
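
A crude back-of-the-envelope shows why the fewer-cycles argument matters. Every number below is invented for illustration; none of it comes from a datasheet.

    /* Toy results-per-joule comparison.  All figures are made up to
     * illustrate the cycles-vs-power tradeoff, not measured values. */
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical CPU: 3 GHz clock, 80 W, 4 results per cycle. */
        double cpu_results_per_sec   = 3e9 * 4;
        double cpu_results_per_joule = cpu_results_per_sec / 80.0;

        /* Hypothetical FPGA design: 200 MHz clock, 20 W, 128 parallel
         * pipelines each producing one result per cycle. */
        double fpga_results_per_sec   = 200e6 * 128;
        double fpga_results_per_joule = fpga_results_per_sec / 20.0;

        printf("CPU : %.2e results/joule\n", cpu_results_per_joule);
        printf("FPGA: %.2e results/joule\n", fpga_results_per_joule);
        return 0;
    }

Even with a clock fifteen times slower, the hypothetical FPGA comes out well ahead per joule, because far more of its transistors are doing useful work on every cycle.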

If, however, you are looking at low-power portable products such as mobile phones and games, FPGAs are not even contenders. They have a reputation as power hogs. This is because FPGAs are dominated by interconnect and have flexible, application-independent structures. It is not possible, at least not for real applications, to make use of all the transistors in an FPGA. There is leakage, and there is static power needed because of the SRAM architecture of the most common devices. And so power is wasted.

Both Xilinx and Altera have low-power FPGA families (Spartan and Cyclone, respectively) and Actel devices are also power misers. But these devices still consume too much power for use in mobile devices.

So the news that startup SiliconBlue Technologies has a new and much lower power FPGA device is notable. Their devices are small (akin to complex PLDs) but quite FPGA-like in their design.

I’m not much of an expert on FPGA process technologies, but if SiliconBlue can actually get traction in mobile devices and scale these devices up to tackle more complex algorithms… then that could be an important step for reconfigurable embedded computing.

Leave a comment

Filed under News Shmews

More FPGAs on Mars

The Phoenix Mars lander is scheduled to touch down on the surface of Mars this weekend, on Sunday. At least one Actel FPGA is on board, handling pressure and temperature data processing.

Note that Actel and Xilinx FPGAs are already rolling around on Mars, in the Spirit and Opportunity rovers.

Details of the Phoenix lander and its gadgets can be found on the JPL Mars Page.

Leave a comment

Filed under News Shmews

FPGA and DSP Co-Design

Our pals at 3L put together a nice demonstration using a Sundance DSP/FPGA development board. Check it out:

DSP/FPGA Co-Processing Demonstrates 20X Acceleration using Software-to-Hardware Design Flow

Leave a comment

Filed under News Shmews

Altera announces 40nm Stratix IV

Gee, it seems like just yesterday that a 90-nm process was a big deal…

Altera announces 40-nm Stratix IV FPGAs and HardCopy IV ASICs

Leave a comment

Filed under News Shmews

Slaughterhouse Virtex-5?

FPGAs are appearing in some, ah, interesting places. Read about Automated Meat Processing in Vision Systems Magazine.

Embedded 3-D machine-vision system guides robotic cutting of beef carcasses

The stereo vision system used in this industrial-sized nail clipper is pretty cool. The manufacturer describes it this way:

TYZX’ DeepSea G2 Stereo Vision System is a small embedded camera with low power requirements that delivers high performance visual analysis (up to 60 fps) in variable lighting conditions. Designed for widespread deployment in applications such as security and navigation, the G2 provides real-time 3D vision processing without the use of workstations…

The DeepSea G2 system is based around Xilinx FPGAs that perform image rectification, 2D and 3D quantization and other compute-intensive tasks. In a technical white paper, the system designers describe how the FPGA is combined with an ADI Blackfin processor and a custom ASIC to strike the right balance between performance and flexibility. One interesting tidbit is found in the following table, where the first and second generation products are described:

[Resource table: processing elements in the first- and second-generation DeepSea systems]

Notice that the Pentium III processor that did much of the heavy lifting in their first-generation product completely disappears in the second generation. This example points again to the need for tools and design methods that support partitioning between many different types of processing elements.

Leave a comment

Filed under Embed This!

Floating-point performance in FPGAs

Dave Strenski of Cray and Jim Simkins, Richard Walke and Ralph Wittig from Xilinx have published an updated version of Strenski’s earlier article on floating-point performance in FPGAs.

Strenski’s analysis indicates that the biggest, baddest Xilinx FPGAs today (for floating point that would be the new Virtex-5 SX240T) have significantly higher peak performance than the biggest, baddest quad-core AMD Opteron. The article doesn’t delve into the relative power consumption of the FPGA vs the Opteron. That’s where the real story lies.
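
As a reminder of how such peak numbers get built up: peak FLOPS is simply the number of floating-point operators you can fit times the clock rate they sustain. The operator count and clock below are placeholders, not figures from the article.

    /* How a peak-GFLOPS estimate is assembled.  The operator count and
     * clock rate are placeholders, not numbers from the Strenski article. */
    #include <stdio.h>

    int main(void)
    {
        int    fp_operators = 200;      /* adders + multipliers instantiated */
        double clock_hz     = 300e6;    /* clock those operators sustain     */
        double peak_gflops  = fp_operators * clock_hz / 1e9;

        printf("Peak: %.0f GFLOPS\n", peak_gflops);   /* 200 x 0.3 GHz = 60 */
        return 0;
    }

Sustained performance, and performance per watt, are of course another matter, and as noted above, that’s where the more interesting comparison lies.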

1 Comment

Filed under News Shmews

Data centers and Green IT

Here’s a reconfigurable computing blog of note I just came across. Check out this posting by Amir Hirsh about server power management: Datacenter Power Management: Subverting the Dominant Paradigm.

Last month I attended the Washington Technology Summit where power efficiency and alternative energy were big topics (along with life sciences and advanced materials). During a panel discussion on Green IT, Christian Belady of Microsoft described methods that Microsoft and its competitors use to power and cool the hundreds of thousands of CPUs that now make up a large scale data center.

During the Q&A I couldn’t help raising my hand and asking the obvious question: aren’t we attacking the wrong problem? If most of the general-purpose processors in the data centers are being applied to highly specialized problems (web search, for example, or streaming video transcoding), then why aren’t we making more use of lower-power reconfigurable hardware?

Christian had sort of a blank look (after all, his job is watching the power meter on the building, not deciding what actually goes in all those server boxes) but another panelist, Don Tilton of Spraycool knew exactly what I was getting at. He took the microphone and opined that “GPU and FPGA accelerators are a potential solution” to the problem.

Amen.

Leave a comment

Filed under Reconshmiguration

Xilinx ML507 board: “Sweet!”

Ed is in geek heaven with his new ML507 board from Xilinx. This is a board designed to be a development and prototyping vehicle for all kinds of embedded applications. It’s got a PCI Express interface, video inputs and outputs, audio, a network interface, all kinds of connectors. Too big to let you put the computer’s cover back on after installing it in a slot… but who cares?

If you are thinking about development using Xilinx FPGAs with an embedded PowerPC, then get yourself an ML507. And get a compatible C-to-hardware compiler while you’re at it.

2 Comments

Filed under Uncategorized

Filtered backprojection and FPGAs

David Pointer at SRC Computer on the use of FPGAs for Filtered Backprojection (FBP), a critical algorithm in medical imaging and other domains:

“Projecting” images in radar and medical applications

Leave a comment

Filed under News Shmews