January 19, 2009

FPGAs quietly turn 25

This year, 2009, will mark the 25th anniversary of Xilinx. Altera will turn 26, and Actel 24. (Lattice Semiconductor, also an FPGA maker, has a longer industry history than any of these but didn’t have FPGA devices until the early 1990s.)

David Manners of Electronics Weekly asks this week, Does the FPGA need a re-design?

Instead of narrowing the market applicability of its products by focusing on specific customer needs, the programmable logic industry would do better to address its fundamental problems: FPGAs use too much power and are too expensive.

It is certainly true that FPGAs are not optimal for any specific application. Lack of application optimization means there is going to be overhead – the cost of providing flexibility in the device. Overhead in size/cost, and overhead in power.

There have been quite a few alternative programmable architectures proposed and funded that would, in theory, reduce this overhead through clever interconnect strategies and through dynamic, run-time reconfiguration. And there have been an almost equal number of casualties along the way. Companies that have entered and exited this space include MorphICs, Chameleon, IP Flex, PACT, Quicksilver, Mathstar… the list goes on.

Do we need a fundamentally different kind of FPGA? Probably. But the lesson of the somewhat battered reconfigurable computing industry seems to be, take it slowly and don’t change too many things at the same time.

For example, you can change the device architecture and its mix of internal resources to reduce its power consumption, without fundamentally changing how it’s programmed. Or you can change the programming method to increase design productivity, but don’t try at the same time to dramatically change the underlying architecture.

The reason? Design teams considering FPGAs (or FPGA-like reconfigurable devices) are risk-averse. The most appealing platform for these teams will have well-proven methods of programming and the safety of a broader market – the economies of scale.

Even in their adult years, FPGAs as we know them continue to provide a competitive level of performance (relative to, say, a DSP device), with reasonably low risk to the average design team. And so they will be with us for quite a while, even with their drawbacks.

January 7, 2009

Finding Nemo at CES 2009

[Image: nemox1, screen capture of the demo]

When producing higher-level design tools for FPGAs, it’s important to know the true “pain points” of application developers – what aspects of software-to-hardware are most critical in actual projects, and what barriers there are to success with a given tool flow. There is no better way to learn this than by actually completing a project yourself, with your own tools, on a tight schedule. 

In software circles this is known as “eating your own dog food”.

About a week before the holidays I had a phone call from our friends at Xilinx, asking if we would like to participate in the Consumer Electronics Show in Las Vegas, highlighting the use of FPGAs for HD video processing.

That sounded like a great opportunity. The trouble was, we didn’t have a CES-quality demo ready to go. Something that would suggest how low-power Xilinx Spartan FPGAs could be used in the newest and grooviest consumer and automotive devices. Something other than the usual, yawn-inducing edge detection filters, decompression engines or picture-in-a-picture demos.

As an organization we’ve done a fair amount of video processing work, most of it customer-funded and military/aerospace related. We’ve created configurable filters, combined dual embedded processors with custom video coprocessors, streamed video between TI DSPs and FPGAs, and all kinds of other fun stuff. There are some folks on our team who are real hotshots at this kind of thing.

But what could we build in two or three weeks that would be really fun and different? Could we use the new Xilinx Video Starter Kit, with its DVI input and output interfaces and its Embedded Development Kit (EDK) reference designs, and actually put something together in time?

To make this more interesting, it was two weeks before the holidays, and we were already jammed up with critical customer deliverables. There was nobody available who was actually qualified to do the work. Everyone was busy.

There was only me.

By way of background: I have significant past experience with VHDL, primarily as a synthesis tools developer. And I have plenty of past experience with C programming. But to be honest, I have not written a production-quality line of code in a very long time. I’m in more of a marketing and executive management role. We have very good engineers here who probably laugh at my feeble, occasional efforts to help with new features and bug fixes.

My exposure to the modern Xilinx tools is relatively limited. I know just enough to stumble through our own Impulse CoDeveloper / EDK tutorials, by carefully following the instructions that our more expert staff have written.

I talk a good story, but I am not by any means a professional FPGA developer.

This is all a long-winded way of saying that when we received the new VSK package from Xilinx (a loaner sent the very next day) we had very little time in the lab to bring up a baseline project and begin coding an Impulse C demonstration example for it. Mostly the work would have to happen over the holiday break. In my dining room at home. With curious kids and an impatient spouse hovering nearby. (“So, this is a week off?”)

The demo I had in mind was object recognition in a 720p video stream. One evening I had come across a copy of Finding Nemo in the stack of DVDs, and it had occurred to me that Nemo was probably not so hard to pick out of a video frame in real-time, given his bright colors and stripes. Could I actually “Find Nemo” using the Xilinx hardware and Impulse C, starting with a Xilinx DVI reference example?

Fortunately the bring-up of the EDK reference examples was painless. The ACE files provided in the Flash card worked flawlessly with multiple input and output devices, verifying the hardware setup within minutes. The provided Platform Studio project for DVI passthrough built in EDK (the Xilinx Platform Studio environment), downloaded and came up perfectly the first try. I also tested the camera-input reference example and it worked, although I did not make use of that reference example for this project. (A live-action “Clown Fish in a Fishbowl” demo would have been nice… perhaps next time.)

My first effort (with the help of Mei Xu, Applications Engineer here at Impulse) was to hack into the reference example and its System Generator 2D FIR code, to attempt to insert a pre-existing Impulse C 5X5 filter in place of the FIR filter. This was moderately successful (we had edge-enhanced video output with less than two days of effort) but not particularly impressive from a functionality standpoint. Or as a high-level method of design. As I said earlier, edge detect has been done by everyone, and it’s not all that interesting. And to be honest, the HDL code that was provided in the reference example (created apparently by Xilinx System Generator) was rather obscure and probably very difficult to follow for a software person. We don’t really want Impulse C users having to muck around in that stuff.

It was then that I got the Finding Nemo idea. Mei (who had quickly gotten the edge detect working) said something like “good idea, good luck with it” and promptly left on holiday. Joe at Xilinx had also made a comment during our first call, something like “you guys should generate a pcore from your compiler”. That seemed like a good idea for productizing the design method, though maybe a bit of extra work setting up (a few days as it turned out… we already generate EDK pcores for use with MicroBlaze and PowerPC embedded processors).

During development in the subsequent week at home, I spent nearly all my development time using C and wrote no application HDL, apart from some trivial wrapper code customization for the pcore generation (based on our existing MicroBlaze PSP). 

The project was a complete success. Once I had a reliable video generation setup and had built the reference example it was mostly smooth sailing, with just the expected process of debugging and testing C code using GCC, compiling to RTL with our tools, optimizing and synthesizing, downloading and testing… and repeating this process until the demonstration was working to an acceptable level. I estimate that, not counting the time waiting for place and route to complete, I spent a total of 20 hours on the actual demo coding and testing, and then perhaps that amount in addition refining the design to make the smoothly moving “spotlight” effect that is shown in the screen capture image above.
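For readers wondering what "picking Nemo out of a frame by his bright colors" might look like in code, here is a minimal sketch in plain C. It is not the Impulse C source used in the demo. The packed-RGB frame layout, the orange threshold values, the function names and the quarter-brightness dimming are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical color test: a pixel counts as "clownfish orange" when red is
   strong, green is moderate and blue is weak. The limits are illustrative. */
int is_orange(uint8_t r, uint8_t g, uint8_t b)
{
    return r > 200 && g > 80 && g < 180 && b < 90;
}

/* Scan a packed RGB frame (3 bytes per pixel, row-major) and compute the
   centroid of all matching pixels. Returns the number of matches; cx and cy
   are written only when at least one pixel matched. */
size_t find_target(const uint8_t *rgb, int width, int height, int *cx, int *cy)
{
    uint64_t sum_x = 0, sum_y = 0;
    size_t count = 0;

    assert(rgb != NULL && cx != NULL && cy != NULL);
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            const uint8_t *p = rgb + 3 * ((size_t)y * (size_t)width + (size_t)x);
            if (is_orange(p[0], p[1], p[2])) {
                sum_x += (uint64_t)x;
                sum_y += (uint64_t)y;
                count++;
            }
        }
    }
    if (count > 0) {
        *cx = (int)(sum_x / count);
        *cy = (int)(sum_y / count);
    }
    return count;
}

/* Dim every pixel farther than `radius` from (cx, cy) to one quarter of its
   brightness, leaving a bright "spotlight" around the target. */
void spotlight(uint8_t *rgb, int width, int height, int cx, int cy, int radius)
{
    const long r2 = (long)radius * radius;

    assert(rgb != NULL && width > 0 && height > 0 && radius >= 0);
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            long dx = x - cx, dy = y - cy;
            if (dx * dx + dy * dy > r2) {
                uint8_t *p = rgb + 3 * ((size_t)y * (size_t)width + (size_t)x);
                p[0] >>= 2;
                p[1] >>= 2;
                p[2] >>= 2;
            }
        }
    }
}
```

A caller would run find_target on each frame and, when it reports matches, pass the centroid to spotlight. Both loops touch each pixel exactly once in raster order, which is the access pattern that maps naturally onto a streaming hardware pipeline.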

A block diagram of the system is shown below:

[Image: edk_blockdiagramx1, block diagram of the system]

The demonstration will be shown tomorrow and Friday at CES. There is certainly more that can be done in this application, such as providing run-time configuration from MicroBlaze, improving support for alternate resolutions, and using more intelligent pattern recognition methods. But given the time constraints and the number of tools and hardware “fiddly bits” in the complete system, the speed of bring-up was impressive and encouraging to say the least. We intend to leverage the VSK in our own product promotions.

Next step: finding my car keys.

Thank you, Xilinx!

November 29, 2008

PACT files suit against Avnet and Xilinx

In the continuing and messy, decades-old saga of programmable logic litigation, we now have this news from EE Times:

German processor firm alleges Xilinx, Avnet infringe patents

PACT is a reconfigurable computing company that has worked for eight years to promote and sell its XPP reconfigurable device technology, most recently focusing on intellectual property (IP) licensing for HD video applications. The lawsuit appears to be aimed squarely at the DSP48 blocks that now appear in every FPGA family sold by Xilinx. Obviously there is a lot at stake here for Xilinx if the suit has merit. But don’t hold your breath for a result: the case is not scheduled in court until 2011.

November 26, 2008

Supercomputing 2008

The Supercomputing conference this year was in Austin, TX. I was down there for a reconfigurable computing workshop on Monday and spent a day and a half in the exhibit hall.

The workshop on Monday had 100 or so attendees, with a mix of academic papers and FPGA industry presentations and panels. The key takeaway for me was that reconfigurable computing for HPC applications (scientific computing, bioinformatics, finance, etc) is now well-proven but there are still issues with platform and tool maturity that we all need to work out before RC is mainstream. GPU-based acceleration has gained traction fast because GPUs are viewed as safe and widely available. Everyone has a graphics card of some sort, so using one to accelerate algorithms does not seem exotic. That’s not yet the case with FPGAs.

Steve Wallach presented the Convey technology in a Monday morning session. Wallach also won the Seymour Cray award this year. For those who don’t know his history, Wallach was a character in Tracy Kidder’s “Soul of a New Machine” as well as a key innovator at Convex and HP. I like the Convey approach because it emphasizes the use of commodity processors, commodity FPGAs, and well-understood C-language and Fortran programming flows. The FPGAs in the system are used to implement accelerator “personalities” that are described by Convey as follows:

“Personalities are extensions to the x86 instruction set that are implemented in hardware and optimize performance of specific portions of an application. For example, a personality designed for seismic processing may implement 32-bit complex arithmetic instructions — and at performance levels well beyond that of a commodity processor.”

The goal appears to be a library-based approach, in which developers do not directly program the FPGAs. Convey indicates that a Personality Development Kit will be available, presumably for more advanced users or system integrators.

siXis Technology was also presented during the workshop. Their platform is called the SX1000 and is a stacked FPGA module architecture with a toroidal interconnect. Very dense, with 16 or more high-end FPGAs in a small cube. Very cool. Or very hot, as the case may be. If there is a secret sauce to the siXis approach, it must be in the sauce they use to cool all those FPGAs stacked so close together.

siXis SX1000 FPGA-based supercomputer

NVIDIA had a larger presence this year than last. The TESLA platform is getting a lot of attention. But I also overheard more than a few comments regarding Intel Larrabee being a possible NVIDIA killer. And AMD is making a lot of noise (including large advertisements in WSJ) about Fusion. 2009 could see a knock-down fight between Intel, AMD and NVIDIA for acceleration business. I’m not making any predictions here.

The Pico Computing booth featured two Impulse demos including an edge detection video demo, and our 16-FPGA options valuation demonstration. The options demo looked very nice, with 16 graphs generated by the 16 FPGAs, each of which was running an accelerated Monte Carlo simulation. Check it out:

Options valuation running on 16 FPGAs using Pico Computing EX-300

November 20, 2008

Ambric shuts down

This was a surprisingly fast end to a company that looked like a rare survivor amidst the scattered carcasses of reconfigurable computing startups. Ambric actually had working systems, had customer wins, over $30M in revenue (according to VentureWeb) and academic/research partners that believed in the technology. What they apparently lacked was the confidence and patience of the investment community, which had put in something like $36M to date and was perhaps spooked by the $200M failure of MathStar, a Portland-area reconfigurable computing neighbor, earlier this year.

If there’s any lesson in this, it’s that the funding environment for new reconfigurable computing devices is getting challenging, and the best bets are on companies making use of established FPGA devices, imperfect though FPGAs may be for RC.

October 9, 2008

Stars of the tiny screen

Last month Michael Kreeger and I put together a 4-part YouTube demonstration using a Pico Computing EX-300 card. The demo walks through the basics of Impulse C, shows how the tools are used for software simulation and hardware generation, and shows how to download and run a finished FPGA bitstream on the card.

Pico Computing EX-300 - 16 FPGAs on one PCI Express card

(The YouTube video is not very high resolution, making the screen text difficult to read. A better version with higher-res video and better sound can be found here.)

This demonstration shows how PCI Express FPGA cards are making prototyping of high performance, hardware-accelerated algorithms a whole lot easier than in the past. I predict we’ll be doing more YouTube tutorials in the future, and showing some rather impressive acceleration projects as FPGA-based platforms continue to evolve.

September 1, 2008

Progeniq accelerates animation rendering

There is a press release from Progeniq on the wires today, announcing their new RenderBoost product for higher performance CGI and image effects rendering.

Progeniq is a Singapore-based company focused on FPGA-based reconfigurable computing applications. They previously announced a product called BioBoost oriented toward bioinformatics acceleration. By following BioBoost with a product in a very different target industry, Progeniq appears to be establishing themselves as application-independent reconfigurable computing experts.

But why image rendering? Aren’t there some heavy players in that already, most notably Mental Images and Pixar?

There’s no question that image rendering in the entertainment industry is a big deal. Applications such as Pixar’s RenderMan/PRMan are typically run on clusters and perform such things as solid modeling, ray tracing, shading, texturing, hidden object removal and many other tasks. These tasks can be parallelized and scaled up on multiple processors, but nonetheless the rendering of a single frame of a movie such as Ratatouille or The Incredibles can require vast amounts of CPU time. In fact it has been reported that Ratatouille required over 1500 CPU-years to render.

Obviously, anything that can speed image rendering is of great interest to the business. Not just the movie business, but also the advertising industry as well as animated solid models for engineering, science, and medical applications. But what might make this particular industry most attractive to a company like Progeniq is the existence of standards (de facto and real) and the opportunity to hook into existing tools via APIs and animation workflow tools.

There’s an important lesson here, I think. Any dramatically new technology, even one based on commodity devices such as FPGAs, is much easier to promote if it ties into existing and well-understood tools and workflows. If it can be as simple as plugging in an extra box or a PCI Express card and adding a few simple API calls to an existing application… well that seems like a darned easy sale to make. Good luck to Progeniq!

August 28, 2008

Intel announces they are going to Tukwila*

I’m guessing the marketing folks down in Portland are not so familiar with Seattle-area urban lore…

Intel Stacks Up the Chips for Your Future PC

“The latest chip in Intel’s Itanium family for server and high-performance computing systems, Tukwila, is expected to be delivered late this year to servermakers. Systems using the chip will ship in late 2009.

“Tukwila is a 65-nanometer chip and the first quad-core member of the Itanium product family. The processor has more than 2 billion transistors on it.”

For a Seattle-area inside joke, see also: Urban Dictionary

July 21, 2008

An escalator is a beautiful thing

I was traveling this month, and had the unlucky situation of needing to take six different planes (three each way) for a trip to and from Illinois. My fault for booking late, but it was an opportunity to think about latency, throughput, optimization and reconfigurable computing.

Actually I didn’t think about any of those things while strolling through the various airports. I thought mostly about where to get a beer and a burrito between flights. But that’s the same problem.

The design of an airport is, or should be, about getting people and their bags from one place to another at a predictable speed, maximizing the number of individuals who pass through on their way from there to there. And for the most part they are well designed for that purpose.

Note that I said predictable speed, not high speed. The airport managers might not care very much if it takes you 5 minutes or 7 minutes to get from one gate to another. What they care about is that it takes everyone pretty much the same amount of time, and that the system doesn’t get clogged up and delayed during times of peak capacity. The goal is high throughput, getting the maximum number of people through the system in a given time. This often requires a tradeoff in latency, which is the time taken for any one individual to make the journey. The same concept applies not only to airports but also to highways, to manufacturing and to pretty much every complex system out there that involves long lines of people, animals or objects requiring one or more constrained resources.

In places like Chicago O’Hare or Tokyo Narita you see this kind of throughput planning all over the place. Sometimes it doesn’t work very well, but mostly it does. You see it in the queues for checkin, in the lines for security, in the way passengers are sequenced when getting onto the planes, and in the way bags come down onto the carousels. You see it in the way you order, pay for and pick up your six dollar venti nonfat soy frappe macchiatoccino. If you could go beyond what you can see as a mere passenger, then you would find it on the taxiways and in the skies overhead, as aircraft are sequenced for landing and departure, and handed off from controller to controller as they move from sector to sector.

And what is all this? A whole lot of parallel, pipelined and interconnected systems.

Consider airport escalators for a moment. When you are on one, especially a very long one, it might seem awfully slow. But what an amazing and beautiful thing an escalator is, when you think about throughput.

I have a specific example in mind. If you take the express train from Tokyo to Narita airport, you and hundreds of other people will simultaneously arrive at a platform deep underneath the airport, somewhere around basement level 5. Everyone on that train will have one or two bags, some will have small children, some will be old people who walk slowly, some will be impatient and fast… hence it should be total chaos when the doors of that train open and people try to make their way up to the checkin lines, seven levels up. But the Narita escalators practically eliminate that chaos. They do this by forcing people to get on at a constant rate, one person, or a pair of people, for every two or three steps. At capacity there are many hundreds of people simultaneously moving up from level to level. The escalators provide a smoothing function, delivering people in an orderly manner to the lines at the counters, and at a much higher effective rate, more people per minute than could possibly be provided by elevators or stairs.

How does this relate to reconfigurable computing? Well… a non-technical friend recently asked me to explain the difference between traditional processing, using an x86 processor, for example, and parallel processing in an FPGA. I was tossing around words like “pipelining” and “throughput” without relating those concepts to the real world. Then it dawned on me that traditional, sequential processing is like an elevator, and reconfigurable processing is like an escalator.

The traditional processor is a single elevator that carries a small number of passengers (data and instructions) from one floor to another. For decades, processor vendors have focused entirely on making their elevators run faster, working to get small sequences of data and instructions from the bottom floor to the top floor as quickly as possible. They have also increased the capacity of the elevator by increasing the word size (64 bits, for example) and by adding more complex instructions. But the elevator approach has inherent limitations. At busy times of the day, an elevator becomes a bottleneck no matter how fast it runs.

If we eliminate the elevator and instead build an escalator, or better yet multiple parallel escalators, then we can move a whole lot more passengers in the same amount of time, and probably do it using a whole lot less power. No doors to open and close, not as much shuffling for position in the queue, and no snot-nosed little kid pushing every button and jamming up the system. And if our airport or system is truly reconfigurable, we can deploy exactly the right combination of parallel escalators that are needed at any given time. Shut down or eliminate unused escalators in the middle of the night, and add more of them when we expect a lot of trains or planes to come in at the same time.
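To put rough numbers on the analogy, here is a small back-of-the-envelope sketch in C. The function names and every figure in the usage note that follows are made up for illustration; they are not measurements from Narita or anywhere else.

```c
#include <assert.h>

/* People per minute moved by an elevator that carries `capacity` riders and
   needs `round_trip_s` seconds to load, travel, unload and come back. */
double elevator_throughput(int capacity, double round_trip_s)
{
    assert(capacity > 0 && round_trip_s > 0.0);
    return 60.0 * capacity / round_trip_s;
}

/* People per minute moved by an escalator that accepts one rider every
   `board_interval_s` seconds. The ride time does not appear here: it sets
   the latency of each rider, not the throughput of the system. */
double escalator_throughput(double board_interval_s)
{
    assert(board_interval_s > 0.0);
    return 60.0 / board_interval_s;
}

/* Seconds until the last of `people` riders reaches the top by elevator.
   Batches leave one round trip apart; the last batch still needs `ride_s`. */
double elevator_drain_s(int people, int capacity, double round_trip_s, double ride_s)
{
    assert(people > 0 && capacity > 0);
    int trips = (people + capacity - 1) / capacity; /* ceiling division */
    return (trips - 1) * round_trip_s + ride_s;
}

/* Seconds until the last of `people` riders reaches the top by escalator.
   Riders board one interval apart; the last one still needs `ride_s`. */
double escalator_drain_s(int people, double board_interval_s, double ride_s)
{
    assert(people > 0 && board_interval_s > 0.0);
    return (people - 1) * board_interval_s + ride_s;
}
```

With a hypothetical 12-person elevator on a 60-second round trip and a 20-second ride, 600 passengers take 2960 seconds to clear. A hypothetical escalator that boards one rider every 1.5 seconds with a 90-second ride clears the same crowd in 988.5 seconds. Each escalator rider waits longer in transit (higher latency), yet the trainload is gone in a third of the time (higher throughput).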

And there’s something else to consider: parallel programming is sometimes described as difficult, exotic and unnatural. Something that most software programmers just aren’t ready for. Well… I dunno. Clever people have been designing lean production, package sorting and transportation systems, and shopping malls and airports, for an awfully long time and are getting pretty darn good at it. Maybe we just need to hire some of these people to write and optimize our parallel algorithms?

July 20, 2008

Threads, pipelines and the demise of Moore’s Law

I came across an interview with Donald Knuth from June of this year, in which he throws some cold water on the current trend toward multicore computers. An excerpt:

…I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop…

Strong words, but there’s a little tidbit later in the article suggesting where Knuth’s sympathies really are:

…So why should I be so happy about the future that hardware vendors promise? They think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream. (No—that’s the wrong metaphor! “Pipelines” actually work for me, but threads don’t. Maybe the word I want is “bubble.”)

I can relate to that. Being an old-school, PDP-11 era C programmer I never really grasped the intricacies of synchronization and how to write a decent threaded algorithm. Threading was never intuitive for me. I understood it at an abstract level, but it always felt like thread libraries and APIs were an awful invention, forcing programmers to contort their code in ways that rarely matched the actual application.

I have far less trouble wrapping my head around streaming and pipelining. The model of having multiple processes independently processing streams of data is intuitive because it exists everywhere around us, in the real world (see my next posting on that). Even very complex systems with nested pipelines of varying rates can be understood conceptually by programmers, and by non-programmers as well.

Over on Dobbs Code Talk there’s a blog post from James Reinders titled Pipelines/Streams offer easy parallel programming. In the posting Reinders offers the following concepts:

The “magic” which makes this all so easy for parallel programming comes from three things:

  1. to be parallel you need independent work to run in parallel: if you pipeline your work (streaming data) and you have no interdependencies other than the data streams themselves (no global side-effects) you get exactly that: independent work to run in parallel
  2. the pipeline stages themselves can be broken up to run in parallel by either data parallelism, or possibly a pipeline of their own (so nested parallelism is important)
  3. the very very sticky problem of data placement, which becomes a more and more severe problem in the future, is solved implicitly (the migration of data is very clear and very clean)

The above makes parallel programming using pipelined processes and streaming data seem rather simple and obvious. We do a lot of that kind of programming around here (using Impulse C) and yes, it’s a very good approach when targeting massively parallel architectures such as FPGAs. Maybe it’s the only practical method at the moment. Personally I would not characterize parallel programming and the design of highly pipelined algorithms as “easy”, but tools available today, including ours, make it practical for software programmers to write such programs and target non-traditional computing devices. The analysis and optimization of deeply pipelined, high-performance applications is still a significant challenge, but this challenge can be met with improved tools, and with the more intuitive programming models that streaming and pipelining represent.
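As a rough illustration of the streaming style Reinders describes, here is a cycle-by-cycle model of a three-stage pipeline in plain C. It is a software sketch only, not Impulse C and not generated hardware. The stage functions, the slot structure and the run_pipeline name are hypothetical. Each pass through the loop stands in for one clock cycle in which all stages work at once.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_STAGES 3

/* Output register of one stage: a value plus a flag saying it holds data. */
typedef struct {
    int value;
    int valid;
} slot_t;

typedef int (*stage_fn)(int);

/* Illustrative stages. Each depends only on its input value, so there are
   no side effects shared between stages. */
static int stage_scale(int x)  { return x * 2; }
static int stage_offset(int x) { return x + 1; }
static int stage_square(int x) { return x * x; }

/* Stream `n` inputs through the pipeline. Results are written to `out` in
   order. Returns the number of simulated cycles. */
long run_pipeline(const int *in, int *out, size_t n)
{
    static const stage_fn stages[NUM_STAGES] = {
        stage_scale, stage_offset, stage_square
    };
    slot_t regs[NUM_STAGES] = { { 0, 0 } };
    size_t fed = 0, done = 0;
    long cycles = 0;

    assert(in != NULL && out != NULL);
    while (done < n) {
        /* Update later stages first, so each one consumes what its
           predecessor produced in the previous cycle. */
        for (int s = NUM_STAGES - 1; s > 0; s--) {
            regs[s].valid = regs[s - 1].valid;
            if (regs[s].valid)
                regs[s].value = stages[s](regs[s - 1].value);
        }
        /* The first stage takes the next input item, if any remain. */
        regs[0].valid = (fed < n);
        if (regs[0].valid)
            regs[0].value = stages[0](in[fed++]);

        cycles++;
        if (regs[NUM_STAGES - 1].valid)
            out[done++] = regs[NUM_STAGES - 1].value;
    }
    return cycles;
}
```

Five inputs finish in 7 cycles (the number of items plus the number of stages, minus one) rather than the 15 stage-steps a strictly sequential version would take. The first result appears after three cycles of latency, and after that one result emerges per cycle. Because the only thing passed between stages is the data itself, any stage could be replicated or split further without touching the others.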