Tag Archives: pipeline

An escalator is a beautiful thing

I was traveling this month, and had the unlucky situation of needing to take six different planes (three each way) for a trip to and from Illinois. My fault for booking late, but it was an opportunity to think about latency, throughput, optimization and reconfigurable computing.

Actually I didn’t think about any of those things while strolling through the various airports. I thought mostly about where to get a beer and a burrito between flights. But that’s the same problem.

The design of an airport is, or should be, about getting people and their bags from one place to another at a predictable speed, maximizing the number of individuals who pass through on their way from here to there. And for the most part airports are well designed for that purpose.

Note that I said predictable speed, not high speed. The airport managers might not care very much if it takes you 5 minutes or 7 minutes to get from one gate to another. What they care about is that it takes everyone pretty much the same amount of time, and that the system doesn’t get clogged up and delayed during times of peak capacity. The goal is high throughput, getting the maximum number of people through the system in a given time. This often requires a tradeoff in latency, which is the time taken for any one individual to make the journey. The same concept applies not only to airports but also to highways, to manufacturing and to pretty much every complex system out there that involves long lines of people, animals or objects requiring one or more constrained resources.

In places like Chicago O’Hare or Tokyo Narita you see this kind of throughput planning all over the place. Sometimes it doesn’t work very well, but mostly it does. You see it in the queues for checkin, in the lines for security, in the way passengers are sequenced when getting onto the planes, and in the way bags come down onto the carousels. You see it in the way you order, pay for and pick up your six dollar venti nonfat soy frappe macchiatoccino. If you could go beyond what you can see as a mere passenger, then you would find it on the taxiways and in the skies overhead, as aircraft are sequenced for landing and departure, and handed off from controller to controller as they move from sector to sector.

And what is all this? A whole lot of parallel, pipelined and interconnected systems.

Consider airport escalators for a moment. When you are on one, especially a very long one, it might seem awfully slow. But what an amazing and beautiful thing an escalator is, when you think about throughput.

I have a specific example in mind. If you take the express train from Tokyo to Narita airport, you and hundreds of other people will simultaneously arrive at a platform deep underneath the airport, somewhere around basement level 5. Everyone on that train will have one or two bags, some will have small children, some will be old people who walk slowly, some will be impatient and fast… hence it should be total chaos when the doors of that train open and people try to make their way up to the checkin lines, seven levels up. But the Narita escalators practically eliminate that chaos. They do this by forcing people to get on at a constant rate, one person, or a pair of people, for every two or three steps. At capacity there are many hundreds of people simultaneously moving up from level to level. The escalators provide a smoothing function, delivering people in an orderly manner to the lines at the counters, and at a much higher effective rate, more people per minute than could possibly be provided by elevators or stairs.

How does this relate to reconfigurable computing? Well… a non-technical friend recently asked me to explain the difference between traditional processing, using an x86 processor, for example, and parallel processing in an FPGA. I was tossing around words like “pipelining” and “throughput” without relating those concepts to the real world. Then it dawned on me that traditional, sequential processing is like an elevator, and reconfigurable processing is like an escalator.

The traditional processor is a single elevator that carries a small number of passengers (data and instructions) from one floor to another. For decades, processor vendors have focused entirely on making their elevators run faster, working to get small sequences of data and instructions from the bottom floor to the top floor as quickly as possible. They have also increased the capacity of the elevator by increasing the word size (64 bits, for example) and by adding more complex instructions. But the elevator approach has inherent limitations. At busy times of the day, an elevator becomes a bottleneck no matter how fast it runs.

If we eliminate the elevator and instead build an escalator, or better yet multiple parallel escalators, then we can move a whole lot more passengers in the same amount of time, and probably do it using a whole lot less power. No doors to open and close, not as much shuffling for position in the queue, and no snot-nosed little kid pushing every button and jamming up the system. And if our airport or system is truly reconfigurable, we can deploy exactly the right combination of parallel escalators that are needed at any given time. Shut down or eliminate unused escalators in the middle of the night, and add more of them when we expect a lot of trains or planes to come in at the same time.

And there’s something else to consider: parallel programming is sometimes described as difficult, exotic and unnatural. Something that most software programmers just aren’t ready for. Well… I dunno. Clever people have been designing lean production, package sorting and transportation systems, and shopping malls and airports, for an awfully long time and are getting pretty darn good at it. Maybe we just need to hire some of these people to write and optimize our parallel algorithms?



Filed under Reconshmiguration

Threads, pipelines and the demise of Moore’s Law

I came across an interview with Donald Knuth from June of this year, in which he throws some cold water on the current trend toward multicore computers. An excerpt:

…I might as well flame a bit about my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won’t be surprised at all if the whole multithreading idea turns out to be a flop…

Strong words, but there’s a little tidbit later in the article suggesting where Knuth’s sympathies really are:

…So why should I be so happy about the future that hardware vendors promise? They think a magic bullet will come along to make multicores speed up my kind of work; I think it’s a pipe dream. (No—that’s the wrong metaphor! “Pipelines” actually work for me, but threads don’t. Maybe the word I want is “bubble.”)

I can relate to that. Being an old-school, PDP-11 era C programmer I never really grasped the intricacies of synchronization and how to write a decent threaded algorithm. Threading was never intuitive for me. I understood it at an abstract level, but it always felt like thread libraries and APIs were an awful invention, forcing programmers to contort their code in ways that rarely matched the actual application.

I have far less trouble wrapping my head around streaming and pipelining. The model of having multiple processes independently processing streams of data is intuitive because it exists everywhere around us, in the real world (see my next posting on that). Even very complex systems with nested pipelines of varying rates can be understood conceptually by programmers, and by non-programmers as well.
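To make that intuition concrete, here is a rough sketch of the streaming model in plain Python, using threads and queues (nothing specific to Impulse C or FPGAs; the names `stage`, `DONE` and the lambdas are illustrative inventions). Each stage runs independently, and the only coupling between stages is the data stream itself:

```python
# A minimal two-stage pipeline: each stage is an independent thread,
# and the only link between stages is a queue carrying the data stream.
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(func, inbox, outbox):
    """Apply func to every item arriving on inbox; forward results."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()

# Two stages: square, then add one. No shared state, no explicit locks;
# the queues provide all the synchronization.
threading.Thread(target=stage, args=(lambda x: x * x, q1, q2)).start()
threading.Thread(target=stage, args=(lambda x: x + 1, q2, q3)).start()

for x in range(5):
    q1.put(x)
q1.put(DONE)

results = []
while True:
    item = q3.get()
    if item is DONE:
        break
    results.append(item)
# results is [1, 2, 5, 10, 17]
```

The escalator analogy holds: once the pipeline is full, both stages work simultaneously on different items, so throughput doubles even though the latency of any single item through the pipeline is unchanged.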

Over on Dobbs Code Talk there’s a blog post from James Reinders titled Pipelines/Streams offer easy parallel programming. In the posting Reinders offers the following concepts:

The “magic” which makes this all so easy for parallel programming comes from three things:

  1. to be parallel you need independent work to run in parallel: if you pipeline your work (streaming data) and you have no interdependencies other than the data streams themselves (no global side-effects) you get exactly that: independent work to run in parallel
  2. the pipeline stages themselves can be broken up to run in parallel by either data parallelism, or possibly a pipeline of their own (so nested parallelism is important)
  3. the very very sticky problem of data placement, which becomes a more and more severe problem in the future, is solved implicitly (the migration of data is very clear and very clean)
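Reinders' second point, that a stage can itself be broken up via data parallelism, can be sketched the same way: replace one stage with several identical workers draining the same input stream. (Again a hypothetical illustration in plain Python; `worker`, `DONE` and `N_WORKERS` are my own names, and note that output ordering is no longer guaranteed, a common trade-off of this transformation.)

```python
# One pipeline stage widened into several identical workers that
# all pull from the same input queue (data parallelism within a stage).
import queue
import threading

DONE = object()
N_WORKERS = 3

inbox, outbox = queue.Queue(), queue.Queue()

def worker():
    while True:
        item = inbox.get()
        if item is DONE:
            inbox.put(DONE)  # re-post the sentinel so sibling workers also stop
            return
        outbox.put(item * item)

threads = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in threads:
    t.start()

for x in range(8):
    inbox.put(x)
inbox.put(DONE)

for t in threads:
    t.join()

# Collect and sort, since the three workers finish items in any order.
results = sorted(outbox.get() for _ in range(8))
# results is [0, 1, 4, 9, 16, 25, 36, 49]
```

The data-placement point also shows up here: each item lives entirely inside whichever worker holds it, so its migration through the system is explicit and clean.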

The above makes parallel programming using pipelined processes and streaming data seem rather simple and obvious. We do a lot of that kind of programming around here (using Impulse C) and yes, it’s a very good approach when targeting massively parallel architectures such as FPGAs. Maybe it’s the only practical method at the moment. Personally I would not characterize parallel programming and the design of highly pipelined algorithms as “easy”, but tools available today, including ours, make it practical for software programmers to write such programs and target non-traditional computing devices. The analysis and optimization of deeply pipelined, high-performance applications is still a significant challenge, but this challenge can be met with improved tools, and with the more intuitive programming models that streaming and pipelining represent.


Filed under Reconshmiguration