## High-Performance Reconfigurable Computing

Tarek El-Ghazawi

- Director, Institute for Massively Parallel Applications and Computing Technology (IMPACT)
- Co-Director, NSF Center for High-Performance Reconfigurable Computing (CHREC)

The George Washington University, Washington, DC

ICFPT07, 12/11/07







## Top 500 Supercomputers

| Rank | Site | Computer | Processors | Year | R<sub>max</sub> (GFlops) | R<sub>peak</sub> (GFlops) |
|------|------|----------|------------|------|--------------------------|---------------------------|
| 1 | DOE/NNSA/LLNL, United States | eServer Blue Gene Solution, IBM | 212992 | 2007 | 478200 | 596378 |
| 2 | Forschungszentrum Juelich (FZJ), Germany | Blue Gene/P Solution, IBM | 65536 | 2007 | 167300 | 222822 |
| 3 | SGI/New Mexico Computing Applications Center (NMCAC), United States | SGI Altix ICE 8200, Xeon quad core 3.0 GHz, SGI | 14336 | 2007 | 126900 | 172032 |
| 4 | Computational Research Laboratories, TATA SONS, India | Cluster Platform 3000 BL460c, Xeon 53xx 3 GHz, Infiniband, HP | 14240 | 2007 | 117900 | 170880 |
| 5 | Government Agency, Sweden | Cluster Platform 3000 BL460c, Xeon 53xx 2.66 GHz, Infiniband, HP | 13728 | 2007 | 102800 | 146430 |
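A common way to read this table is the ratio of R<sub>max</sub> (sustained LINPACK performance) to R<sub>peak</sub> (theoretical peak). The sketch below computes that ratio from the figures listed above; the site labels, variable names and the `efficiency` helper are illustrative and not part of the original slides.

```python
# (site label, Rmax, Rpeak): values copied from the table above.
systems = [
    ("DOE/NNSA/LLNL", 478200, 596378),
    ("FZJ", 167300, 222822),
    ("NMCAC", 126900, 172032),
    ("TATA SONS CRL", 117900, 170880),
    ("Government Agency, Sweden", 102800, 146430),
]


def efficiency(r_max, r_peak):
    """Fraction of the theoretical peak that is actually sustained."""
    return r_max / r_peak


efficiencies = {name: efficiency(r_max, r_peak) for name, r_max, r_peak in systems}

for name, eff in efficiencies.items():
    print(f"{name:28s} {eff:6.1%}")
```

Both Blue Gene systems sustain a larger share of their peak than the three Xeon-based clusters in this list.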





## Synergism between μP and RPs

| μP | RP (FPGA-based) |
|----|-----------------|
| Software → Control Flow (von Neumann) | Hardware → Data Flow |
| Temporal: reuse of fixed hardware | Spatial: unfolding parallel operations with changeable hardware |
| Coarse-Grain | Fine-Grain |
| Very Fast | Relatively Slow |
| Saturating Rate | Increasing Speed |
| Relatively Easy (S.W./Parallel Programming) | Harder |
| COTS, multipurpose | COTS, multipurpose |



## What's New in the Virtex-5 FPGA Family

| Feature/capability (LX Platform) | Virtex-5 family | Virtex-4 family | Virtex-5 benefit |
|----------------------------------|-----------------|-----------------|------------------|
| Process Technology | 65 nm, 1.0 V V<sub>CC</sub>, triple-oxide | 90 nm, 1.2 V V<sub>CC</sub>, triple-oxide | Higher density and performance with lower power and cost |
| LUT | Real 6-input LUT with 6 independent inputs | 4-input LUT | Fewer logic levels: higher density and speed, and lower power |
| Distributed RAM | 256 bits per CLB | 64 bits per CLB | More memory |
| Shift Registers (SRL) | 128-bit in one CLB | 64-bit in one CLB | Deeper pipelines |
| Interconnect | New diagonal routing | Segmented routing | Fast, predictable routing |
| Clock Management | 550 MHz, PLL and DCM | 500 MHz, DCM | Higher speed; PLL: lower jitter; DCM: flexible clock synthesis |
| Block RAM/FIFO with ECC | 550 MHz, 36 Kbits per block (2 x 18 Kb) with power saving circuits | 500 MHz, 18 Kbits per block | Higher speed, more memory, low power |
| DSP Blocks | 550 MHz, 25 x 18-bit MAC, plus bit-wise comparator | 500 MHz, 18 x 18-bit MAC | Higher performance; higher precision using 50% fewer slices |
| DSP Blocks (power) | 1.38 mW/100 MHz @ 38% toggle rate | 2.3 mW/100 MHz @ 38% toggle rate | Lower power |









## Reconfigurable Computing Boards (Accelerators)

- Many boards per node can be supported
- A host program (e.g. in C) interfaces the user (and the μP) with the board through a board API
- Driver API functions may include Reset, Open, Close, Set Clocks, DMA, Read, Write, Download Configurations, Interrupt, and Readback
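The call sequence a host program typically follows can be sketched with a mock board. Everything here is hypothetical: `MockBoard`, its method names, the `doubler.bit` file name and the "doubling" behavior are stand-ins for illustration and do not correspond to any vendor's actual API.

```python
class MockBoard:
    """Hypothetical stand-in for a vendor board API (not a real driver).

    The simulated configuration simply doubles every word written to it.
    """

    def __init__(self):
        self.is_open = False
        self.configured = False
        self.configuration = None
        self._buffer = []

    def open(self):
        self.is_open = True

    def reset(self):
        self._buffer = []

    def download_configuration(self, name):
        # A real API would load a bitstream; here we only record the name.
        self.configuration = name
        self.configured = True

    def write(self, words):
        if not (self.is_open and self.configured):
            raise RuntimeError("board must be open and configured before write")
        self._buffer = [2 * w for w in words]  # stand-in for the hardware kernel

    def read(self, count):
        return self._buffer[:count]

    def close(self):
        self.is_open = False


def run_on_board(words):
    """Host-side sequence: open, reset, configure, write, read, close."""
    board = MockBoard()
    board.open()
    try:
        board.reset()
        board.download_configuration("doubler.bit")  # hypothetical file name
        board.write(words)
        return board.read(len(words))
    finally:
        board.close()


print(run_on_board([1, 2, 3]))
```

The `try`/`finally` ensures the board is closed even if a transfer fails, which matters when several boards per node are shared between programs.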














































































































































|              | SGI® RASC™ Module (Ver. 1) | SGI® RASC™ RC100 Blade |
|--------------|----------------------------|------------------------|
| FPGA         | Xilinx Virtex II-6000 | Xilinx Virtex-4 LX200 |
| No. of FPGAs | One per brick | Two per blade |
| Host System  | SGI® Altix® 3700 Bx2 or 350<br>Silicon Graphics Prism™ | SGI® Altix® 4000<br>SGI® Altix® 3700 Bx2 or 350 *<br>Silicon Graphics Prism™ *+ |
| Memory       | 16 MB QDR SRAM | 80 MB QDR SRAM or 20 GB DDR2 SDRAM |
| I/O          | Dual NUMAlink™ 4 ports | Dual NUMAlink™ 4 ports |
| Max Config   | Up to 2 units per system | Up to 8 RC100 blades per system<br>More available with custom configuration |
| Dimensions   | Rack-mountable form factor: EIA slide-mountable, 2U (3.5" H x 19" W x 26" D) | Blade form factor: 10U Altix® 4000 IRU, up to 8 RC100 blades per IRU<br>Rack-mountable form factor: 2 blade slot chassis, 3U (5.25" H x 19" W x 26" D) |
| O/S          | Linux® OS (on host server) | Linux® OS (on host server) |























|               | x86-only   | Proprietary | x86 Custom/Proprietary | Torrenza    |
|---------------|------------|-------------|------------------------|-------------|
| TCO           | Low        | High        | Medium-High            | Low-Medium  |
| Flexibility   | Low        | High        | Medium                 | High        |
| Scalability   | High       | Low         | Medium                 | High        |
| Manageability | High       | Medium      | Medium                 | High        |
| Performance   | Low-Medium | High        | Medium-High            | Medium-High |

Source: [In-Stat, 5/07]

## Public Torrenza Participants

| Company | Market Segment | Coherent License | Product(s) in Development |
|---------|----------------|------------------|---------------------------|
| 3Leaf Systems | Systems | Yes | Virtual I/O server |
| ACTIV Financial | Software | No | Market data applications |
| AMI | Software | No | BIOS & software development tools |
| Altera | Silicon | No | FPGAs |
| Bay Microsystems | Silicon | No | Network processors |
| Cadence | Design | No | IP for HT interface & design tools for 90nm, 65nm, and 45nm |
| Celoxica | Software | No | Software compiler, RTS, & FPGA programming tools |
| Commex Technologies | Silicon | No | Core-logic chipsets |
| Cray | Systems | Yes | …ors & HPC systems |
| DRC Computer | Silicon | No | Coprocessors |
| Flextronics | Design/Manufacturing | No | … manufacturing services |
| HP | Systems | No | … with HTX slots |
| IBM | Systems | No | Systems |
| Lattice Semiconductor | Silicon | No | FPGAs |
| Liquid Computing | Systems | No | Scalable systems |
| Microway | Systems | No | Systems using DRC FPGAs |
| NetLogic | Silicon | No | NET7 content accelerator |
| Newisys / Panta Systems | Systems | No | Coherent HT fabric / Scalable systems |
| Phoenix Technologies | Software | No | BIOS & software development tools |
| Qlogic | Silicon | No | InfiniBand I/O |
| RapidMind | Software | No | Development suite |
| Raza Microelectronics | Silicon | Yes | MIPS-based processors |
| SRC Computers | Silicon | No | FPGAs; … processors |
| Sun Microsystems | Systems | No | Scalable systems |
| Tarari | Silicon | No | … inspection & media processors |
| U. Mannheim | Silicon | Yes | …nce designs, HT & CHT op… |
| Xilinx | Silicon | No | FPGAs |
| XtremeData | Silicon | No | FPGAs |

Source: [AMD, 5/07]











## Current and Future of XtremeData

- Only company that supp… Intel accelerators
- Chosen by Intel to receiv…

| Processor | Socket | Features | Availability |
|-----------|--------|----------|--------------|
| AMD | Socket E | 2S180 | Now |
| AMD | Socket F | 2S130 and 2S180<br>32MB QDRII<br>20MB/S Mem B/W | Q4 2007 |
| Intel | Dual Processor | Scalable footprint<br>3S80E to 3S340<br>Any combo of two + Bridge<br>17MB/S Mem B/W | Q4 2007 |
| Intel | Multi-processor | Scalable footprint<br>3S80E to 3S340<br>Any combo of two + Bridge<br>17MB/S Mem B/W | Q1 2008 |











## MAP Routines

- Microprocessor side
  - `.c` file
  - Function prototype
    - `void subr(int64_t*, int);`
  - Allocation of MAP
    - `int map_allocate(int nm);`
    - `int map_free(int nm);`
  - Calling MAP function
    - `subr(array, mapnum);`
- MAP side
  - `.mc` file
  - Function implementation
    - `void subr(int64_t A[], int mn) { /* code goes here */ }`



## Parallel Sections

<pre>
#pragma src parallel sections
{
    #pragma src section
    {
        sum1 = a + b;
    }
    #pragma src section
    {
        sum2 = a - b;
    }
    #pragma src section
    {
        /* ... */
    }
}
</pre>
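As a software analogy only, the two sections shown above can be mimicked with threads. This is not SRC Carte code: on the MAP the sections become concurrent hardware, whereas here `parallel_sections` is a hypothetical helper, and the sketch assumes (as with OpenMP-style sections) that the region finishes only once every section has finished.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sections(*sections):
    """Run each zero-argument callable concurrently; return results in order.

    Leaving the `with` block waits for all submitted work, which plays the
    role of the join at the end of the parallel-sections region.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(sections))) as pool:
        futures = [pool.submit(section) for section in sections]
        return [f.result() for f in futures]


a, b = 7, 3
sum1, sum2 = parallel_sections(
    lambda: a + b,  # first section
    lambda: a - b,  # second section
)
print(sum1, sum2)
```

The two sections are independent (neither reads the other's result), which is what makes it safe to run them side by side.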



## Data storage

- Scalar values can be stored in the "registers": memory created on-chip from LUTs
  - `float val1, val2;`
- Arrays can be stored in OBM
  - `OBM_BANK_A (AL, long long, 128)`
  - `OBM_BANK_B_2_arrays (Bi, int64_t, 128, double Bd, 2048)`
  - accessible as `AL[i]`, `Bi[j]`, `Bd[k]`
- or BRAM
  - `int Ci[128];`
  - `float Cd[2048];`
  - accessible as `Ci[i]`, `Cd[j]`













## What is Impulse C?

- Not a new language
  - A Subset of ISO C + a library, just like MPI
- A library of functions compatible with standard C
  - Functions for application partitioning
  - Functions for creating and configuring the application architecture
    - Functions for creating processes and streams
    - Functions for connecting streams
    - Functions for mapping into the vendor platform
  - Functions for desktop simulation and instrumentation
- A software-to-hardware compiler







## Elements of an Impulse-C Application

- main()
  - Entry point for the software side of the application
- Configuration function, e.g. config()
  - Defines the parallel Impulse C processes
  - Creates streams
  - Connects streams
- co\_initialize()
  - Creates the entire application H/W architecture targeting a specific platform
- co\_execute()
  - Starts the parallel Impulse C processes
- One or more Impulse C processes
  - Define the behavior of the application, including test producer and consumer functions as required
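The process-and-stream structure listed above can be mimicked in plain Python with threads and queues. This is an analogy, not Impulse C: `config_and_run`, the `END` sentinel and the three process functions are hypothetical names, and no Impulse C library call is used.

```python
import queue
import threading

END = object()  # sentinel marking end of stream (an illustrative convention)


def producer(out_stream, values):
    """Test producer: feeds input data into the first stream."""
    for v in values:
        out_stream.put(v)
    out_stream.put(END)


def square(in_stream, out_stream):
    """Stands in for the process that would be compiled to hardware."""
    while (v := in_stream.get()) is not END:
        out_stream.put(v * v)
    out_stream.put(END)


def consumer(in_stream, results):
    """Test consumer: collects everything that arrives on the stream."""
    while (v := in_stream.get()) is not END:
        results.append(v)


def config_and_run(values):
    """Create streams and processes, connect them, then start them."""
    s1, s2 = queue.Queue(), queue.Queue()
    results = []
    processes = [
        threading.Thread(target=producer, args=(s1, values)),
        threading.Thread(target=square, args=(s1, s2)),
        threading.Thread(target=consumer, args=(s2, results)),
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    return results


print(config_and_run([1, 2, 3, 4]))
```

Keeping the producer and consumer as separate processes mirrors the desktop-simulation workflow: the middle process can later be retargeted without touching the test harness.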



























## **Data Driven Model**

```
(int:33, int:32) sqradd(int:32 s, int:16 a)
{
    sum = s + sqr;
    sqr = a*a;
} (sum, sqr);

uint:22<30> main() //returns a list of 30 22-bit items
{
    uint:22 prev = 1;
    uint:22 fib = 1;
    uint:22<30> fibonacci = for(i in <1..30>)
    {
        fib = fib+prev;
        prev = fib;
    } ><fib;
} fibonacci;
```
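In the `sqradd` listing, `sum` is written before `sqr` although it depends on it; in a data-driven model a statement fires when its operands are available, not when it appears in the text. The sketch below emulates that idea with a tiny dependency-driven evaluator. The `evaluate` helper and the dictionary layout are assumptions made for illustration and say nothing about how the Mitrion tools work internally.

```python
def evaluate(definitions, inputs):
    """Evaluate single-assignment definitions in data-dependency order.

    `definitions` maps a name to (function, argument names). A definition
    fires once all of its arguments have values, regardless of where it
    appears in the dict.
    """
    values = dict(inputs)
    pending = dict(definitions)
    while pending:
        ready = [
            name
            for name, (_, args) in pending.items()
            if all(arg in values for arg in args)
        ]
        if not ready:
            raise ValueError("unresolvable dependencies: " + ", ".join(pending))
        for name in ready:
            func, args = pending.pop(name)
            values[name] = func(*(values[arg] for arg in args))
    return values


def sqradd(s, a):
    # Same textual order as the listing: sum is written before sqr.
    definitions = {
        "sum": (lambda s, sqr: s + sqr, ("s", "sqr")),
        "sqr": (lambda a: a * a, ("a",)),
    }
    out = evaluate(definitions, {"s": s, "a": a})
    return out["sum"], out["sqr"]


print(sqradd(10, 3))
```

Swapping the two entries in `definitions` gives the same result, which is the point of the data-driven model.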



## Loops and Collections

|         | List       | Vector        |
|---------|------------|---------------|
| foreach | Pipelined  | Wide parallel |
| for     | Sequential | Unrolled      |





















Dual-port mode supports the following simultaneous operations:

- Port 1 read / Port 2 read
- Port 1 write / Port 2 read

In dual-port mode, the following are not supported simultaneously and will generate an error:

- Port 1 write / Port 1 read

Latency:

- In single-port mode, the read and write latency is 1 cycle.
- In dual-port mode, the read and write latency is 2 cycles.

Parameters: Memory Name, Memory Depth, Initial Value Vector































































Example:

- Find the best pairwise alignment of GAATC and CATAC

| Candidate | Sequence 1 | Sequence 2 |
|-----------|------------|------------|
| 1 | `GAATC` | `CATAC` |
| 2 | `GAAT-C` | `C-ATAC` |
| 3 | `-GAAT-C` | `C-A-TAC` |
| 4 | `GAATC-` | `CA-TAC` |
| 5 | `GAAT-C` | `CA-TAC` |
| 6 | `GA-ATC` | `CATA-C` |

- We need a way to measure the quality of a candidate alignment
- Alignment scores are based on:
  - a substitution matrix
  - a gap penalty

A hypothetical substitution matrix:

|   | A  | C  | G  | T  |
|---|----|----|----|----|
| A | 10 | -5 | 0  | -5 |
| C | -5 | 10 | -5 | 0  |
| G | 0  | -5 | 10 | -5 |
| T | -5 | 0  | -5 | 10 |

Scoring `GAAT-C` against `CA-TAC` column by column gives -5 + 10 + ? + 10 + ? + 10, where each ? is the gap penalty.















| Implementation  | Platform            | Sequence type | Engines per chip | Expected throughput (GCUPS) | Expected speedup | Measured throughput (GCUPS)  | Measured speedup            |
|-----------------|---------------------|---------------|------------------|-----------------------------|------------------|------------------------------|-----------------------------|
| FASTA SSEARCH34 | Opteron, 2.4 GHz    | DNA           |                  | NA                          | NA               | 0.065                        | 1                           |
| FASTA SSEARCH34 | Opteron, 2.4 GHz    | Protein       |                  | NA                          | NA               | 0.130                        | 1                           |
| GWU             | SRC, 100 MHz (32x1) | DNA           | 1                | 3.2                         | 49.2             | 3.19 → 12.2 (1→4 chips)      | 49 → 188 (1→4 chips)        |
| GWU             | SRC, 100 MHz (32x1) | DNA           | 4                | 12.8                        | 197              | 12.4 → 42.7 (1→4 chips)      | 191 → 656 (1→4 chips)       |
| GWU             | SRC, 100 MHz (32x1) | DNA           | 8                | 25.6                        | 394              | 24.1 → 74 (1→4 chips)        | 371 → 1138 (1→4 chips)      |
| GWU             | SRC, 100 MHz (32x1) | Protein       |                  | 3.2                         | 24.6             | 3.12 → 11.7 (1→4 chips)      | 24 → 90 (1→4 chips)         |
| GWU             | XD1, 200 MHz (32x1) | DNA           | 1                | 6.4                         | 98               | 5.9 → 32 (MPI, 1→6 nodes)    | 91 → 492 (MPI, 1→6 nodes)   |
| GWU             | XD1, 200 MHz (32x1) | DNA           | 4                | 25.6                        | 394              | 23.3 → 120.7 (MPI, 1→6 nodes) | 359 → 1857 (MPI, 1→6 nodes) |
| GWU             | XD1, 200 MHz (32x1) | DNA           | 8                | 51.2                        | 788              | 45.2 → 181.6 (MPI, 1→6 nodes) | 695 → 2794 (MPI, 1→6 nodes) |
| GWU             | XD1, 200 MHz (32x1) | Protein       |                  | 6.4                         | 49               | 5.9 → 34 (MPI, 1→6 nodes)    | 45 → 262 (MPI, 1→6 nodes)   |

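The expected figures in the throughput table can be reproduced with simple arithmetic. The sketch below assumes that "(32x1)" means 32 cell updates per clock cycle per engine. That reading is an assumption, not something the slide states, but it matches the expected column. Speedup is taken as throughput divided by the software baseline (0.065 GCUPS for DNA, 0.130 GCUPS for protein). The function names are hypothetical.

```python
# Software baselines from the table (FASTA SSEARCH34 on a 2.4 GHz Opteron).
BASELINE_GCUPS = {"DNA": 0.065, "Protein": 0.130}


def expected_gcups(clock_mhz, cells_per_clock=32, engines=1):
    """Billions of cell updates per second, assuming one update per cell per clock."""
    return clock_mhz * 1e6 * cells_per_clock * engines / 1e9


def speedup(gcups, baseline_gcups):
    """Speedup relative to the software baseline."""
    return gcups / baseline_gcups


for platform, clock_mhz in (("SRC", 100), ("XD1", 200)):
    for engines in (1, 4, 8):
        gcups = expected_gcups(clock_mhz, engines=engines)
        ratio = speedup(gcups, BASELINE_GCUPS["DNA"])
        print(f"{platform} DNA, {engines} engine(s)/chip: "
              f"{gcups:.1f} GCUPS, speedup {ratio:.1f}")
```

Under these assumptions the SRC at 100 MHz with one engine gives 3.2 GCUPS and a DNA speedup of about 49.2, in line with the expected columns. The measured values fall slightly below the expected ones for a single chip and scale less than linearly with chips or nodes.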












## **Power Consumption Ratio**

N: cluster size (in number of microprocessors) necessary to obtain equivalent performance.

| Application class                      | N    | Power consumption advantage: typical reconfigurable computer vs. a cluster of dual-µP boards containing N µPs |
|----------------------------------------|------|------|
| I/O intensive applications             | 10   | 4.25 |
|                                        | 100  | 42.50 |
| Computationally intensive applications | 1000 | 425.00 |

## **Power Consumption Cost**

Assumptions:

- Both systems used non-stop over a 5-year period
- Average commercial cost of power in LA, NYC, SF, and DC: \$0.12 per kW-hour

## **Total cost of power over a five-year period, without cooling**

| N    | Cluster with N µPs | Typical reconfigurable computer | Savings   |
|------|--------------------|---------------------------------|-----------|
| 10   | \$4,468            | \$1,051                         | \$3,417   |
| 100  | \$44,680           | \$1,051                         | \$43,629  |
| 1000 | \$446,800          | \$1,051                         | \$445,749 |

## **Total cost of power over a five-year period, including cooling**

| N    | Cluster of N µPs | Typical reconfigurable computer | Savings     |
|------|------------------|---------------------------------|-------------|
| 10   | \$11,170         | \$2,628                         | \$8,542     |
| 100  | \$111,700        | \$2,628                         | \$109,072   |
| 1000 | \$1,117,000      | \$2,628                         | \$1,114,372 |

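The five-year cost figures follow from the stated assumptions (non-stop use, \$0.12 per kW-hour) once a power draw is fixed. The sketch below is a reconstruction, not the authors' calculation. The 200 W reconfigurable computer is the SRC-6 figure quoted in the savings assumptions. The 85 W per µP and the 2.5x cooling multiplier are not stated on the slides: they are back-derived here from the table values and should be treated as assumptions.

```python
HOURS = 5 * 365 * 24      # 43,800 hours of non-stop use over five years
RATE_PER_KWH = 0.12       # dollars per kW-hour, from the slide

RC_KW = 0.200             # assumption: 200 W reconfigurable computer (SRC-6 figure)
UP_KW = 0.085             # assumption: back-derived from $4,468 for 10 µPs
COOLING_FACTOR = 2.5      # assumption: ratio between the two cost tables


def five_year_cost(power_kw, cooling_factor=1.0):
    """Electricity cost in dollars for a constant load over five years."""
    return power_kw * HOURS * RATE_PER_KWH * cooling_factor


for n in (10, 100, 1000):
    cluster = five_year_cost(n * UP_KW)
    rc = five_year_cost(RC_KW)
    print(f"N={n}: cluster ${cluster:,.0f}, RC ${rc:,.0f}, "
          f"savings ${cluster - rc:,.0f}, ratio {cluster / rc:.2f}")

print(f"RC including cooling: ${five_year_cost(RC_KW, COOLING_FACTOR):,.0f}")
```

With these inputs the reconfigurable computer costs 0.2 kW × 43,800 h × \$0.12 = \$1,051.20, and the cluster-to-RC ratio for N = 10 is 4.25, which agrees with the power consumption ratio slide. The cluster figures for N = 100 and N = 1000 match the tables only to within rounding, because the slides appear to scale the rounded N = 10 value.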


| Platform   | Number of FPGAs | FPGA type | Maximum frequency | Cost saving factor (µP:RP) | Power saving factor (µP:RP) | Size saving factor (µP:RP) |
|------------|-----------------|-----------|-------------------|----------------------------|-----------------------------|----------------------------|
| SRC-6      | 8               | XC2V6000  | 100 MHz           | 1:200                      | 1:3.64                      | 1:33.3                     |
| Cray XD1   | 6               | XC2VP50   | 200 MHz           | 1:100                      | 1:20                        | 1:95.8                     |
| SGI RC-100 | 6               | XC4VLX200 | 200 MHz           | 1:400                      | 1:11.2                      | 1:34.5                     |

## **Savings of HPRC (Based on SRC-6)**

| Application                     | Speedup | Cost savings | Power savings | Size reduction |
|---------------------------------|---------|--------------|---------------|----------------|
| Smith-Waterman (DNA Sequencing) | 1138    | 6x           | 313x          | 34x            |
| DES Breaker                     | 6757    | 34x          | 1856x         | 203x           |
| IDEA Breaker                    | 641     | 3x           | 176x          | 19x            |
| RC5(32/12/16) Breaker           | 1140    | 6x           | 313x          | 34x            |

Assumptions:

- 100% cluster efficiency
- Cost factor µP:RP → 1:200
- Power factor µP:RP → 1:3.64
  - Reconfigurable processor (based on SRC-6): 200 W
  - µP board (with two µPs): 220 W
- Size factor µP:RP → 1:33.3
  - Cluster of 100 µPs = four 19-inch racks
    - footprint = 6 square feet
  - SRC MAPstation™
    - footprint = 1 square foot

## **Savings of HPRC (Based on one Cray XD1 chassis)**

| Application                     | Speedup | Cost savings | Power savings | Size reduction |
|---------------------------------|---------|--------------|---------------|----------------|
| Smith-Waterman (DNA Sequencing) | 2794    | 28x          | 140x          | 29x            |
| DES Breaker                     | 12162   | 122x         | 608x          | 127x           |
| IDEA Breaker                    | 2402    | 24x          | 120x          | 25x            |
| RC5(32/8/8) Breaker             | 2321    | 23x          | 116x          | 24x            |

Assumptions:

- 100% cluster efficiency
- Cost factor µP:RP → 1:100
- Power factor µP:RP → 1:20
  - Reconfigurable processor (based on one XD1 chassis): 2200 W
  - µP board (with two µPs): 220 W
- Size factor µP:RP → 1:95.8
  - Cluster of 100 µPs = four 19-inch racks
    - footprint = 6 square feet

## **Savings of HPRC (Based on one Altix 4700 rack)**

| Application                     | Speedup | Cost savings | Power savings | Size reduction |
|---------------------------------|---------|--------------|---------------|----------------|
| Smith-Waterman (DNA Sequencing) | 8723    | 22x          | 779x          | 253x           |
| DES Breaker                     | 38514   | 96x          | 3439x         | 1116x          |
| IDEA Breaker                    | 961     | 2x           | 86x           | 28x            |
| RC5(32/12/16) Breaker           | 6838    | 17x          | 610x          | 198x           |

Assumptions:

- 100% cluster efficiency
- Cost factor µP:RP → 1:400
- Power factor µP:RP → 1:11.2
  - Rack: 1230 W
  - µP board (with two µPs): 220 W
- Size factor µP:RP → 1:34.5
  - Cluster of 100 µPs = four 19-inch racks
    - footprint = 6 square feet
  - Reconfigurable computer
    - footprint = 2.07 square feet

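The cost, power, and size columns of the savings tables appear to be the measured speedup divided by the corresponding µP:RP factor, under the stated assumption of 100% cluster efficiency. The sketch below illustrates that relationship. It is an inference from the numbers, not a formula given on the slides, and the names `FACTORS` and `savings` are hypothetical.

```python
# µP:RP saving factors from the platform table (the "1:x" ratios, stored as x).
FACTORS = {
    "SRC-6": {"cost": 200, "power": 3.64, "size": 33.3},
    "Cray XD1": {"cost": 100, "power": 20, "size": 95.8},
    "SGI RC-100": {"cost": 400, "power": 11.2, "size": 34.5},
}


def savings(speedup, platform):
    """Savings relative to a cluster of `speedup` microprocessors.

    Assumes 100% cluster efficiency, so matching the reconfigurable
    platform takes exactly `speedup` microprocessors.
    """
    return {metric: speedup / factor
            for metric, factor in FACTORS[platform].items()}


# Smith-Waterman speedups quoted in the three savings tables.
for platform, speedup in (("SRC-6", 1138), ("Cray XD1", 2794), ("SGI RC-100", 8723)):
    result = savings(speedup, platform)
    print(platform, {metric: round(value) for metric, value in result.items()})
```

For example, the SRC-6 Smith-Waterman speedup of 1138 divided by the power factor 3.64 gives roughly 313, matching the 313x power savings in the table.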


## **Lessons Learned**

- Porting an existing code to an RC platform is difficult
  - Requires an in-depth understanding of the code structure and data flow
  - Code optimization techniques used in the microprocessor-based implementation are not applicable to the RC implementation
  - Data flow schemes used in the microprocessor-based implementation are in most cases not suitable for the RC implementation
- Only a few scientific codes can be ported to an RC platform with relatively minor modifications
  - 90% of time is spent while executing 10% of the code
- The vast majority of codes require significant restructuring in order to be 'portable'; general problems are:
  - No well-defined compute kernel
  - Compute kernel is too large to fit on an FPGA
  - Compute kernel operates on a large dataset or is not called too many times
    - function call overhead becomes an issue

## **Lessons Learned**

- Effective use of high-level programming languages/tools, such as MAP C/Carte (SRC-6) and Mitrion-SDK/Mitrion-C (RC100), to develop code for an RC platform requires some limited hardware knowledge
  - Memory organization and limitations
  - Explicit data transfer and efficient data access
  - On-chip resources and limitations
  - RC architecture-specific programming techniques
    - Pipelining, streams,
- The most significant code acceleration can be achieved when developing the code from scratch; the code developer then has the freedom to
  - structure the algorithm to take advantage of the RC platform organization and resources,
  - select the most effective SW/HW code partitioning scheme, and
  - set up data formats and a data flow graph that map well onto RC platform resources

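One way to read the "90% of time is spent while executing 10% of the code" bullet is through Amdahl's law, which the slides do not name but which bounds the benefit of accelerating only the compute kernel. The sketch below assumes the kernel accounts for a fixed fraction of the original run time and ignores data transfer and function call overhead. The function name is hypothetical.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the kernel is accelerated."""
    serial_part = 1.0 - kernel_fraction
    return 1.0 / (serial_part + kernel_fraction / kernel_speedup)


# Kernel takes 90% of the run time; try increasingly fast FPGA kernels.
for kernel_speedup in (10, 100, 1000):
    total = overall_speedup(0.9, kernel_speedup)
    print(f"kernel x{kernel_speedup}: application x{total:.2f}")
```

With 90% of the time in the kernel, the application speedup stays below 10x no matter how fast the FPGA kernel becomes, since the remaining 10% still runs on the microprocessor. This is one reason restructuring the code around a dominant kernel matters.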


| Publications |                                                                                                                                                                                                                                                                                                                                         |  |
|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|
| •            | El-Araby, Taher, El-Ghazawi, and LeMoigne. Remote Sensing and High-Performance Reconfigurable Computing<br>(HPRC) Systems, Chapter 18 in High Performance Computing in Remote Sensing, CRC                                                                                                                                              |  |
| ٠            | El-Ghazawi, El-Araby, Huang, Gaj, Kindratenko and Buell, "The Performance Promise of High-Performance<br>Reconfigurable Computing", IEEE Computer (in press)                                                                                                                                                                            |  |
| •            | Mohamed Abouellail, Esam El-Araby, Mohamed Taher, Tarek El-Ghazawi and Gregory B. Newby, "DNA and Protein<br>Sequence Alignment with High Performance Recofigurable Systems", NASA/ESA Conference on Adaptive Hardware<br>and Systems 2007(AHS2007), August 5-8, 2007, Scotland, UK                                                     |  |
| •            | Proshanta Saha, Tarek El-Ghazawi, "Automatic Software Hardware Co-Design for Reconfigurable Computing<br>Systems", 17th International Conference on Field Programmable Logic and Applications (FPL 2007), 27-29 August<br>2007, Amsterdam, Netherlands                                                                                  |  |
| •            | E. El-Araby, I. Gonzalez, and T. El-Ghazawi, "Bringing High-Performance Reconfigurable Computing to Exact<br>Computations", to appear in the proceedings of the 17th International Conference on Field Programmable Logic and<br>Applications (FPL 2007), Amsterdam, Netherlands, 27-29 August 2007.                                    |  |
| •            | Proshanta Saha and Tarek El-Ghazawi, A Methodology for Automating Co-Scheduling for Reconfigurable Computing<br>Systems. Fifth ACM-IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE'2007),<br>Nice, May 2007.                                                                                          |  |
| •            | Proshanta Saha, Tarek El-Ghazawi, "Software/Hardware Co-Scheduling for Reconfigurable Computing Systems";<br>International Symposium on Field-Programmable Custom Computing Machines 2007 (FCCM 2007); 23-25 April 2007,<br>Napa, CA                                                                                                    |  |
| •            | Proshanta Saha, Tarek El-Ghazawi, "Applications of Heterogeneous Computing in Hardware/Software Co-scheduling<br>", International Conference on Computer Systems and Applications (AICCSA 2007), Amman, May 2007.                                                                                                                       |  |
| •            | Proshanta Saha, Tarek El-Ghazawi, "Software/Hardware Co-Scheduling for Reconfigurable Computing Systems",<br>Proceeding of III Southern Conference on Programmable Logic (SPL 2007), February 26-28, 2007 - Mar del Plata,<br>Argentina                                                                                                 |  |
| •            | Miaoqing Huang, Tarek El-Ghazawi, Brian Larson, Kris Gaj : "Development of Block-cipher Library for Reconfigurable<br>Computers", Proceeding of III Southern Conference on Programmable Logic (SPL 2007), February 26-28, 2007 - Mar de<br>Plata, Argentina                                                                             |  |
| •            | Esam El-Araby, Mohamed Taher, Mohamed Abouellail, Tarek El-Ghazawi, and Gregory B. Newby, "Comparative<br>Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study", Proceeding<br>of III Southern Conference on Programmable Logic (SPL 2007), February 26-28, 2007 - Mar del Plata, Argentina |  |
| ICF          | EPT07 12/11/07 212                                                                                                                                                                                                                                                                                                                      |  |



|     | Publications                                                                                                                                                                                                                                                                                         |
|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     |                                                                                                                                                                                                                                                                                                      |
| ٠   | E. El-Araby, M. Taher, K. Gaj, T. El-Ghazawi, D. Caliga, N. Alexandridis, "System-Level Parallelism and Concurrency<br>Maximization in Reconfigurable Computing Applications", International Journal for Embedded Systems (IJES), vol. 2, no. 1/2,<br>2006, pp. 62-72.                               |
| •   | S. Kaewpijit, J. Le Moigne, and T. El-Ghazawi, "Automatic Reduction of Hyperspectral Imagery Using Wavelet Spectral<br>Analysis," IEEE Transactions on Geosciences and Remote Sensing (TGARS), Vol. 41 No. 4, April 2003, pp 863-871.                                                                |
| ٠   | T. El-Ghazawi, K. Gaj, N. Alexandridis, F. Vroman, N. Nguyen, J. Radzikowski, P. Samipagdi, and S. Suboh, "A Performance<br>Study of Job Management Systems," Concurrency and Computation: Practice and Experience, John Wiley & Sons, Ltd.                                                          |
| •   | E. El-Araby, M. Taher, T. El-Ghazawi, A. Youssif, R. Irish, and J. Le Moigne, "Performance Scalability of a Remote Sensing<br>Application on High Performance Reconfigurable Platforms", NASA Earth-Sun System Technology Conference (ESTC 2006),<br>Maryland, USA, June, 2006.                      |
| •   | E. El-Araby, M. Taher, T. El-Ghazawi, J. Le Moigne, "Prototyping Automatic Cloud Cover Assessment (ACCA) Algorithm for<br>Remote Sensing On-Board Processing on a Reconfigurable Computer," Proc. IEEE 2005 Conference on Field Programmable<br>Technology, FPT'05, Singapore, Dec. 11-14, 2005.     |
| •   | J. Harkins, E. El-Araby, M. Huang, T. El-Ghazawi, "Performance of Sorting Algorithms on a Reconfigurable Computer," Proc.<br>IEEE 2005 Conference on Field Programmable Technology, FPT'05, Singapore, Dec. 11-14, 2005.                                                                             |
| •   | E. El-Araby, K. Gaj, T. El-Ghazawi, "A System Level Design Methodology for Reconfigurable Computing Applications," Proc.<br>IEEE 2005 Conference on Field Programmable Technology, FPT'05, Singapore, Dec. 11-14, 2005.                                                                              |
| •   | C. Shu, K. Gaj, T. El-Ghazawi, "Low Latency Elliptic Curve Cryptography Accelerators for NIST Curves on Binary Fields," Proc. IEEE 2005 Conference on Field Programmable Technology, FPT'05, Singapore, Dec. 11-14, 2005.                                                                            |
| •   | S. Bajracharya, D. Misra, K. Gaj, T. El-Ghazawi, "Reconfigurable Hardware Implementation of Mesh Routing in Number<br>Field Sieve Factorization," Extended Abstract, Talk Special Purpose Hardware for Attacking Cryptographic Systems,<br>SHARCS 2005, Paris, France, Feb. 24-25, 2005, pp. 71-81.  |
| •   | E. El-Araby, T. El-Ghazawi, J. Le Moigne, and K. Gaj, "Wavelet Spectral Dimension Reduction of Hyperspectral Imagery<br>on a Reconfigurable Computer," Proc. IEEE 2004 Conference on Field Programmable Technology, FPT 2004, Brisbane,<br>Australia, Dec. 6-8, 2004, pp. 399-402.                   |

## **Publications**

- E. Chitalwala, T. El-Ghazawi, K. Gaj, N. Alexandridis, D. Poznanovic, "Effective System and Performance Benchmarking for Reconfigurable Computers," Proc. IEEE 2004 Conference on Field Programmable Technology, FPT 2004, Brisbane, Australia, Dec. 6-8, 2004, pp. 453-456.
- S. Bajracharya, C. Shu, K. Gaj, T. El-Ghazawi, "Implementation of Elliptic Curve Cryptosystems over GF(2<sup>n</sup>) in Optimal Normal Basis on a Reconfigurable Computer," 14th International Conference on Field Programmable Logic and Applications, FPL 2004, Antwerp, Belgium, Aug 30 - Sept 1, 2004, pp. 1001-1005.
- E. El-Araby, M. Taher, K. Gaj, T. El-Ghazawi, D. Caliga, N. Alexandridis, "System-Level Parallelism and Throughput Optimizations in Designing Reconfigurable Computing Applications," Reconfigurable Architecture Workshop, RAW 2004, Santa Fe, USA, Apr 26-27, 2004.
- N. Nguyen, K. Gaj, D. Caliga, T. El-Ghazawi, "Implementation of Elliptic Curve Cryptosystems on a Reconfigurable Computer," Proc. IEEE International Conference on Field-Programmable Technology, FPT 2003, Tokyo, Japan, Dec. 2003, pp. 60-67.
- E. El-Araby, M. Taher, K. Gaj, D. Caliga, T. El-Ghazawi, N. Alexandridis, "Exploiting System-level Parallelism in the Application Development on a Reconfigurable Computer," Proc. IEEE International Conference on Field-Programmable Technology, FPT 2003, Tokyo, Japan, Dec. 2003, pp. 443-446.
- A. Michalski, K. Gaj, T. El-Ghazawi, "An Implementation Comparison of an IDEA Encryption Cryptosystem on Two General-Purpose Reconfigurable Computers," LNCS 2778, 13th International Conference on Field Programmable Logic and Applications, FPL 2003, Lisbon, Portugal, Sep. 2003, pp. 204-219.
- O. D. Fidanci, D. Poznanovic, K. Gaj, T. El-Ghazawi, and N. Alexandridis, "Performance and Overhead in a Hybrid Reconfigurable Computer," Reconfigurable Architecture Workshop, RAW 2003, Nice, France, Apr. 2003.
- T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study," Supercomputing'02, IEEE CS, Baltimore, Nov. 16-22, 2002.
- K. Gaj, T. El-Ghazawi, F. Vroman, N. Nguyen, J. R. Radzikowski, P. Samipagdi, and S. A. Suboh, "Performance Evaluation of Selected Job Management Systems," Proceedings of IEEE International Parallel and Distributed Processing Symposium (PMEO-PDS'02), Fort Lauderdale, Florida, Apr. 15-19, 2002.