Reconfigurable Acceleration for Monte Carlo based Financial Simulation


The Chinese University of Hong Kong
Cluster Technology Ltd., Hong Kong*
Electrical Engineering Department, UCLA, USA**
Department of Computing, Imperial College, UK***

IEEE FPT December 2005
Talk outline

• achievements
• motivation
• financial introduction
• Monte Carlo (MC) & the BGM model
• new generic MC architecture
• BGM core architecture
• performance evaluation
• future work
• summary
Achievements

1. hardware accelerator for Monte Carlo (MC) simulation: on-chip processor + reconfigurable logic
2. generalized number system: simulate and optimize designs
3. apply this MC simulation to support Brace, Gatarek and Musiela (BGM) interest rate model
4. efficient Gaussian random number generators, fast division techniques
5. XC2VP30FPGA at 50MHz MC hardware design: 25x Pentium4 1500MHz
Motivation

• a cap gives the holder the right to stick with a specified rate; how much is it? (bank → holder)
• use Monte Carlo simulation to find this cap value
  – requires a large number of randomized runs
  – computation speed is the major barrier
• previous work about MC accelerations focus on
  – biochemical simulation
  – heat transfer, physics simulations
• this work describes MC acceleration and its application in financial modeling
Monte Carlo introduction

- randomly generate values for uncertain variables over and over to simulate a model
- useful for obtaining numerical solutions to analytically difficult problems
- MC can discretize a probability space, using finite samples and find the average solution
- standard error reflects the accuracy of the mean value
  - batch size will vary this standard error value
  - random paths are divided into batches for the standard error
  - e.g. cap value = $10, standard error = 10 cents

Model simulation path
simulation path
simulation path
simulation path
simulation path
simulation path
simulation path
simulation path
simulation path
simulation path

find the mean = numerical solution = value of a cap

standard error value
Financial engineering introduction

- **Fn**: forward rate (we can lock in at a later date)
- call option: holder has the right to buy asset at a price by a certain date
- **BGM**: a popular model to price interest rate derivates (e.g. caps, swap options, bond options)

\[ \tau = \frac{1}{4} \]
Cap valuation: simplified example

• suppose principal = $1000
• cap rates: 5% in $[T_0, T_1]$, 6% in $[T_1, T_2]$, 7% in $[T_2, T_3]$
• note that we do not know the value of forward rate
  – suppose they are 7%, 7.2%, 7.5%
  – payoff = $1k \times ((7\%-5\%)\times0.25\text{year}\times P(t_0, T_0)+ (7.2\%-6\%) \times 0.25\text{year}\times P(t_0, T_1)+ (7.5\%-7\%) \times 0.25\text{year}\times P(t_0, T_2))$
  – $P(t_1, t_2)$ is the discounting factor from $t_1$ to $t_2$
  – if \textbf{cap value} = this payoff value (\textit{accurate}?)
  – this is the reason why we need MC & the BGM model

\[ \tau = 1/4 \]
The BGM model

- based on a stochastic differential equation (SDE)
- price of a cap is computed by using MC simulation
- BGM paths are generated according to the following equation
- involve vector products, divisions
- Noise term: $d\tilde{W}(t)$  
  environment calibration: $\tilde{\sigma}_n(t)$

\[
\frac{dF_n(t)}{F_n(t)} = (\tilde{\mu}_n(t) + \tilde{\sigma}_n(t) \cdot d\tilde{W}(t)) dt + \tilde{\sigma}_n(t) \cdot d\tilde{W}(t)
\]

Step 1: for $n = \text{CurrPeriod} + 1$ to $N$

\[
f_{\text{actor}} = \frac{\tau_n F_n}{1 + \tau_n F_n}
\]

Step 2:

\[
\tilde{\mu}_n = f_{\text{actor}} \times \tilde{\sigma}_n
\]

Step 3:

\[
\tilde{\mu}_n = \tilde{\mu}_n + \tilde{\mu}_{n-1}
\]

Step 4:

\[
\kappa = \left(\tilde{\mu}_n \cdot \tilde{\sigma}_n\right) dt + (d\tilde{W} \cdot \tilde{\sigma}_n)
\]

Step 5:

\[
dF_n = \kappa \times F_n
\]

Step 6:

\[
F_n = F_n + dF_n
\]
Our MC architecture

four components:
• random number generators
• MC simulation core
• post processing
• microprocessor core for control logic and host interface

e.g. calculate payoff values (caplet, swaplet)
e.g. final cap pricing, standard error value
Gaussian random number generation

- consists of 4 stages
- use LFSR to generate uniform random numbers
- function evaluations of \( f, g_1 \) and \( g_2 \) and multiplications – piecewise approximation
- accumulation step to overcome the quantization and approximation errors
- noise generation at one sample per clock cycle
MC architecture – using BGM

MC simulation core
• BGM design

int MC_Simulate_Batches(int NumBatch)
{
    /* Simulate batches */
    for(n=0; n<NumBatch; n++) {
        mc_GenPath(Data);
        · · · · ·
    }
    /* Resulting mean and standard error */
    · · · · ·
}
BGM core architecture

- mapping the pseudo code into hardware
- fully pipelined design
- generate $F'_n$

Step 1: for $n = CurrPeriod + 1$ to $N$
Step 2: $factor = \tau_n F_n / (1.0 + \tau_n F_n)$
Step 3: $\tilde{\mu}_n = factor \times \tilde{\sigma}_n$
Step 4: $\tilde{\mu}_n = \tilde{\mu}_n + \tilde{\mu}_{n-1}$
Step 5: $\kappa = (\tilde{\mu}_n \cdot \tilde{\sigma}_n) dt + (dW \cdot \tilde{\sigma}_n)$
Step 6: $dF_n = \kappa \times F_n$
Step 7: $F_n = F_n + dF_n$
BGM core – Step 1 to 3

- initialize the Brownian motion parameter ($dW$), volatility vector ($sigma$), forward rate ($F_n$)
- “vector” blocks convert $dW$ and $sigma$ to scalars

Step 2: \[ \text{factor} = \sqrt{F_n/(1.0 + \tau_n F_n)} \]

Step 3: \[ \tilde{\mu}_n = \text{factor} \times \tilde{\sigma}_n \]
BGM core:  

Step 4:  

\[ \tilde{\mu}_n = \tilde{\mu}_n + \tilde{\mu}_{n-1} \]

- a vector addition
- depth of FIFO is decided by the number of BGM paths
BGM core – Step 5 & 6

- two vector multiplications (dot product)
- $x_1y_1 + x_2y_2 + x_3y_3$

\[
\kappa = (\vec{\mu}_n \cdot \vec{\sigma}_n) \, dt + (\vec{dW} \cdot \vec{\sigma}_n)
\]

\[
dF_n = \kappa \times F_n
\]
2-D data flow of the BGM simulation

- MC simulation generates a set of independent random forward rate paths
- different pipelined stage computes a different path
- resolve the data dependencies between different paths

M: the number of paths
N: the number of standard forward rates
Revisit: Cap pricing

- suppose principal = $1000
- cap rates: 5% in \([T_0,T_1]\), 6% in \([T_1,T_2]\), 7% in \([T_2,T_3]\)
- note that we do not know the value of forward rate
  - suppose they are 7%, 7.2%, 7.5%
  - payoff = 1k * ((7%-5%)*0.25year*P(t_0,T_0)+(7.2%-6%)*0.25year*P(t_0,T_1)+(7.5% -7%)*0.25year*P(t_0,T_2))
  - \(P(t_1,t_2)\) is the discounting factor from \(t_1\) to \(t_2\)
  - if cap value = this payoff value (accurate?)
  - this is the reason why we need MC & the BGM model

\[ \tau = 1/4 \]
Post-processing – Cap pricing

• implement this equation for cap pricing

\[ \text{payoff}_n = \text{principal} \times \tau_n \times \max(F_n(t_n) - \text{cap rate}, 0) \]
Program running on the PowerPC

• calculate the means and standard errors of the randomised trial runs

each batch has 50 paths
(e.g. 5000 total paths)

Batch Mean

... ...

Batch Mean

SumBatchMean / NumBatch

Mean Standard Error

/* Simulate batches */

for (k = 0; k < NumBatch; k++) {
    bgm_GenPath(bgmData);
    SumBatchMean += bgmData;
    SumSqBatchMean += bgmData * bgmData;
}

/* Calculate the resulting mean and standard error */

Mean = SumBatchMean / NumBatch;
SqMean = sqrt((SumSqBatchMean –
              SumBatchMean * SumBatchMean / NumBatch) / (NumBatch – 1.0) / NumBatch);
Performance evaluation

- use Xilinx ML310 system, XC2VP30 device, with two embedded PowerPC cores
- compare with a P4 1.5GHz machine
  - GCC 3.3 compilation with -O3 optimization
- experimental results show a scalable speedup to the software implementation
- NumPath-per-Batch = 50

<table>
<thead>
<tr>
<th>Paths Number</th>
<th>50,000</th>
<th>500,000</th>
<th>5,000,000</th>
<th>50,000,000</th>
</tr>
</thead>
<tbody>
<tr>
<td>FPGA (Sec.)</td>
<td>2.63</td>
<td>25.2</td>
<td>242</td>
<td>2400</td>
</tr>
<tr>
<td>PC (Sec.)</td>
<td>63</td>
<td>630</td>
<td>6300</td>
<td>63000</td>
</tr>
<tr>
<td>Speedup</td>
<td>24.9</td>
<td>25</td>
<td>26</td>
<td>26.2</td>
</tr>
</tbody>
</table>
Wordlength optimisation

- given: financial applications require at least 4 decimal place accuracy
- use “computer arithmetic synthesis tool” (CAST) for software simulation
  - support fixed-point, floating-point simulations to generate generic VHDL descriptions

<table>
<thead>
<tr>
<th>Arithmetic</th>
<th>mul</th>
<th>add</th>
<th>div</th>
<th>acc</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fixed-Point</td>
<td>(2, 31)</td>
<td>(2, 31)</td>
<td>(2, 31)</td>
<td>(2, 31)</td>
</tr>
<tr>
<td>Floating-Point</td>
<td>(8, 28)</td>
<td>(8, 28)</td>
<td>(8, 28)</td>
<td>(8, 28)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Arithmetic</th>
<th>mul</th>
<th>add</th>
<th>div</th>
<th>acc</th>
</tr>
</thead>
<tbody>
<tr>
<td>Fixed-Point</td>
<td>(2, 31)</td>
<td>(2, 30)</td>
<td>(2, 15)</td>
<td>(2, 20)</td>
</tr>
<tr>
<td>Floating-Point</td>
<td>(3, 22)</td>
<td>(3, 30)</td>
<td>(3, 15)</td>
<td>(3, 15)</td>
</tr>
</tbody>
</table>
Fast division

- evaluate “F / (F+1)” where F is between 0 and 1
- use Hung’s method* with a small lookup table with 3 pipeline stages, two multiplications and one subtraction

\[
\frac{F}{1 + F} \approx \frac{F}{F_h^*} \left(1 - \frac{F_l}{F_h^*}\right) = \frac{F(F_h^* - F_l^*)}{F_h^{*2}}
\]

Configuration optimization

- VHDL implementation, synthesize with Xilinx XST, and implement using Xilinx EDK
- slices reduction due to the smaller exponent and fraction bit-widths
- significant BRAM reduction due to the bit-width reduction for the divider (by $2^8$)

<table>
<thead>
<tr>
<th>Configuration</th>
<th>Float-28</th>
<th>Float-Opt</th>
<th>Savings</th>
</tr>
</thead>
<tbody>
<tr>
<td>Frequency (MHz)</td>
<td>61.44</td>
<td>61.56</td>
<td>-</td>
</tr>
<tr>
<td>Slices</td>
<td>7,041</td>
<td>5,875</td>
<td>16.6%</td>
</tr>
<tr>
<td>Multiplier</td>
<td>48</td>
<td>48</td>
<td>0%</td>
</tr>
<tr>
<td>Block RAM</td>
<td>29</td>
<td>1</td>
<td>96.6%</td>
</tr>
</tbody>
</table>
Device utilization summary

- device utilization of XC2VP30FF896
  - number of SLICEs: 13,266 / 13,696 (96%)
  - number of Block RAMs: 74 / 136 (54%)
  - number of MULT18x18: 58 / 136 (42%)

<table>
<thead>
<tr>
<th></th>
<th>Processor</th>
<th>BGM Core</th>
<th>RNGs</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of SLICEs</td>
<td>2,168 (15%)</td>
<td>2,775 (20%)</td>
<td>5,820 (42%)</td>
</tr>
<tr>
<td>Number of Block RAMs</td>
<td>22 (16%)</td>
<td>16 (11%)</td>
<td>3 (2%)</td>
</tr>
<tr>
<td>Number of MULT18X18s</td>
<td>-</td>
<td>40 (29%)</td>
<td>-</td>
</tr>
<tr>
<td>Number of PPC405s</td>
<td>1 (50%)</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th></th>
<th>Post-processing</th>
<th>Misc Logics and buffers</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of SLICEs</td>
<td>538 (3%)</td>
<td>1,965 (19%)</td>
</tr>
<tr>
<td>Number of Block RAMs</td>
<td>2 (1%)</td>
<td>31 (24%)</td>
</tr>
<tr>
<td>Number of MULT18X18s</td>
<td>6 (4%)</td>
<td>12 (9%)</td>
</tr>
<tr>
<td>Number of PPC405s</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>
Future work

• apply run-time reconfiguration technique
• extension work to cover other financial models
• integration MC simulation core with embedded system
• two embedded PowerPC in ML310
  – one for MC simulation,
  – one for embedded Linux,
  – communicate with other FPGA boards,
  – virtually unlimited scalability
Summary

1. hardware accelerator for Monte Carlo (MC) simulation: on-chip processor + reconfigurable logic
2. generalized number system: simulate and optimize designs
3. apply this MC simulation to support Brace, Gatarek and Musiela (BGM) interest rate model
4. efficient Gaussian random number generator, fast division techniques
5. XC2VP30FPGA at 50MHz MC hardware design: 25x Pentium4 1500MHz