For those of you that grew up with ADSR envelopes on the BBC Microcomputer or the Commodore 64 or on your Korg…
The Last Programmer Standing will be holding an FPGA
In my article Computing Without Processors http://cacm.acm.org/magazines/2011/8/114935-computing-without-processors/fulltext in the August edition of Communications of the ACM I advocate the deployment of heterogeneous computing technology to help address the ever-pressing requirements we have for computing with reduced latency and reduced energy consumption. Significant challenges lie ahead for those who try to rein in the extreme computing and communication power offered by alternative architectures like GPUs and FPGAs. In today’s world the battlefronts are defined by who can compute the fastest (at reasonable cost), e.g. compute a more precise answer to a high-level query for a decision engine. The war will be won by the programmer holding an FPGA (or its moral evolutionary equivalent).
For on-line reading you may prefer the ACM Queue version of the article: http://queue.acm.org/detail.cfm?id=2000516
Computing Without Processors article on ACM Queue
My article on Computing Without Processors is on the ACM Queue page now: http://queue.acm.org/detail.cfm?id=2000516
The article explains why it is important for us to learn to program not just regular microprocessors but also many other kinds of computing elements. As we struggle to meet latency and energy consumption constraints we will inevitably be forced to use heterogeneous architectures which comprise the evolution of today’s GPUs (for data-parallelism) and FPGAs (for 2D and 3D spatial computing).
Creating a Windows DLL from a Haskell Program and calling it from C++
Every so often you have to leave the warm comfortable safe world of Haskell and venture out into what some unsavoury types call “the real world”. I’ve always thought it better to call the real world rather than have it call on me in Haskell-land. But when that does not work you have to bite the bullet and make a DLL that can be called from other Windows programs. Every so often I do this and it works (for a single DLL) and then time passes and GHC changes and then suddenly you have to do something else with the latest shiny version of GHC. So here is the information you need in addition to what’s in the Haskell Platform 2011.2.0.1 distribution documentation (11.6.2 Making DLLs to be called from other languages), which works with version 7.0.3 of GHC.
Step 1: Read the documentation but of course don’t completely believe it because it is written from the perspective of people that for some strange reason use operating systems with names that don’t start with “Windows”.
Step 2: As the documentation says create a file called Adder.hs and put the Haskell adder function in it:
-- Adder.hs
{-# LANGUAGE ForeignFunctionInterface #-}
module Adder where

adder :: Int -> Int -> IO Int -- gratuitous use of IO
adder x y = return (x+y)

foreign export stdcall adder :: Int -> Int -> IO Int
Step 3: You also need to create a program to start up and wind down the GHC run-time which you should put in the file StartEnd.c:
// StartEnd.c
#include <Rts.h>

extern void __stginit_Adder(void);

void HsStart()
{
   int argc = 1;
   char* argv[] = {"ghcDll", NULL}; // argv must end with NULL

   // Initialize Haskell runtime
   char** args = argv;
   hs_init(&argc, &args);

   // Tell Haskell about all root modules
   hs_add_root(__stginit_Adder);
}

void HsEnd()
{
   hs_exit();
}
Step 4: Compile up these programs to generate the DLLs etc.
ghc -c Adder.hs
ghc -c StartEnd.c
ghc -shared -o Adder.dll Adder.o Adder_stub.o StartEnd.o
which will create the files Adder.dll, Adder.dll.a and Adder_stub.h which you will need in the next step.
Step 5: Now you can fire up Visual Studio and write a program that calls the Haskell adder function.
// haskell_from_cpp.cpp : main project file.
#include "stdafx.h"
#include <stdio.h>
#include "Adder_stub.h"

extern "C" {
    void HsStart();
    void HsEnd();
}

using namespace System;

int main(array<System::String ^> ^args)
{
    HsStart();
    // can now safely call functions from the DLL
    printf("12 + 5 = %i\n", adder(12,5)) ;
    HsEnd();
    return 0;
}
Note that you don’t need to include the header file for HsFFI.h because this is included in the header file Adder_stub.h. HOWEVER you now need to hack one of the header files. Specifically, you have to change the way the calling convention is specified for the adder function so that it is accepted by the Microsoft C++ compiler by editing the file Adder_stub.h:
#include "HsFFI.h"
#ifdef __cplusplus
extern "C" {
#endif
extern HsInt __stdcall adder(HsInt a1, HsInt a2);
#ifdef __cplusplus
}
#endif
Specifically, replace “__attribute__((__stdcall__))” with just “__stdcall ”.
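If you rebuild the DLL often, this edit can be scripted rather than done by hand. The sketch below is a minimal Python helper that performs the textual replacement described above. The function name patch_stub_header and the sample declaration are illustrative assumptions, so check them against the Adder_stub.h that your version of GHC actually generates.

```python
GCC_STDCALL = "__attribute__((__stdcall__))"
MSVC_STDCALL = "__stdcall"


def patch_stub_header(text: str) -> str:
    """Replace the GCC-style stdcall attribute with the MSVC keyword."""
    return text.replace(GCC_STDCALL, MSVC_STDCALL)


# Hypothetical declaration, as it might appear before patching.
sample = "extern HsInt __attribute__((__stdcall__)) adder(HsInt a1, HsInt a2);"
patched = patch_stub_header(sample)
print(patched)
```

To apply it to the real file, read Adder_stub.h, pass its contents through the function and write the result back. Running it twice is harmless because the second pass finds nothing left to replace.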
Now you will need to copy over some extra header files from the lib/include sub-directory of your Haskell install directory (look at the Header Files list in the window below). On my machine the include files are in C:\Program Files (x86)\Haskell Platform\2011.2.0.1\lib\include. You need ghcautoconf.h, ghcconfig.h, ghcplatform.h, HsFFI.h and from the stg directory you need Types.h (copy the stg directory).
You need to adjust your project settings so the C++ compiler can find the header files. You will also need to add Adder.dll.a as an input to the Additional Dependencies field under the Input section of the Linker settings for the project and perhaps add the search path for this file to the Additional Library Directories input field under the Linker tab. Now you should be able to build to generate an executable (in my case in the Debug sub-directory of the project).
Step 7: Copy Adder.dll to the same directory as the executable and run the program.
satnams@MSRC-ardberg1 ~/haskell/dll/haskell_from_cpp/Debug
$ ./haskell_from_cpp
12 + 5 = 17
You have now finally built a DLL from a Haskell program that can be called from C++ under Windows. Now, it was not that painful. Was it?
Compiling C# Programs to FPGA Circuits: An Ethernet Packet Processing Example
In a previous blog article I showed how to get the default Ethernet address swap circuit working on the ML605 board. The address swap module (which can be generated in VHDL or Verilog) is tedious to write yourself because you have to explicitly write it as a state machine. In typical hardware description languages you have “lost the semicolon” and the program you write very much resembles coding with call-backs in GUIs or REST-style programming or writing directly in continuation passing style. With David Greaves at the University of Cambridge I’ve been working on a system called Kiwi which tries to civilize hardware design by allowing us to model circuit behaviour with multi-threaded C# programs (or any other .NET language) and then have these programs automatically converted into Verilog hardware descriptions ready for implementation on FPGAs.
The code below describes a circuit which processes raw Ethernet packets and is designed to mimic the hand-written VHDL or Verilog that comes with the Ethernet wrapper provided by Xilinx. Its job is to read a packet into memory, swap the source and destination MAC addresses and send the packet back out. Unlike the VHDL or Verilog versions this circuit stores the whole packet in memory and can crawl over it at line speed to perform various kinds of inspections (e.g. for a virus) or processing operations (e.g. for manipulating Ethernet packets as they flow through the network).
The circuit works at the level of the LocalLink protocol which is used to wrap the raw Ethernet level signals to make it easier to send and receive frames and to do flow control. Our objective is to generate a behaviour like this (simulator output):
or this (output from live running circuit running on the ML605 board using a logic analyser):
until the end of this incoming frame and then generate the outgoing frame with the MAC addresses swapped, i.e. 00:22:48:FA:3F:87 swapped with 00:26:B9:79:91:6B
Here’s the C# source to perform the Ethernet echo with the MAC address swap:
using KiwiSystem;

class LocalLinkLoopBackTest
{
    static byte[] buffer = new byte[1024];

    class EthernetEcho
    {
        [Kiwi.Hardware()] // This class should be implemented in hardware

        // This class describes a circuit which echoes a raw Ethernet frame
        // and is designed to mimic the functionality of the address swap
        // module provided by the Xilinx Core Generator Tri-Mode Ethernet MAC wrapper.

        // DST is this hardware module
        // SRC is the other end (e.g. test bench or the Ethernet MAC interface)
        // RX is the incoming data
        // TX is the outgoing data

        // These are the ports of the circuit (and will appear as ports in the generated Verilog)
        [Kiwi.InputWordPort("rx_data")]
        static byte rx_data; // Incoming data byte
        [Kiwi.InputBitPort("rx_sof_n")]
        static bool rx_sof_n; // Start of frame indicator
        [Kiwi.InputBitPort("rx_eof_n")]
        static bool rx_eof_n; // End of frame indicator
        [Kiwi.InputBitPort("rx_src_rdy_n")]
        static bool rx_src_rdy_n; // Source ready indicator
        //[Kiwi.OutputBitPort("rx_dst_rdy_n")]
        //static bool rx_dst_rdy_n; // Destination ready indicator

        [Kiwi.OutputWordPort("tx_data")]
        static byte tx_data; // Write data to be sent to device
        [Kiwi.OutputBitPort("tx_sof_n")]
        static bool tx_sof_n; // Start of frame indicator
        [Kiwi.OutputBitPort("tx_eof_n")]
        static bool tx_eof_n; // End of frame indicator
        [Kiwi.OutputBitPort("tx_src_rdy_n")]
        static bool tx_src_rdy_n; // Source ready indicator
        [Kiwi.InputBitPort("tx_dst_rdy_n")]
        static bool tx_dst_rdy_n; // Destination ready indicator

        // This buffer stores an incoming Ethernet frame
        static byte[] buffer = new byte[1024];

        // This method describes the operations required to echo the Ethernet frame
        static public void echo()
        {
            tx_sof_n = !false; // We are not at the start of a frame
            tx_src_rdy_n = !false;
            tx_eof_n = !false; // We are not at the end of a frame
            bool start = !rx_sof_n && !rx_src_rdy_n; // The start condition
            int i, j;
            bool doneReading;

            while (true) // Process packets indefinitely
            {
                // Wait for SOF and SRC_RDY
                while (!start)
                {
                    Kiwi.Pause(); // Wait for a clock tick
                    start = !rx_sof_n && !rx_src_rdy_n; // Check for start of frame
                }

                // Read in the entire frame
                i = 0;
                doneReading = false;
                // Read the remaining bytes
                while (!doneReading)
                {
                    if (!rx_src_rdy_n)
                    {
                        buffer[i] = rx_data;
                        i++;
                    }
                    doneReading = !rx_eof_n;
                    Kiwi.Pause();
                }

                tx_src_rdy_n = !true; // Signal that the source is ready to transmit
                // Now send an Ethernet packet back to where it came from
                // Swap source and destination MAC addresses
                j = 0;
                tx_sof_n = !false;
                for (j = 6; j < 12; j++) // Process a 6 byte MAC address
                {
                    tx_data = buffer[j];
                    tx_sof_n = j != 6;
                    Kiwi.Pause();
                }
                for (j = 0; j < 6; j++) // Process a 6 byte MAC address
                {
                    tx_data = buffer[j];
                    Kiwi.Pause();
                }
                // Transmit the remaining bytes
                j = 12;
                while (j < i)
                {
                    tx_data = buffer[j];
                    if (j == i - 1)
                        tx_eof_n = !true;
                    j++;
                    Kiwi.Pause();
                }
                tx_src_rdy_n = !false;
                tx_eof_n = !false;
                start = false; // No longer at start of frame
                Kiwi.Pause(); // End of frame, ready for next frame
            }
        }
    }
Compared to VHDL or Verilog it is very convenient to be able to use normal programming language control flow constructs. This makes Kiwi convenient for writing sophisticated network processing algorithms in a “direct” style rather than explicitly coding finite state machines. Furthermore, the circuit can be tested in Visual Studio using regular debugging and code analysis tools.
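When checking simulator or logic analyser traces it helps to have a byte-level reference for what the echoed frame should contain. The Python sketch below is such a reference model, not part of Kiwi: it performs the same reordering as the transmit loops in the circuit above (bytes 6 to 11 first, then bytes 0 to 5, then the rest). The function names, the EtherType and the payload are made up for the example; only the two MAC addresses come from the trace described earlier.

```python
def parse_mac(text: str) -> bytes:
    """Convert 'AA:BB:CC:DD:EE:FF' into six raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def swap_mac_addresses(frame: bytes) -> bytes:
    """Return a copy of frame with its first two 6-byte fields exchanged."""
    if len(frame) < 12:
        raise ValueError("frame is too short to hold two MAC addresses")
    # Same order as the transmit loops: bytes 6..11, then 0..5, then the rest.
    return frame[6:12] + frame[0:6] + frame[12:]


# Which address plays the destination role is an arbitrary choice here.
destination = parse_mac("00:26:B9:79:91:6B")
source = parse_mac("00:22:48:FA:3F:87")
payload = b"\x08\x00" + b"example"  # made-up EtherType and payload

incoming = destination + source + payload
outgoing = swap_mac_addresses(incoming)
print(outgoing[:6].hex(), outgoing[6:12].hex())
```

Swapping twice gives back the original frame, which is a convenient sanity check when comparing against captured traffic.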
The generated Verilog can be simulated or turned into an FPGA circuit using the Xilinx design tools e.g. on this XC6VLX240T chip (it’s the light blue dots on the right hand side):
We chose a 125MHz clock to drive our packet processing circuit although the circuit is capable of running much faster. You can find out more about Kiwi by looking at the papers on my Microsoft website.
A small part of the code above (a simpler shift register variant rather than storing the whole packet) in VHDL would look like this (yuk!):
control_fsm_sync_p : process(rx_ll_clock)
begin
  if rising_edge(rx_ll_clock) then
    if rx_ll_reset = '1' then
      control_fsm_state <= wait_sf;
    else
      if rx_enable = '1' then
        case control_fsm_state is
          when wait_sf =>
            if sof_sr_content(4) = '1' then
              control_fsm_state <= bypass_sa1;
            else
              control_fsm_state <= wait_sf;
            end if;
          when bypass_sa1 =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= bypass_sa2;
            else
              control_fsm_state <= wait_sf;
            end if;
          when bypass_sa2 =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= bypass_sa3;
            else
              control_fsm_state <= wait_sf;
            end if;
          when bypass_sa3 =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= bypass_sa4;
            else
              control_fsm_state <= wait_sf;
            end if;
          when bypass_sa4 =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= bypass_sa5;
            else
              control_fsm_state <= wait_sf;
            end if;
          when bypass_sa5 =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= bypass_sa6;
            else
              control_fsm_state <= wait_sf;
            end if;
          when bypass_sa6 =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= pass_rof;
            else
              control_fsm_state <= wait_sf;
            end if;
          when pass_rof =>
            if not(sof_sr_content(4) = '0' and eof_sr_content(4) = '1') then
              control_fsm_state <= pass_rof;
            else
              control_fsm_state <= wait_sf;
            end if;
          when others =>
            control_fsm_state <= wait_sf;
        end case;
      end if;
    end if;
  end if;
end process; -- control_fsm_sync_p
----------------------------------------------------------------------------
--Process control_fsm_comb_p
--Determines control signals from control_fsm state
----------------------------------------------------------------------------
control_fsm_comb_p : process(control_fsm_state)
begin
  case control_fsm_state is
    when wait_sf =>
      sel_delay_path <= '0'; -- output data from data shift register
      enable_data_sr <= '1'; -- enable data to be loaded into shift register
    when bypass_sa1 =>
      sel_delay_path <= '1'; -- output data directly from input
      enable_data_sr <= '0'; -- hold current data in shift register
    when bypass_sa2 =>
      sel_delay_path <= '1'; -- output data directly from input
      enable_data_sr <= '0'; -- hold current data in shift register
    when bypass_sa3 =>
      sel_delay_path <= '1'; -- output data directly from input
      enable_data_sr <= '0'; -- hold current data in shift register
    when bypass_sa4 =>
      sel_delay_path <= '1'; -- output data directly from input
      enable_data_sr <= '0'; -- hold current data in shift register
    when bypass_sa5 =>
      sel_delay_path <= '1'; -- output data directly from input
      enable_data_sr <= '0'; -- hold current data in shift register
    when bypass_sa6 =>
      sel_delay_path <= '1'; -- output data directly from input
      enable_data_sr <= '0'; -- hold current data in shift register
    when pass_rof =>
      sel_delay_path <= '0'; -- output data from data shift register
      enable_data_sr <= '1'; -- enable data to be loaded into shift register
    when others =>
      sel_delay_path <= '0';
      enable_data_sr <= '1';
  end case;
end process; -- control_fsm_comb_p
Using the Virtex-6 Embedded Tri-Mode Ethernet MAC Wrapper v1.4 with the ML605 Board
Whenever I get a new Xilinx development board I go through the tedious process of porting my circuits that use Ethernet communication to the new board. To save other people the pain I went through to get this to work I have written this blog which outlines the changes you need to make to the automatically generated Ethernet MAC wrapper to make the default echo circuit work. Note that these instructions only apply to the Xilinx ML605 development board:

I have tested these instructions with ISE version 12.4 (running on Windows 7 64-bit). Step 1: generate the wrapper from Core Generator:
Select the GMII interface and leave the other options unchanged:
Make sure that you have jumpers J66 and J67 over pins 1 and 2 for GMII operation. (You can try to get SGMII working by setting the jumpers over pins 2 and 3. Good luck.)
On the next screen I suggest you set the address filter feature (please don’t steal the address below especially if you work at Microsoft).
Now generate the core. Now you need to edit the UCF file. You should read the user guides for the Ethernet MAC and the wrapper to work out the details for your particular situation but here are some of the things you will need to do. First, add the necessary pin assignments. For reset I use the centre push-button:
NET "RESET" LOC = G26; # Centre push-button
For clocking I use the differential 200MHz clock inputs from the Epson crystal from which I will derive the 125MHz clock to drive the Ethernet sub-system:
INST SYSCLK_N LOC = H9 ;
INST SYSCLK_P LOC = J9 ;
NET SYSCLK_P TNM_NET = sysclk;
TIMEGRP sysclk_grp = sysclk;
TIMESPEC TS_sysclk = PERIOD sysclk_grp 5 ns HIGH 50 %;
I like to reset the PHY chip while the clocking circuitry is waiting to start up:
NET "PHY_RESET" LOC = "AH13";
You will need to wire up the GMII interface pins:
INST "GMII_TXD<0>" LOC = "AM11";
INST "GMII_TXD<1>" LOC = "AL11";
INST "GMII_TXD<2>" LOC = "AG10";
INST "GMII_TXD<3>" LOC = "AG11";
INST "GMII_TXD<4>" LOC = "AL10";
INST "GMII_TXD<5>" LOC = "AM10";
INST "GMII_TXD<6>" LOC = "AE11";
INST "GMII_TXD<7>" LOC = "AF11";
INST "GMII_TX_EN" LOC = "AJ10";
INST "GMII_TX_ER" LOC = "AH10";
INST "GMII_TX_CLK" LOC = "AH12";
INST "GMII_RXD<0>" LOC = "AN13";
INST "GMII_RXD<1>" LOC = "AF14";
INST "GMII_RXD<2>" LOC = "AE14";
INST "GMII_RXD<3>" LOC = "AN12";
INST "GMII_RXD<4>" LOC = "AM12";
INST "GMII_RXD<5>" LOC = "AD11";
INST "GMII_RXD<6>" LOC = "AC12";
INST "GMII_RXD<7>" LOC = "AC13";
INST "GMII_RX_DV" LOC = "AM13";
INST "GMII_RX_ER" LOC = "AG12";
INST "GMII_RX_CLK" LOC = "AP11";
and you will need to comment out some constraints and also remove the GTX_CLK reference because we are going to derive this from the 200MHz clock.
In the top level design file you will need to make several changes including adding ports for the 200MHz clock input and the PHY reset and removing the GTX_CLK port:
-- Reference clock for IODELAYs
-- REFCLK : in std_logic;
SYSCLK_N, SYSCLK_P : in std_logic ;
PHY_RESET : out std_logic ;
To generate the clocks you should run the Clock Wizard and specify a 200MHz source clock with differential inputs:
On the next screen define a 125MHz clock output and a 200MHz clock output. Either select BUFG outputs (and then comment out the manually instantiated buffers in the top level design and replace them with regular signal assignments) or select no output buffer and make use of the IBUFGs and BUFGs in the existing design.
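The Clock Wizard works out the multiplier and divider values for you, but the arithmetic is easy to sanity-check. The Python sketch below assumes a simple model in which the input clock is multiplied by M and divided by D to give an internal VCO frequency, and each output is that frequency divided by an integer. The VCO range of 600 to 1200 MHz and the limits on M and D are assumptions for illustration only; take the real limits from the data sheet for your device and speed grade.

```python
from fractions import Fraction


def period_ns(freq_mhz: int) -> Fraction:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return Fraction(1000, freq_mhz)


def find_clock_settings(f_in_mhz, outputs_mhz, vco_min_mhz, vco_max_mhz,
                        max_mult=64, max_div=80):
    """Search for (M, D, output dividers) under the simple model above.

    f_vco = f_in * M / D, and each output is f_vco divided by an integer.
    Returns the first match with the smallest D, or None if there is none.
    """
    for d in range(1, max_div + 1):
        for m in range(1, max_mult + 1):
            f_vco = Fraction(f_in_mhz * m, d)
            if not (vco_min_mhz <= f_vco <= vco_max_mhz):
                continue
            # Every requested output must divide the VCO frequency exactly.
            if all((f_vco / f).denominator == 1 for f in outputs_mhz):
                return m, d, [int(f_vco / f) for f in outputs_mhz]
    return None


# 200 MHz in, 125 MHz and 200 MHz out; the VCO range is an assumed figure.
settings = find_clock_settings(200, [125, 200], 600, 1200)
print(settings, period_ns(200), period_ns(125))
```

The 200MHz period of 5 ns matches the PERIOD value used in the UCF timing constraint earlier, and the 125MHz output has a period of 8 ns.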
Wire up the clock generation core in the top level module:
clk_wiz : clk_wiz_v1_8
  port map (-- Clock in ports
            CLK_IN1_P => SYSCLK_P,
            CLK_IN1_N => SYSCLK_N,
            -- Clock out ports
            CLK_OUT1 => GTX_CLK,
            CLK_OUT2 => REFCLK,
            -- Status and control signals
            RESET => reset_i,
            LOCKED => locked);
I used the locked output to control the PHY reset.
Now you should be able to generate the bit-stream and test it out. You can add a ChipScope probe to see what is going on:
icon : chipscope_icon port map (CONTROL0 => CONTROL);
ila : chipscope_ila port map (CONTROL => CONTROL,
CLK => ll_clk_i,
DATA => DATA,
TRIG0 => TRIG0);
trig0(0) <= rx_ll_sof_n_i ;
DATA(7 downto 0) <= rx_ll_data_i ;
DATA(8) <= rx_ll_sof_n_i ;
DATA(9) <= rx_ll_eof_n_i ;
DATA(10) <= rx_ll_src_rdy_n_i ;
DATA(11) <= rx_ll_dst_rdy_n_i ;
DATA(19 downto 12) <= tx_ll_data_i ;
DATA(20) <= tx_ll_sof_n_i ;
DATA(21) <= tx_ll_eof_n_i ;
DATA(22) <= tx_ll_src_rdy_n_i ;
DATA(23) <= tx_ll_dst_rdy_n_i ;
and you should be able to observe an address-swap Ethernet echo:
which is also confirmed in Wireshark:
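If you export captured samples for offline inspection, the bit assignments made to the ChipScope DATA bus above can be unpacked in software. The Python sketch below assumes each sample is available as a 24-bit integer with DATA(0) as the least significant bit; the export format and the names LocalLinkSample and decode_sample are assumptions for illustration, not something ChipScope provides.

```python
from typing import NamedTuple


class LocalLinkSample(NamedTuple):
    rx_data: int
    rx_sof_n: int
    rx_eof_n: int
    rx_src_rdy_n: int
    rx_dst_rdy_n: int
    tx_data: int
    tx_sof_n: int
    tx_eof_n: int
    tx_src_rdy_n: int
    tx_dst_rdy_n: int


def decode_sample(word: int) -> LocalLinkSample:
    """Split one 24-bit DATA word into the LocalLink signals wired to it."""
    def bit(n: int) -> int:
        return (word >> n) & 1

    return LocalLinkSample(
        rx_data=word & 0xFF,            # DATA(7 downto 0)
        rx_sof_n=bit(8),
        rx_eof_n=bit(9),
        rx_src_rdy_n=bit(10),
        rx_dst_rdy_n=bit(11),
        tx_data=(word >> 12) & 0xFF,    # DATA(19 downto 12)
        tx_sof_n=bit(20),
        tx_eof_n=bit(21),
        tx_src_rdy_n=bit(22),
        tx_dst_rdy_n=bit(23),
    )


# Made-up sample: first received byte 0x6B with start-of-frame asserted (low),
# receive side ready, transmit side idle (all active-low signals high).
word = 0x6B | (1 << 9) | (0xF << 20)
sample = decode_sample(word)
print(sample)
```

Remember that the LocalLink control signals are active low, so a value of 0 in a field such as rx_sof_n means the condition is asserted.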
Now, how do you send a raw Ethernet packet from a Windows 7 64-bit machine? That might be the subject of another blog post…
Good luck!
Satnam
Reconfigurable Data Processing for Clouds
Along with Anil Madhavapeddy at the University of Cambridge I’ve been thinking about how to apply reconfigurable computing technology (for example FPGAs) in data-centres and in cloud computing systems. Reconfigurable computing for some time now has had the potential to make a huge impact on mainstream high performance computing. We will soon have a million LUTs or more of highly parallel fine-grain processing power, and the ability to define high bandwidth custom memory hierarchies offers a compelling combination of flexibility and performance. However, mainstream adoption of reconfigurable computing has been hampered by the need to use and maintain specialized FPGA-based boards and clusters and the lack of programming models that make this technology accessible to mainstream programmers. FPGAs do not enjoy first class operating system support and lack the application binary interfaces (ABIs) and abstraction layers that other co-processing technologies enjoy (most notably GPUs).
By placing FPGAs on the same blades as GPUs and multicore processors in the cloud and offering them as a managed service with a high level programming infrastructure we can see a new dawn for reconfigurable computing which makes this exciting technology available to millions of developers without taking on the overhead of buying and maintaining specialized hardware and without having to invest in complex tool-chains and programming models based on the low-level details of circuit design. The major limitation on the growth potential of data-centres is now energy consumption and it is here where alternative computing resources like FPGAs can make a significant impact allowing us to scale out cloud operations to an extent which is not possible using just conventional processors.
Just a decade ago, it was common for a company to purchase physical machines and place them with a hosting company. As the Internet’s popularity grew, their reliability and availability requirements also grew beyond a single data-centre. Companies such as Microsoft, Google and Amazon started building huge data-centres in the USA and Europe, with correspondingly larger energy demands. These providers had to provision for their peak load, and had much idle capacity at other times. At the same time, researchers were examining the possibility of dividing up commodity hardware into isolated chunks of computation and storage, which could be rented on-demand. The XenoServers project forecast a public infrastructure spread across the world, and the Xen hypervisor was developed as an open-source solution to partition multiple untrusted operating systems. Xen was adopted by Amazon to underpin its Elastic Compute Cloud (EC2) service, and Amazon thus became the first commercial provider of what is now dubbed “cloud computing”: renting a slice of a huge data-centre to provide on-demand computation resources that can be dynamically scaled up and down according to demand.
Cloud computing brought reconfigurable computing to the software arena. Hardware resources are now dynamic, and so sudden surges in load can be handled by adding more virtual machines to a server pool until the load subsides. This resulted in a wave of new data-centre components designed to scale across machines, ranging from storage systems like Dynamo and Cassandra to distributed computation platforms like Google’s MapReduce and Microsoft’s Dryad, which have also inspired FPGA-based map-reduce idioms.
Where are the Hardware Clouds?
Cloud data-centres have been encouraging horizontal scaling by increasing the number of hosts. The vertical scaling model of more powerful individual machines is now difficult due to the shift to multi-core CPUs instead of simply cranking up the clock speed. IBM notes that with “data centers using 10–30 times more energy per square foot than office space, energy use doubling every 5 years, and [..] delayed capital investments in new power plants”, something must change. The growth potential of these data-centres is now energy limited, and the inefficiency of the software stack is beginning to take its toll. This is an ideal time to dramatically improve the efficiency of data-centres by mapping common and large-scale tasks onto shared, million-LUT FPGA boards that complement the general-purpose hardware currently installed.
Challenges
Reconfigurable FPGA technology has not made good inroads into commodity deployment, and we now consider some of the reasons. One major problem is that the CAD tools that have been developed to date predominantly target a mode of use that is very different from what is required for reconfigurable computing in the cloud. CAD tools target an off-line scenario where the objective is to implement some function (e.g. network packet processing) in the smallest area (to reduce component cost by allowing the use of a smaller FPGA) whilst also meeting timing closure. The designed component becomes part of an embedded system (e.g. a network router) of which thousands or millions of units are made. Performance and utilisation are absolutely key and productivity has not been as paramount since the one-off engineering costs are amortised over many units.
The use of FPGAs for computing and co-processing has very different economics, which are poorly supported by the mainstream vendor tools. Reconfigurable computing elements in the cloud need CAD tools that prioritise flexibility and they need to be programmable by regular mainstream programmers familiar with technologies like .NET and Java; we cannot require them to be ace Verilog programmers. Mainstream tool vendors have responded poorly to the requirements of this constituency because up until now they have represented a very small part of the market.
Reconfigurable computing in the cloud promises to change that and there will be a new demand for genuinely high level synthesis tools that map programs to circuits. We need high level synthesis which works in a much broader sense than current tools, i.e. with a focus on applications rather than compute kernels. We need to be able to compile dynamic data-structures, recursion and very heterogeneous fine-grain parallelism and not just data-parallelism over well behaved static arrays. The need to synthesize applications rather than matrix-based kernel operations poses fresh new challenges for the high level synthesis community. This new scenario also enables new models of use for such tools, e.g. enabling run-once circuits, which is something that makes no sense in the conventional embedded use model.
In order to achieve the vision of a reconfigurable cloud based computing system we need to perform significant innovations in the area of CAD tools to make them fit for (the new) purpose. The closed nature of commercial vendor tools and the lack of alignment with the current business models of FPGA manufacturers mean that we cannot expect the required support and capabilities. In particular, we need to develop cloud based reconfigurable computing models which largely abstract the underlying architecture so a computation can be mapped to either a Xilinx or an Altera or another vendor part because we wish to make the reconfigurable resource a commodity item. However, vendors have conventionally aggressively resisted commoditisation. It is therefore essential for the reconfigurable computing research community to take full ownership of the programming chain starting at an unmapped netlist (perhaps against a generic library) all the way to a programming bit-stream plus the extra information required to support dynamic reconfiguration. The community already has excellent work on tools like VPR, Torc, and LegUp which can be used as a starting point for such a tool chain. A similar precedent in the software world is LLVM, which has grown into a mature framework for working on most aspects of compiler technology.
Overlay Architectures
When one implements an algorithm on a microprocessor the fundamental task involves working out how to “drive” a fixed architecture to accomplish the desired computation. When one implements an algorithm on an FPGA one can devise a tailored architecture for the given problem. In general the former approach affords flexibility and rapid development at the cost of performance and the latter affords performance at the cost of development effort. There is an important intermediate step that involves the use of overlay architectures where one implements a logical architecture (e.g. some form of vector processor) on top of a regular reconfigurable architecture like an FPGA. Indeed, there is a healthy level of research work in the area of soft vector processors, e.g. VESPA, VIPERS and VEGAS. This approach becomes particularly pertinent for cloud based deployment of reconfigurable systems because it makes two very important contributions:
- reconfiguration time is no longer an issue because what is downloaded to the FPGA is a ‘program’ for the overlay architecture and not a reconfiguration bit-stream
- the semantic gap between the regular programmers and the reconfigurable architecture and tools is reduced because now the overlay architecture acts like a contract between the programmer and the low level device
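To make the first point concrete, here is a toy software model of an overlay: a tiny vector machine whose ‘program’ is just a list of instructions. The instruction set (load, mul, add, store) and the class name VectorOverlay are invented for this illustration and do not correspond to VESPA, VIPERS or VEGAS. The only point is that changing the computation means sending a new instruction list to a fixed architecture, not a new bit-stream.

```python
class VectorOverlay:
    """Toy model of a fixed vector-processor overlay with named registers."""

    def __init__(self):
        self.registers = {}

    def run(self, program, memory):
        """Execute a list of (opcode, operands...) tuples against memory."""
        for op, *args in program:
            if op == "load":        # load reg, name: reg <- memory[name]
                reg, name = args
                self.registers[reg] = list(memory[name])
            elif op == "mul":       # mul dst, a, b: element-wise product
                dst, a, b = args
                self.registers[dst] = [
                    x * y for x, y in zip(self.registers[a], self.registers[b])]
            elif op == "add":       # add dst, a, b: element-wise sum
                dst, a, b = args
                self.registers[dst] = [
                    x + y for x, y in zip(self.registers[a], self.registers[b])]
            elif op == "store":     # store reg, name: memory[name] <- reg
                reg, name = args
                memory[name] = list(self.registers[reg])
            else:
                raise ValueError(f"unknown opcode: {op}")
        return memory


# The 'program' a user would submit: out = a * x + y, element by element.
program = [
    ("load", "v0", "a"), ("load", "v1", "x"), ("load", "v2", "y"),
    ("mul", "v3", "v0", "v1"),
    ("add", "v4", "v3", "v2"),
    ("store", "v4", "out"),
]
memory = {"a": [2, 2, 2], "x": [1, 2, 3], "y": [10, 20, 30]}
result = VectorOverlay().run(program, memory)
print(result["out"])
```

A different computation only needs a different program list; the VectorOverlay class, standing in for the configured FPGA, stays unchanged.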
One could imagine a cloud based reconfigurable computing system where a user submits a program along with the architecture required to execute it. The architecture specification could be just an identifier to specify a fixed system-provided architecture, a set of architectures required (which can be mixed and matched depending on resource availability and latency and energy requirements) or it may include an actual detailed circuit intermediate representation which is compiled in the cloud to a specific FPGA bit-stream. Furthermore, such approaches need the support of layout specification systems like Lava.
We also note that National Instruments just introduced the LabVIEW FPGA Cloud Compile mode which uses remote VMs to do layout; using software VMs to bootstrap specialized hardware is particularly appealing. Related to the idea of overlay architectures is the notion of building flexible memory hierarchies out of soft logic (CoRAM and LEAP) and these approaches are a promising start for managing access to shared off-chip memory for a virtualized reconfigurable computing resource in the cloud.
Domain Specific Languages (DSLs)
Can we devise one language to program heterogeneous computing systems? Absolutely not. It is far better to provide a rich infrastructure for the execution and coordination of computations (and processes) and allow the community to innovate solutions for specific situations. For example, some users will want to use OpenCL to express data-parallel computations that can be compiled to code for a multicore processor or a GPU, compiled to gates, or executed on a soft vector processor on an FPGA. We already have examples of projects that take OpenCL or a close variation for compilation to GPUs and FPGAs. Others might want to specify a MATLAB-based computation that streams data from a SQL-server data-base to process images and videos. Others might want to express SQL queries which are applied to information from databases in the cloud or applied to data streams from widely distributed sensors. Or perhaps someone wants to use Java with library based concurrency extensions. Rather than mandating one language and model we should instead allow the community to cultivate and incubate the necessary domain specific languages and the reconfigurable cloud based operating system should provide the “backplane” required to execute and compose these domain specific languages.
A concrete way to start defining an infrastructure is to develop a coordination and communication architecture for processes along with a concrete representation, e.g. using an XML Schema. This could then be used to compose and map computations (e.g. map this kernel to a computing resource that has this capability), specify data-sources (stream input from this SQL-server data-source, stream output to a DRAM-based resource) etc. A program is now a collection of communicating processes along with a set of recipes for combining the processes to achieve the desired result, and an invocation of a program involves the specification of non-functional requirements like “minimize latency” or “minimize energy consumption”.
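As a rough illustration of the mapping step, the Python sketch below places a kernel on whichever resource has the required capability and best meets the stated non-functional requirement. It uses plain Python dataclasses rather than an XML Schema, and every name and figure in it (Resource, Kernel, place, the latency and energy numbers) is hypothetical and invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    capabilities: frozenset
    latency_ms: float   # invented figure, for illustration only
    energy_mj: float    # invented figure, for illustration only


@dataclass(frozen=True)
class Kernel:
    name: str
    needs: str          # the capability this kernel requires


# Non-functional requirements mapped to the cost each one minimizes.
OBJECTIVES = {
    "minimize latency": lambda r: r.latency_ms,
    "minimize energy consumption": lambda r: r.energy_mj,
}


def place(kernel, resources, objective):
    """Pick the capable resource with the lowest cost for the objective."""
    candidates = [r for r in resources if kernel.needs in r.capabilities]
    if not candidates:
        raise LookupError(f"no resource offers {kernel.needs!r}")
    return min(candidates, key=OBJECTIVES[objective])


resources = [
    Resource("cpu", frozenset({"general", "data-parallel"}), 40.0, 900.0),
    Resource("gpu", frozenset({"data-parallel"}), 5.0, 600.0),
    Resource("fpga", frozenset({"data-parallel", "streaming"}), 8.0, 120.0),
]
kernel = Kernel("image-filter", needs="data-parallel")

fastest = place(kernel, resources, "minimize latency")
greenest = place(kernel, resources, "minimize energy consumption")
print(fastest.name, greenest.name)
```

The same kernel lands on different resources depending on the requirement supplied at invocation time, which is the separation between program and non-functional requirements described above.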
Research Directions and Hot Topics in Clouds
Cloud computing is quite new and evolving rapidly. We now list some of the major interest areas from that community, and some of the problems that large, shared reconfigurable FPGAs could help to solve.
Operating Systems
The traditional role of an operating system of partitioning physical hardware is quite different when virtualised in the cloud. Hypervisors expose simple network and storage interfaces to virtual machines, with the details of physical device drivers handled in a separate domain. Kernels that run on the cloud only need a few device drivers to work, and no longer have to support the full spectrum of physical devices. The figure below illustrates the difference between managing physical devices and using a portion of that resource from a virtualised application.
Split-trust devices on virtualised platforms have a management domain that partitions physical resources, allocates portions to guests, and enforces access rights. The details of the management policy are specific to the type of resource, such as storage or networking.
The management APIs in the control domain differ across resource types. The handling of networking involves bridging and topology and integration with systems such as OpenFlow. Storage management is concerned with snapshots and de-duplication for the blocks used by VMs. However, the front-end device exposed to the guests is simple; often little more than a shared-memory structure with a request/response channel. This technique is so far mainly used for I/O channels but could be extended to actually run computation over the data, with the same high-throughput and low-latency that existing I/O systems enjoy. The guest could specify the computation required (possibly as a DSL), and the management tools would implement the details of interfacing with the physical board and managing sharing across VMs. This is simple from the programmer’s perspective, and portable across different FPGA boards and tool-chains.
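The request/response front-end described above can be sketched in a few lines of Python. This is an assumption-laden toy, not how any particular hypervisor implements its devices: the `Channel` class, the `OPS` table and `backend_service` are hypothetical names, in-process queues stand in for shared memory pages, and ordinary Python functions stand in for the computation an FPGA would perform.

```python
from collections import deque


class Channel:
    """Toy model of a front-end device: a bounded request queue and a
    response queue shared between a guest and the management domain."""

    def __init__(self, slots):
        self.slots = slots
        self.requests = deque()
        self.responses = deque()

    def submit(self, req_id, op, data):
        """Guest side: enqueue a request, or report that the ring is full."""
        if len(self.requests) >= self.slots:
            return False
        self.requests.append((req_id, op, data))
        return True

    def poll(self):
        """Guest side: fetch the next response, or None if there is none."""
        return self.responses.popleft() if self.responses else None


# Hypothetical named computations the guest may ask for. In the scenario in
# the text these would be implemented on the FPGA by the management tools.
OPS = {"sum": sum, "max": max}


def backend_service(channel):
    """Management-domain side: drain requests and post results."""
    while channel.requests:
        req_id, op, data = channel.requests.popleft()
        channel.responses.append((req_id, OPS[op](data)))


ch = Channel(slots=2)
ch.submit(1, "sum", [1, 2, 3])
ch.submit(2, "max", [4, 9, 2])
backend_service(ch)
print(ch.poll(), ch.poll())
```

The guest only ever names a computation and hands over data; how the work is shared across VMs or mapped onto a board stays behind the channel.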
The availability of GPUs and programmable I/O boards has led to the development of new software architectures. Helios is a new operating system designed to simplify the task of writing, deploying and profiling code across heterogeneous platforms. It introduces “satellite kernels” that export a uniform set of OS abstractions, but are independent tasks that run across different resources.
Data-centre Programming
Processing large datasets requires partitioning them across many hosts, and so distributed data-flow frameworks have become popular in recent years. These frameworks expose a simple programming model, and transparently handle the difficult aspects of distribution: fault tolerance, resource scheduling, synchronization and message passing. MapReduce and Dryad are two popular frameworks for certain classes of algorithms, and more recently the CIEL execution engine has added support for iterative algorithms (e.g. k-means or binomial options pricing).
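For readers who have not used such a framework, the following single-process Python sketch shows the shape of the map/reduce programming model. It is not the API of MapReduce, Dryad or CIEL, and it leaves out everything that makes those systems interesting (distribution, scheduling, fault tolerance); the function names are invented for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply `mapper` to every record, group the emitted (key, value)
    pairs by key, then apply `reducer` to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    # Add up the counts collected for one word.
    return sum(values)


counts = map_reduce(["FPGA cloud", "cloud GPU cloud"], word_mapper, count_reducer)
print(counts)
```

The programmer writes only the mapper and the reducer; a real framework decides where each call runs and what to do when a host fails.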
These frameworks all build Directed Acyclic Graphs (DAGs), where the nodes represent data, and the edges are computation over the data. The run-time schedules computation on specific nodes, and iteratively walks the DAG until a result is obtained. It can also prepare the host before a node is processed, such as replicating some required input data from a remote host. This can also include compilation, and so an FPGA DSL could be transparently scheduled to hardware when it is available or executed in software when it is not. The main challenge is to weigh the cost of reconfiguring an FPGA against simply executing the computation in software, but this is made easier since the run-time can inspect the actual size of the input data. Mesos investigates how to partition physical resources across multiple frameworks operating on the same set of hosts, which is useful when considering fixed-size FPGA boards.
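The trade-off between reconfiguration cost and input size can be written down as a very simple cost model. The Python sketch below is an assumption for illustration only: the `choose_target` function and its default reconfiguration time and throughput figures are invented, and a real scheduler would measure or estimate these per task.

```python
def choose_target(input_bytes, fpga_free,
                  reconfig_s=0.5,         # assumed time to reconfigure the FPGA
                  sw_bytes_per_s=200e6,   # assumed software throughput
                  hw_bytes_per_s=2e9):    # assumed FPGA throughput
    """Decide where to run a task once the input size is known.

    Software pays no setup cost but processes data slowly; the FPGA pays a
    fixed reconfiguration cost and then processes data quickly."""
    if not fpga_free:
        return "software"
    sw_time = input_bytes / sw_bytes_per_s
    hw_time = reconfig_s + input_bytes / hw_bytes_per_s
    return "fpga" if hw_time < sw_time else "software"


print(choose_target(1e6, fpga_free=True))   # small input
print(choose_target(1e9, fpga_free=True))   # large input
```

With these made-up numbers, small inputs stay in software because the reconfiguration time dominates, while large inputs amortise it and move to hardware.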
The recent surge of new components designed specifically for data-centres also encourages research into new database models that depart from SQL and traditional ACID models. Mueller programmed data processing operators on top of large FPGAs, and concluded that the right computation model is essential (e.g. an asynchronous sorting network). Within these constraints however, they had comparable performance and significantly improved power-consumption and parallelisation—both areas essential to successful data-centre databases in the modern world.
A close integration between high-level host languages and FPGAs will greatly help adoption by mainstream programmers. The fact that C and C++ are considered low-level languages in the cloud, but high-level languages by FPGA programmers, is indicative of the cultural difference between the two communities! There are a number of promising efforts that embed DSLs in C/C++ code to ease their integration. MORA is a DSL for streaming vector and matrix operations, aimed at multimedia applications. Designs can be compiled into normal executables for functional testing, before being retargeted at a hardware array. MARC uses the LLVM compiler infrastructure to convert C/C++ code to FPGAs. Although performance is still lower than a manually optimised FPGA implementation, it takes significantly less effort to design and implement portably due to its higher-level approach. This quicker code/deployment/results cycle is essential for giving incremental feedback about code to the more casual programmer, who is renting cloud computing resources in order to save time in the first place.
Some languages now separate data-parallel processing explicitly so that they can utilise resources such as GPUs. Data parallel Haskell integrates the full range of types available in modern languages, and allows sum types, recursive types, higher-order functions and separate compilation. Accelerator is a library to synthesise data-parallel programs written in C# directly to FPGAs. A more radical embedding is via multi-stage programming, where programmers specify abstract algorithms in a high-level language that is put through a series of translation stages into the desired architecture. All of these approaches are highly relevant to reconfigurable FPGA computing in the cloud, as they extend existing, familiar programming languages with the constraints required to compile sub-sets into hardware. The Kiwi project at Microsoft and the Liquid Metal project at IBM both provide a route from high level concurrent programs to circuits.
Information Security
The cloud is often used to outsource processing over large datasets. The code implementing the batch-processing is often written in C or Fortran, and a bug in handling input data can let attackers execute arbitrary code on the host machine (see the figure below left). Although the hypervisor layer contains the attacker inside the virtual machine, they still have access to many of the local network resources, and worse, other (possibly sensitive) datasets. Exploits are mitigated by using software privilege separation, but this places trust in the OS kernel layer instead.
Malicious data can be crafted to exploit memory errors and execute as code on a CPU (left). With an FPGA interface, it cannot execute arbitrary code, and the CPU never iterates over the data (right).
Data processing on the cloud using reconfigurable FPGAs offers an exceptional improvement in security by shifting that trust from software to hardware. Specialising algorithms into FPGAs entirely removes the capability of attackers to run arbitrary code, thus enforcing strong privilege separation. The application compiles its algorithms to an FPGA, and never directly manipulates the data itself via the CPU. Malicious data never gets the opportunity to run on the host CPU, and instead only a small channel exists between the OS and FPGA to communicate results (see figure above right). The conventional threat model for FPGAs is that a physical attacker can compromise its hardware SRAM. When deployed in the cloud, the attacker cannot gain physical access, leaving few attack vectors.
Moving beyond low-level security, there is also a realisation that data contents need protection against untrusted cloud infrastructure. Encoding data processing tasks across FPGAs enforces a data-centric view of computation, distinct from coordinating computation (e.g. load balancing or fault tolerance, which cannot compromise the contents of data). Programming language researchers have mechanisms for encoding information flow constraints, and more recently, statistical privacy properties. These techniques are often too intrusive to fully integrate into general-purpose languages, but are ideal for the domain-specific data-flow languages which provide the interface between general-purpose and reconfigurable FPGA computing.
Another intriguing development is homomorphic encryption, which permits computation over encrypted data without ever being able to decrypt the underlying data. The utility of homomorphic encryption has been recognised for decades, but it has so far been extremely expensive to implement. Cloud computing revitalises the problem, as malicious providers might be secretly recording data or manipulating results. Recently, there have been several lattice-based cryptography schemes that reduce the complexity cost of homomorphic encryption. Lattice reduction can be significantly accelerated via FPGAs; Detrey et al. report a speedup of 2.12 for an FPGA versus a multi-core CPU of comparable cost. This points to a future where reconfigurable million-LUT FPGAs could be used to perform computation where even the cloud vendor is untrusted!
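To give a feel for what "computation over encrypted data" means, here is a deliberately tiny Python sketch using textbook RSA, which happens to be homomorphic for multiplication only. This is not one of the lattice-based schemes mentioned above, it is not secure (tiny primes, no padding), and the function names are invented for the example. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy key material: far too small for real use.
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (modular inverse)


def encrypt(m):
    return pow(m, E, N)


def decrypt(c):
    return pow(c, D, N)


def multiply_encrypted(c1, c2):
    """What an untrusted party can do: combine two ciphertexts without
    knowing D, and without learning the plaintexts."""
    return (c1 * c2) % N


c = multiply_encrypted(encrypt(7), encrypt(6))
print(decrypt(c))   # the product of the two plaintexts, modulo N
```

The party doing the multiplication never sees 7 or 6, yet the key holder recovers their product. Fully homomorphic schemes extend this idea to arbitrary computation, at a much higher cost.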
Reducing the cost of cryptography in the cloud could also have significant social impact. The Internet has seen large-scale deployment of anonymity networks such as Tor and FreeNet for storing data. Due to the encryption requirements imposed by onion-routing, access to such networks remains slow and high-latency. There have been proposals to shift the burden of anonymous routing into the cloud to fix this, but reducing the cost (financially) remains one of the key barriers to more widespread adoption of anonymity. This is symptomatic of the broader problem of improving networking performance in the cloud. Central control systems such as OpenFlow are rapidly gaining traction, along with high-performance implementation in hardware. Virtual networking is reconfigured much more often than hardware setups (e.g. for load-balancing or fault tolerance), and services such as Tor further increase the gate requirements as computation complexity increases.
The challenge, then, for integration into the cloud, is how to unify the demands of data-centric processing, language integration and network processing into a single infrastructure. Specific problems that have to be addressed in addition to those mentioned above include:
- The need for better OS integration, device models, and abstractions (as with split-trust in Xen described earlier).
- Without an ABI, software re-use and integration is very difficult. How can (for example) OpenSSL take advantage of an FPGA?
- Debugging and visualization support. General purpose OSes provide a hypervisor-kernel-userspace-language runtime model that gets progressively easier and higher-level to debug. Abstraction boundaries exist there where they don’t in current FPGAs. Staged programming or functional testing in a general-purpose system makes this easier.
- We need to develop a common set of concepts, principles and models for application execution on reconfigurable computing platforms to allow collaboration between universities and companies and to provide a solid framework to build new innovations and applications. This kind of eco-system has been sadly lacking for reconfigurable systems.
It is encouraging that cloud computing is driven by charging at a finer granularity for solving problems. Reconfigurable FPGAs have driven down the cost of many types of computation commonly found on the cloud, and thus a community-driven deployment of a cloud setup with rentable hardware would provide a focal point to “fill in the blanks” for reconfigurable FPGA computing in the cloud.
It is not clear that these goals can be achieved by a collection of parallel independent university and industry projects. What is needed is a coordinated research program involving members of the reconfigurable computing community working with each other and researchers in cloud computing to define a new vision of where we would like to go and then set standards etc. to try and achieve that goal. For this to happen we will need some kind of wide ranging joint research project proposal or standardization effort.
Call to Arms
Reconfigurable computing is at the cusp of rising up from being a niche activity accessible to only a small group of experts to becoming a mainstream computing fabric used in concert with other heterogeneous computing elements like GPUs. For this to become a reality we need to combine some of the successes in FPGA-based research with new thinking about programming models to create a development environment for “civilian programmers”. This will require collaboration between researchers in architecture, CAD tools, programming languages and types, run-time system development, web services, scripting and orchestration, re-targetable compilation, instrumentation and monitoring of heterogeneous systems, and failure management. Furthermore, the requirements of reconfigurable computing in a shared cloud service context also place new requirements on CAD tools and architectures which are at odds with their current requirements. Today FPGA vendors produce architectures for use in an embedded context to be programmed by digital design engineers.
Yesterday’s programmers of reconfigurable systems were highly trained digital designers using Verilog. Today we are at the cusp of a revolution which will make tomorrow’s users of reconfigurable technology regular software engineers who will map their algorithms onto a heterogeneous mixture of computing resources to achieve currently unachievable levels of performance, management of energy consumption and the execution of scenarios which promise an ever more interconnected world. Here we have set out a vision for a reconfigurable computing system in the cloud, identified important research challenges and promising research directions and illustrated scenarios that are made possible by reconfigurable computing in the cloud.
Computing a Parallel Facebook Life
Soon there will be so much information about our lives entered into social network systems that we will be able to perform calculations with this historical data to compute “what-if” scenarios, i.e. compute a parallel life we could have had if we had made some different decision in the past. For example, if I had gone to Bill’s party instead of Craig’s party two years ago I would have met Bob there and never have met Anders, and so on. We could synthesize an entire parallel life based on all the information about us in the form of our status updates, tagged photographs etc., which can be used to compute a new Facebook profile suggesting a different path our lives could have followed.
Alternatively, perhaps we could construct a Facebook profile and set of recent status updates we wish we had (sipping martinis at the W at Times Square!) and then we can try to compute backwards what changes we need to make to our actual Facebook profile and evolution to try and end up at a different point today.
Finally, once we have a sufficient convergence of technology and biology we can download the parallel Facebook profile of an imagined life and reflash our brains to make it our actual life. Of course, you need to reflash everyone else in your life (and everyone else that is not in your life but will be after the reflash).
Learn about alternative high performance computing at the FPGA 2011 conference
The ACM/SIGDA International Symposium on Field-Programmable Gate Arrays is the premier conference for presentation of advances in all areas related to FPGA technology. FPGAs for computing used to be confined to a niche market of engineers that were willing to carefully hand craft circuits to get extreme performance that could not be achieved with regular processors or GPUs. Today FPGAs are becoming one of the exciting new processing elements that form the heart of heterogeneous computing systems that combine the strengths of multicore processors, GPUs and spatial computing elements like FPGAs. The conference will be of interest to mainstream programmers and technology experts that want to learn about emerging trends in extreme computing.
The conference is taking place from 27 February to 1 March in Monterey and includes a pre-conference workshop on Sunday 27 February as well as regular sessions on Monday 28 and Tuesday 1 March. Registration is now open for the FPGA 2011 conference. Please visit the site http://isfpga.org to take advantage of early registration rates and also to book accommodation at the venue hotel.
You can also download a nice poster for the conference at http://isfpga.org/FPGA2011.pdf to print out and stick on your door, notice boards etc. to help advertise the conference. Thank you very much.
Kind regards,
Satnam Singh
Compiling C# Programs into FPGA Circuits: Factorial Example
This blog article gives a practical introduction to how C# programs can be compiled into FPGA circuits using the Kiwi system developed by myself and David Greaves at the University of Cambridge Computer Lab. Our starting point will be a C# program developed inside Visual Studio 2010 for computing factorial:
This program can be compiled and executed inside Visual Studio to report the factorial of 5 (i.e. 120).
However, using our Kiwi system this program can also be converted into a Verilog circuit. The program is decorated with some custom attributes which specify which static method should be turned into hardware; the number of bits used to represent input and output ports and how the circuit should behave relative to a circuit clock.
The generated Verilog netlist can be simulated using a Verilog simulator. The Modelsim simulator output produced by simulating this circuit with an input of 5 for the factorial circuit is shown below. After a few clock cycles the result 120 is produced and the done output bit goes high.
Or if you like you can run the simulation on the command line:
The generated Verilog can be used as input to the Xilinx design tools to generate a programming bit-stream to program an FPGA. The screen-shot below also includes a top-level module which instantiates some debugging input and sets the default input to the factorial circuit to be 5.
The generated programming bit-stream can then be run on a real FPGA board. I used a Xilinx ML605 which has a Xilinx Virtex-6 XC6VLX240T-1FG1156 FPGA on it:
I can check the functionality of the circuit by running a software logic analyzer on my PC which reports the value of internal signals of the circuit running on the FPGA.
The logic analyzer shows that this circuit indeed computes the factorial of 5 to be 120 after 4 clock cycles. This circuit is running at 100MHz.
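If you do not have an FPGA board to hand, the clocked behaviour described above (a result register that is updated on each clock cycle plus a done bit that goes high at the end) can be mimicked in software. The Python sketch below is only a hand-written model made for this purpose: it is not the C# source, not the Verilog that Kiwi generates, and the class and method names are invented. The schedule Kiwi actually produces may differ from the one modelled here.

```python
class FactorialCircuit:
    """Software model of a small sequential circuit: two registers
    (accumulator and counter) and a `done` output bit."""

    def __init__(self, n):
        self.acc = 1           # result register
        self.i = n             # counter register, loaded from the input port
        self.done = n <= 1     # nothing to multiply for 0 or 1

    def tick(self):
        """One clock cycle: a single multiply and a single decrement."""
        if not self.done:
            self.acc *= self.i
            self.i -= 1
            self.done = self.i <= 1


def simulate(n):
    """Clock the model until `done` goes high; return (result, cycles)."""
    circuit = FactorialCircuit(n)
    cycles = 0
    while not circuit.done:
        circuit.tick()
        cycles += 1
    return circuit.acc, cycles


result, cycles = simulate(5)
print(result, cycles)
```

In this model one multiplication happens per clock tick, so the number of cycles grows with the input value, which is the kind of behaviour the custom attributes on the C# method control in the real flow.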
You can get a warm and fuzzy feeling by looking at the floor-plan of the generated circuit (light blue specks on the middle left hand side) which also includes the debug circuitry required to support the logic analyzer:
This lower-level view shows the actual routing tracks for the wires.