

## Mirko Mariotti <sup>1,2</sup> Giulio Bianchini <sup>1</sup> Loriano Storchi <sup>3,2</sup> Daniele Spiga <sup>2</sup> Diego Ciangottini <sup>2</sup>

<sup>1</sup>Dipartimento di Fisica e Geologia, Università degli Studi di Perugia

<sup>2</sup>INFN sezione di Perugia

<sup>3</sup>Dipartimento di Farmacia, Università degli Studi G. D'Annunzio

ICSC FPGA Course, 27/06/2025





The examples are in the folder day5/bmexamples. They work both in the terminal and in the Jupyter notebooks.

Each directory contains a project and is referred to by a number in the slides (as the next slide shows).




To install the BondMachine framework

Make it available in a Jupyter notebook



## Current challenges in computing

#### Von Neumann Bottleneck:

New computational problems show that current architectural models have to be improved or changed to address future payloads.

#### Energy Efficient computation:

- Not wasting "resources" (silicon, time, energy, instructions).
- Using the right resource for the specific case.

#### Edge/Fog/Cloud Computing:

- Making the computation where it makes sense.
- Avoiding the transfer of unnecessary data.
- Creating consistent interfaces for distributed systems.


A field programmable gate array (FPGA) is an integrated circuit whose logic is re-programmable.

- Parallel computing
- Highly specialized
- Energy efficient





- Array of programmable logic blocks
  - Logic blocks configurable to perform complex functions
- The configuration is specified with a hardware description language (HDL)



The use of FPGAs in computing is growing for several reasons:

- They can potentially deliver great performance via massive parallelism.
- They can address payloads that do not perform well on uniprocessors (Neural Networks, Deep Learning).
- They can efficiently handle non-standard data types.




FPGAs are playing an increasingly important role in industry, sampling and data processing.





**Deep Learning** 

In the industrial field

- Intelligent vision;
- Financial services;
- Scientific simulations;
- Life science and medical data analysis;

In the scientific field

- Real time deep learning in particle physics;
- Hardware trigger of LHC experiments;


And many others ...



On the other hand, the adoption of FPGAs poses several challenges:

- Porting of legacy code is usually hard.
- Interoperability with standard applications is problematic.




# Firmware generation

Many projects have the goal of abstracting the firmware generation and use process.



Today's computer architectures are:

Multi-core: two or more independent processing units execute multiple instructions at the same time.

- The power is given by the number of cores.
- Parallelism has to be addressed.

Heterogeneous: different types of processing units.

- Cell, GPU, Parallella, TPU.
- The power is given by the specialization.
- Data transfer between units has to be addressed.
- Payload scheduling has to be addressed.


## Processors in FPGA



It is a common practice to use FPGAs to implement processors. Some processors are directly created by the FPGA manufacturer, some are open-source, some are proprietary.

With the advent of the RISC-V architecture, this practice is likely to become more and more common.

This is also the case for the BondMachine project, but with a different approach.













# Introducing the BondMachine (BM)

The BondMachine is a software ecosystem for the dynamic generation of computer architectures that:

- Are composed of many, possibly hundreds of, computing cores.
- Have very small cores, not necessarily of the same type (different ISA and ABI).
- Have no fixed way of interconnecting cores.
- May have some elements shared among cores (for example channels and shared memories).




The computational unit of the BM

The atomic computational unit of a BM is the "connecting processor" (CP) and has:

- Some general purpose registers of size Rsize.
- Some I/O dedicated registers of size Rsize.
- A set of implemented opcodes chosen among many available.
- Dedicated ROM and RAM.
- Three possible operating modes.


#### General purpose registers

$2^R$ registers: r0, r1, r2, r3, ..., r($2^R - 1$)


#### I/O specialized registers

- N input registers: i0, i1, ..., i(N-1)
- M output registers: o0, o1, ..., o(M-1)


#### Full set of possible opcodes

adc, add, addf, addf16, addi, addp, and, chc, chw, cil, cilc, cir, cirn, clc, clr, cmpr, cpy, cset, dec, div, divf, divf16, divp, dpc, expf, hit, hlt, i2r, i2rw, incc, inc, j, ja, jc, jcmpa, jcmpl, jcmpo, jcmpria, jcmprio, je, jri, jria, jrio, jgt0f, jo, jz, k2r, lfsr82r, m2r, m2rri, mod, mulc, mult, multf, multf16, multp, nand, nop, nor, not, or, q2r, r2m, r2mri, r2o, r2owa, r2owaa, r2q, r2s, r2v, r2vri, r2t, r2u, ro2r, ro2rri, rsc, rset, sic, s2r, saj, sbc, sub, t2r, u2r, wrd, wwr, xnor, xor


#### RAM and ROM

- 2<sup>L</sup> RAM memory cells.
- 2<sup>O</sup> ROM memory cells.


#### Operating modes

- Full Harvard mode.
- Full Von Neumann mode.
- Hybrid mode.
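
The quantities that define a CP can be collected in a small data structure. The following is a minimal Python sketch and not part of the BondMachine tools: the class name `ConnectingProcessorSpec` and the field names `register_bits`, `ram_bits` and `rom_bits` are hypothetical, chosen only to mirror the parameters listed for a CP (register size, number of registers, I/O registers, opcodes, RAM and ROM).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectingProcessorSpec:
    """Toy description of a CP. All names here are illustrative assumptions."""

    register_size: int   # Rsize: width in bits of every register
    register_bits: int   # R: the CP has 2**R general purpose registers
    ram_bits: int        # exponent of the number of RAM cells
    rom_bits: int        # exponent of the number of ROM cells
    inputs: int          # N: number of input registers
    outputs: int         # M: number of output registers
    opcodes: tuple = ()  # subset of the available opcodes

    def general_registers(self):
        # Names run from r0 up to r(2**R - 1).
        return [f"r{i}" for i in range(2 ** self.register_bits)]

    def input_registers(self):
        return [f"i{i}" for i in range(self.inputs)]

    def output_registers(self):
        return [f"o{i}" for i in range(self.outputs)]

    def ram_cells(self):
        return 2 ** self.ram_bits

    def rom_cells(self):
        return 2 ** self.rom_bits


# Example: a small CP with 8 registers of 32 bits, 3 inputs and 2 outputs.
cp = ConnectingProcessorSpec(
    register_size=32,
    register_bits=3,
    ram_bits=8,
    rom_bits=8,
    inputs=3,
    outputs=2,
    opcodes=("clr", "cpy", "dec", "inc", "je", "jz"),
)
print(cp.general_registers())
print(cp.input_registers(), cp.output_registers())
```

Keeping the exponents rather than the cell counts makes it explicit that register files and memories are always sized as powers of two.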

## Shared Objects (SO)



more about these




A multi-core architecture that is completely heterogeneous in both core types and interconnections.

The BondMachine may have many cores, possibly all different, arbitrarily interconnected and sharing non-computing elements.


The BM computer architecture is managed by a set of tools to:

- build a specific architecture
- modify a pre-existing architecture
- simulate or emulate the behavior
- generate the Hardware Description Language (HDL) code

Processor Builder: selects the single processor, assembles and disassembles, saves on disk as JSON, creates the HDL code of a CP.

BondMachine Builder: connects CPs and SOs together in custom topologies, loads and saves on disk as JSON, creates the BM's HDL code.

Simulation Framework: simulates the behavior, emulates a BM on a standard Linux workstation.



#### Examples

(32 bit registers counter machine)

procbuilder -register-size 32 -opcodes clr,cpy,dec,inc,je,jz

(Input and Output registers)

procbuilder -inputs 3 -outputs 2 ...
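
To get a feeling for what a counter machine built from the six opcodes `clr, cpy, dec, inc, je, jz` can compute, here is a toy interpreter in Python. It is only a sketch: the operand order and the jump semantics (`jz reg, target` and `je reg, reg, target`, with targets given as instruction indexes) are assumptions made for illustration and are not taken from the BondMachine ISA definition.

```python
def run(program, registers, register_size=32, max_steps=10_000):
    """Run a list of (opcode, *operands) tuples on a copy of `registers`.

    Execution stops when the program counter moves past the last instruction.
    """
    mask = (1 << register_size) - 1  # values wrap around at the register width
    regs = dict(registers)
    pc = 0
    steps = 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        op, *args = program[pc]
        pc += 1
        if op == "clr":
            regs[args[0]] = 0
        elif op == "cpy":
            regs[args[0]] = regs[args[1]]
        elif op == "inc":
            regs[args[0]] = (regs[args[0]] + 1) & mask
        elif op == "dec":
            regs[args[0]] = (regs[args[0]] - 1) & mask
        elif op == "jz":
            if regs[args[0]] == 0:
                pc = args[1]
        elif op == "je":
            if regs[args[0]] == regs[args[1]]:
                pc = args[2]
        else:
            raise ValueError(f"opcode not implemented: {op}")
    return regs


# Add r1 to r0 using only increments, decrements and jumps.
# "je r0, r0" is always true, so it acts as an unconditional jump.
add_program = [
    ("jz", "r1", 4),        # 0: leave the loop when r1 reaches zero
    ("inc", "r0"),          # 1
    ("dec", "r1"),          # 2
    ("je", "r0", "r0", 0),  # 3: back to the loop test
]
result = run(add_program, {"r0": 5, "r1": 3})
print(result)
```

The example shows why such a small opcode set is still useful: addition is not an opcode here, yet it can be composed from the ones that are implemented.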




#### Examples

(Loading a CP)

procbuilder -load-machine conproc.json ...

(Saving a CP)

procbuilder -save-machine conproc.json ...




#### Examples

(Create the CP RTL code in Verilog)

procbuilder -create-verilog ...

(Create testbench)

procbuilder -create-verilog-testbench test.v ...




To create a simple processor

To assemble and disassemble code for it

To produce its HDL code

Bondmachine is the tool that composes CPs and SOs to form BondMachines.

- BM CP insert and remove
- BM SO insert and remove
- BM Inputs and Outputs
- BM Bonding Processors and/or IO
- BM Visualizing or HDL

#### Examples

(Add a processor)

bondmachine -add-domains proc.json ... ; ... -add-processor 0

(Remove a processor)

bondmachine -bondmachine-file bmach.json -del-processor n


#### Examples

(Add a Shared Object)

bondmachine -add-shared-objects specs ...

(Connect an SO to a processor)

bondmachine -connect-processor-shared-object ...


#### Examples

(Adding inputs or outputs)

bondmachine -add-inputs ... ; bondmachine -add-outputs ...

(Removing inputs or outputs)

bondmachine -del-input ... ; bondmachine -del-output ...


#### Examples

(Bonding processor)

bondmachine -add-bond p0i2,p1o4 ...

(Bonding IO)

bondmachine -add-bond i2,p0i6 ...
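
The bond arguments in these examples appear to follow a compact naming scheme: an optional processor prefix (`p0`), then `i` or `o` for input or output, then an index; an endpoint without the prefix (`i2`) refers to an input or output of the BM itself. The Python sketch below parses such strings. The scheme is inferred from the examples only, and the function names `parse_endpoint` and `parse_bond` are hypothetical, not part of the bondmachine tool.

```python
import re

# Optional "p<number>" prefix, then "i" or "o", then the register index.
_ENDPOINT = re.compile(r"^(?:p(?P<proc>\d+))?(?P<kind>[io])(?P<index>\d+)$")


def parse_endpoint(text):
    """Split one endpoint such as 'p0i2' or 'i2' into its parts."""
    match = _ENDPOINT.match(text)
    if match is None:
        raise ValueError(f"not a valid endpoint: {text!r}")
    proc = match.group("proc")
    return {
        # None means the endpoint belongs to the BM itself, not to a processor.
        "processor": None if proc is None else int(proc),
        "kind": "input" if match.group("kind") == "i" else "output",
        "index": int(match.group("index")),
    }


def parse_bond(text):
    """Split a bond written as '<endpoint>,<endpoint>' into two endpoints."""
    left, right = text.split(",")
    return parse_endpoint(left), parse_endpoint(right)


print(parse_bond("i2,p0i6"))
print(parse_endpoint("p1o4"))
```

Rejecting anything that does not match the pattern is a deliberate choice in this sketch, so that a mistyped endpoint fails early instead of creating a wrong connection.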


#### Examples

(Visualizing)

bondmachine -emit-dot ...

(Create RTL code)

bondmachine -create-verilog ...




To create a single-core BondMachine

To attach an external output

To produce its HDL code

#### Toolchain and helper tool

A BondMachine Project is a directory containing all the necessary files to build a BondMachine.

A set of tools has been developed to simplify the creation and maintenance of BM projects.





A set of toolchains allows BondMachines to be built and deployed directly to a target device.

In addition, a helper tool called *bmhelper* has been developed to simplify the creation and maintenance of BM projects.

- doctor: checks whether the tools are correctly installed
- create: creates a new BM project
- validate: validates a BM project by checking the presence of all the necessary variables
- apply: finalizes the BM project by adding the necessary files

#### Makefile

A file local.mk contains references to the source code as well as all the build necessities.

#### Toolchain main targets

- make bondmachine: creates the JSON representation of the BM and assembles its code
- make hdl: creates the HDL files of the BM
- make show: displays a graphical representation of the BM
- make simulate [simbatch]: starts a simulation [batch simulation]
- make accelerator: creates an accelerator IP from the BM
- make design: creates an accelerator design
- make bitstream [design bitstream]: creates the firmware [accelerator firmware]
- make program: flashes the firmware onto the target device
- make xclbin: creates a platform firmware
- make clean: removes all the build files

#### Kernel config style

Complementary to the Makefile and the local.mk file, a kernel-config-style file is used to specify the build operations.

[Screenshot: menu-based configuration interface with the submenus "Board and toolchain", "External modules" and "Testing and debugging".]




To explore the toolchain

To flash the board with the code from the previous example

Hands-on 03 bis is similar




To build a BondMachine with a processor and a shared object

To flash the board



To build a dual-core BondMachine

To connect cores

To flash the board

# Simulation

An important feature of the tools is the possibility of simulating BondMachine behavior.

An event input file describes how BondMachine elements have to change during the simulation timespan and which ones have to be reported.

The simulator can produce results in the form of:

- Activity log of the BM internals.
- Graphical representation of the simulation.
- Report file with quantitative data, useful to construct metrics.

#### Graphical simulation in action



To show the simulation capabilities of the framework



The same engine that simulates BondMachines can be used as an emulator.

Through the emulator BondMachines can be used on Linux workstations.


### Molding the BondMachine

Main tools

As stated before, BondMachines are not general purpose architectures and, to be effective, have to be shaped according to the specific problem.

Several methods (apart from writing in assembly and building a BondMachine from scratch) have been developed to do that:

- *bondgo*: A new type of compiler that creates not only the CP assembly but also the architecture itself.
- *basm*: The BondMachine Assembler.
- A set of tools to use BondMachine in Machine Learning.
- *bmqsim*: A quantum computer simulator.




Bondgo is the name chosen for the compiler developed for the BondMachine.

The compiler source language is Go, as the name suggests.

### This is the standard flow when building computer programs

[Figure: build flow starting from a high-level language source.]







### Bondgo does something different from standard compilers ...

[Figure: build flow starting from a high-level Go source.]




































To create a BondMachine from a Go source file

- To build the architecture
- To build the program
- To create the firmware and flash it to the board

... it can do even more interesting things when compiling concurrent programs.

[Figure: build flow starting from a high-level Go source.]















Compiling the code with the bondgo compiler:

bondgo -input-file ds.go -mpm

The toolchain performs the following steps:

- Maps the two goroutines to two hardware cores.
- Creates two types of core, each one optimized to execute the assigned goroutine.
- Creates the two binaries.
- Connects the two cores as inferred from the source code, using special I/O registers.

The result is a multicore BondMachine.



Compiling Architectures

#### One of the most important results

The architecture creation is a part of the compilation process.
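
The idea that the architecture falls out of the compilation can be pictured with a toy model. The Python sketch below is not how bondgo is implemented: it simply assumes one core per goroutine and one output-to-input bond per channel, and the function name `map_goroutines` and the endpoint naming (`p<core>o<index>`, `p<core>i<index>`) are illustrative assumptions.

```python
def map_goroutines(goroutines, channels):
    """Assign one core per goroutine and one bond per channel.

    goroutines: list of goroutine names, in source order.
    channels: list of (sender, receiver) pairs of goroutine names.
    Returns the goroutine-to-core mapping and the list of bonds.
    """
    core_of = {name: index for index, name in enumerate(goroutines)}
    next_output = {name: 0 for name in goroutines}
    next_input = {name: 0 for name in goroutines}
    bonds = []
    for sender, receiver in channels:
        # Each channel consumes one output register on the sender core
        # and one input register on the receiver core.
        source = f"p{core_of[sender]}o{next_output[sender]}"
        target = f"p{core_of[receiver]}i{next_input[receiver]}"
        next_output[sender] += 1
        next_input[receiver] += 1
        bonds.append((source, target))
    return core_of, bonds


# Two goroutines exchanging data over one channel in each direction.
cores, bonds = map_goroutines(
    ["producer", "consumer"],
    [("producer", "consumer"), ("consumer", "producer")],
)
print(cores)
print(bonds)
```

In this model the number of cores and the number of I/O registers per core are outputs of the mapping rather than inputs to it, which is the point the slide makes about architecture creation being part of compilation.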




To use bondgo to create a chain of interconnected processors

To flash the firmware to the board

The BondMachine assembler, basm, is the tool complementary to the compiler.

It is a standard assembler that can be used to build code for the BondMachine. Given the "fluid" nature of the BM architectures, basm has some unique features:

- Support for code fragments
- Support for template-based assembly code
- Fragment composition: combining and rewriting
- Building hardware from assembly
- Software/hardware rearrangement capabilities
- LLVM IR import


#### Basm example

    %section code1 .romtext
        entry _start    ; Entry point
    _start:
        clr r0
        rset r0,49
        rset r1,45
        mov vtm0:[r1], r0
        rset r0, 50
        r2v r0, 128
        clr r0
        j _start
    %endsection

    %meta cpdef cpu1 romcode: code1, ramsize:8
    %meta sodef videomemory constraint:vtextmem:0:3:3:16:16
    %meta soatt videomemory cp: cpu1, index:0
    %meta bmdef global registersize:8




To create a BondMachine from a Basm source file

- To build the accelerator
- To build the xclbin
- To upload the xclbin to the board and use it



To create complex multi-cores from boolean expressions



Depending on the board, several ways of using BMs as accelerators are possible:

- USB connection: The BM and the host are connected via USB. A custom protocol over serial (BMMRP) is used to communicate with the board.
- AXI MM on SoC (kernel): The BM and the PS are on the same chip and the communication is done via AXI MM. BMMRP is also used here, but implemented in a custom kernel module.
- AXI MM on SoC (Pynq): The BM and the PS are on the same chip and the communication is done via AXI MM. The Pynq framework is used for the BM.
- AXI Stream on SoC (Pynq): The BM and the PS are on the same chip and the communication is done via AXI Stream. The Pynq framework is used for the BM.
- AXI Stream on PCIe (Pynq): The BM is connected to the host PC via PCIe and the communication is done via AXI Stream. The XRT platform is used to communicate with the BM via Pynq.




GCC with -O0











## Interconnection firmware

The input and output buses are the endpoints that we would like to have on the Linux system.




The Advanced eXtensible Interface Protocol

AXI is a communication bus protocol defined by ARM as part of the Advanced Microcontroller Bus Architecture (AMBA) standard. There are 3 types of AXI interfaces:

- AXI Full: for high-performance memory-mapped requirements.
- AXI Lite: for low-throughput memory-mapped communication.
- AXI Stream: for high-speed streaming data.

[Screenshot: AXI interface settings of the bondmachine_v1.0 IP. Interface name: S00_AXI; Interface Mode: Slave; Data Width: 32 bits; Number of Registers: configurable in the range 4..512]




## Linux

Now that we have custom accelerated hardware, we need a Linux distro to run on it.

#### **Common Features**

- Complete system build from source
- Allow choice of kernel and bootloader
- Support for modifying packages with patches or custom configuration files
- Can build cross-toolchains for development
- Convenient support for read-only root filesystems
- Support offline builds
- The build configuration files integrate well with SCM tools

#### Yocto

- Convenient sharing of build configuration among similar projects (meta-layers)
- Larger community (Linux Foundation project)
- Can build a toolchain that runs on the target
- A package management system

#### Buildroot

- Simple Makefile approach, easier to understand how the build system works
- Reduced resource requirements on the build machine
- Very easy to customize the final root filesystem (overlays)

Credits: https://jumpnowtek.com/linux/Choosing-an-embedded-linux-build-system.html





#### Kernel module

The accelerator endpoints are exposed via AXI memory mapping as memory locations of the ARM processor running Linux.

To properly use the accelerator from user space, the kernel has to handle the accelerator endpoints and make them available to user space.

We developed a kernel module for our accelerators. It manages 3 data flows:





#### Kernel from and to user space: char device

Communication happens through the standard read and write system calls on a kernel-generated char device.

A language has been implemented for the desired operations.
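As a rough illustration of this read/write flow, the sketch below sends one text command with a single `write` call and reads a reply back with a single `read` call. The command string `set i0 1` is purely hypothetical, since the slides show neither the device name nor the command language. A temporary regular file stands in for the char device so that the example runs on any machine.

```python
import os
import tempfile


def write_command(dev_path: str, command: str) -> int:
    """Send one text command with a single write() system call."""
    fd = os.open(dev_path, os.O_WRONLY)
    try:
        return os.write(fd, command.encode("ascii"))
    finally:
        os.close(fd)


def read_reply(dev_path: str, max_bytes: int = 256) -> str:
    """Fetch data with a single read() system call."""
    fd = os.open(dev_path, os.O_RDONLY)
    try:
        return os.read(fd, max_bytes).decode("ascii")
    finally:
        os.close(fd)


def demo() -> str:
    # Stand-in for the char device: a regular temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        # Hypothetical command text, not the real BondMachine syntax.
        write_command(path, "set i0 1\n")
        return read_reply(path)
    finally:
        os.remove(path)
```

On a real board the path passed to these helpers would be the device node created by the kernel module, and the reply would come from the module rather than from the file contents.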





AXI guarantees consistency and transfer to the firmware input ports. Moreover, the data flow from the kernel cannot saturate the PL part.


#### Firmware to kernel: IRQ

The data flow from the FPGA to the PS part is a different story: data can easily flow fast enough to saturate the PS part and make it completely unusable.

- The firmware collects all the changes to send and fills in a list using a dedicated AXI register.
- It stops accepting new changes from the IP.
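The flow-control idea described for the firmware-to-kernel path can be pictured with a small software model. This is only a sketch under assumptions: the class and method names are hypothetical, the choice to keep just the latest value per output is an assumption, and the real mechanism lives in firmware and in the kernel module, not in Python.

```python
class ChangeCollector:
    """Toy model of the firmware-side change list (hypothetical names)."""

    def __init__(self) -> None:
        self.pending = {}         # output index -> latest value (assumed policy)
        self.irq_pending = False  # True while the kernel has not read the list

    def on_change(self, index: int, value: int) -> bool:
        """Called for every output change coming from the IP."""
        if self.irq_pending:
            return False          # list is frozen: the change is not accepted
        self.pending[index] = value
        return True

    def raise_irq(self) -> bool:
        """Freeze the list and signal the kernel once."""
        if self.pending and not self.irq_pending:
            self.irq_pending = True
        return self.irq_pending

    def kernel_drain(self) -> list:
        """Interrupt-handler side: read the whole list, then re-arm."""
        changes = sorted(self.pending.items())
        self.pending.clear()
        self.irq_pending = False
        return changes
```

In this model the PS side handles one interrupt per batch of changes instead of one per change, which is the property that keeps a fast PL from overwhelming it.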









Check of the correctness of the accelerator results

Benchmark of the execution













### Correctness and module debug

To verify the correct computation of the accelerator:

- use a tool to monitor the AXI memory
- write directly to the AXI memory-mapped input addresses (through devmem)
- check the AXI memory-mapped output addresses

[Screenshot: output of `./monitor -a 0x43c00000 -n 8`, listing byte by byte the 32-bit registers mapped from 0x43c00000 to 0x43c00053: the inputs, PS2PL, STATES, the outputs, bench, PL2PS and CHANGE]

`devmem 0x43c00000 b 1`
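The register layout visible in the monitor dumps can be summarised with a small helper. The ordering used here (n inputs, PS2PL, STATES, n outputs, bench, PL2PS, CHANGE, each 4 bytes wide) is inferred from the addresses and labels in the dumps rather than from a specification, and the function names are hypothetical.

```python
BASE = 0x43C00000  # base address used in the monitor dumps
WORD = 4           # each register spans 4 byte addresses


def register_map(n: int, base: int = BASE) -> dict:
    """Return {register name: address of its lowest byte} for n inputs/outputs."""
    names = [f"i{k}" for k in range(n)]
    names += ["PS2PL", "STATES"]
    names += [f"o{k}" for k in range(n)]
    names += ["bench", "PL2PS", "CHANGE"]
    return {name: base + WORD * idx for idx, name in enumerate(names)}


def devmem_write(address: int, value: int) -> str:
    """Build a devmem byte-write command in the form shown on the slide."""
    return f"devmem 0x{address:08x} b {value}"


regs = register_map(8)
print(devmem_write(regs["i0"], 1))   # write 1 to the first input
print(hex(regs["o0"]))               # where to look for the first output
```

With 13 inputs and outputs the same helper places `o6` at 0x43c00054, which matches the row labelled o6 in the error example.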

### An example of error

[Screenshot: output of `./monitor -a 0x43c00000 -n 13`, listing the registers i0 to i12, PS2PL, STATES, o0 to o12, bench, PL2PS and CHANGE mapped from 0x43c00000 to 0x43c0007b, with the byte at 0x43c00055 (register o6) highlighted]



The FPGA benchmarks do not include the PS part overhead (the comparisons are not really fair)

# Benchmark: the CPU (Golang)

- Time measures: built-in Golang facilities
- Energy measures: perf
- Intel(R) Xeon(R) CPU E3-1270 v5 @ 3.60GHz
- Go 1.18.2

| 2   | 0.00543209 | 259280   | 3.858015-00 |
|-----|------------|----------|-------------|
| 8   | 0.01831868 | 484280   | 2.305478-06 |
| 4   | 0.02399964 | 722280   | L304002-06  |
| 5   | 0.00632906 | 1870400  | 9.34235-07  |
|     | 0.00570083 | 1471400  | 0.736234-07 |
| 7   | 0.07363833 | 1835800  | 5.355822-47 |
|     | 0.09997730 | 2737800  | 0.05364E-87 |
| 1.1 | 0.12227912 | 3429000  | 2.515136-07 |
| 30  | 0.36490378 | 4465800  | 2.338396-47 |
| 11  | 0.00173032 | 5530300  | L80822E-87  |
| 32  | 0.34205632 | 6643300  | L505216-87  |
| 11  | 0.38554472 | 1752800  | 1.208238-07 |
| 34  | 0.35400825 | 8954800  | L.13582E-07 |
| 15  | 0.3061176  | 18630508 | 9.40434E-00 |
| 33  | 0.41550508 | 11832200 | 8.416518-00 |
| 37  | 0.5084054  | 13004308 | 7.35042-08  |
| 35  | 0.5063083  | 15124508 | 6.52550-08  |
| 11  | 0.63335665 | 17024400 | 5.306336-00 |
| 20  | 0.708354   | 18718300 | 5.0728-08   |
| 21  | 0.3553206  | 22133800 | 4.517908-00 |
| 11  | 0.0030085  | 22525300 | 4.250706-00 |
| 23  | 0.07467220 | 27348930 | 3.454754-01 |
| 24  | 1.3031791  | 28358308 | 3.429958-05 |










Benchmarking an IP is not an easy task.

Fortunately we have a custom design and an FPGA.

We can put the benchmark tool inside the accelerator.






# Benchmark core clock cycles distributions



#### FPGA benchmark summary

| N  | Single op time (µs) | Register LUTs | Slice LUTs | Power (W) | Single op energy (pJ) | CPs |
|----|---------------------|---------------|------------|-----------|-----------------------|-----|
| 2  | 0.1044              | 947           | 875        | 0.005     | 522                   | 6   |
| 4  | 0.1587              | 1457          | 1813       | 0.015     | 2380.5                | 20  |
| 8  | 0.2819              | 3131          | 4897       | 0.049     | 13813.1               | 72  |
| 13 | 0.4456              | 6422          | 12819      | 0.138     | 61492.8               | 182 |
| 16 | 0.5234              | 7950          | 15979      | 0.160     | 83744                 | 272 |
| 24 | 0.7432              | 10974         | 22669      | 0.199     | 147896.8              | 600 |
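The energy column of the summary table appears to be the product of the power and time columns. The sketch below checks that reading, assuming the power is expressed in watts (the table itself gives no unit, but only watts reproduce the tabulated picojoules). It also converts times to clock cycles using the 100 MHz clock listed in the FPGA setup. The observation that the CP count equals N·(N+1) is read off the numbers, not stated in the slides.

```python
CLOCK_MHZ = 100  # clock frequency listed for the FPGA setup

# (N, single op time in us, power in W (assumed unit), energy in pJ, CPs)
ROWS = [
    (2, 0.1044, 0.005, 522.0, 6),
    (4, 0.1587, 0.015, 2380.5, 20),
    (8, 0.2819, 0.049, 13813.1, 72),
    (13, 0.4456, 0.138, 61492.8, 182),
    (16, 0.5234, 0.160, 83744.0, 272),
    (24, 0.7432, 0.199, 147896.8, 600),
]


def energy_pj(time_us: float, power_w: float) -> float:
    """W * us gives microjoules; multiply by 1e6 to get picojoules."""
    return power_w * time_us * 1e6


def clock_cycles(time_us: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Number of clock periods covered by one operation."""
    return time_us * clock_mhz


for n, t_us, p_w, e_pj, cps in ROWS:
    print(n, round(energy_pj(t_us, p_w), 1), e_pj,
          round(clock_cycles(t_us), 2), cps == n * (n + 1))
```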

#### Benchmark core



## Comparisons: Performance



## Comparisons: Energy








## FPGA

- Digilent Zedboard
- SoC: Zynq XC7Z020-CLG484-1
- 512 MB DDR3
- Vivado 2020.2
- 100 MHz
- PYNQ 2.6 (custom build)



# Different boards

All tests were done using the **Zedboard** device, but BondMachine also supports other boards, including ones from different vendors (Intel, Lattice).

- Xilinx Zynq-7000 SoC: 85000 logic cells, 53200 look-up tables (LUTs)
- PCIe card: 2800000 logic cells, 1732000 look-up tables (LUTs)
- FPGA cluster (ICSC, national supercomputing center): Xilinx and Intel FPGAs

and often a bottleneck ...



# Machine Learning with BondMachine

Architectures with multiple interconnected processors, like the ones produced by the BondMachine Toolkit, are a perfect fit for Neural Networks and Computational Graphs.

Several ways to map these structures to BondMachine have been developed:

- A native Neural Network library
- A TensorFlow to BondMachine translator
- An NNEF-based BondMachine composer


### Machine Learning with BondMachine: Native Neural Network library

The tool *neuralbond* allows the creation of BM-based neural chips from a Go API.

Neurons are converted to BondMachine connecting processors.

Tensors are mapped to CP connections.




### Machine Learning with BondMachine: NNEF Composer

Neural Network Exchange Format (NNEF) is a standard from Khronos Group to enable the easy transfer of trained networks among frameworks, inference engines and devices.

The NNEF BM tool approach is to descend NNEF models and build BondMachine multi-cores accordingly.

This approach has several advantages over the previous ones:

- It is not limited to a single framework
- NNEF is a textual file, so no complex operations are needed to read models

# BM inference: A first tentative idea

A neuron of a neural network can be seen as a Connecting Processor (CP) of a BM.

The softmax neuron is generated from the following assembly template, which evaluates $e^{z_j}$ with a truncated series:

    %section softmax .romtext iomode:sync
        entry _start    ; Entry point
    _start:
        mov r8, 0f0.0
    {{range $y := intRange "0" .Params.inputs}}
    {{printf "i2r r1,i%d\n" $y}}
        mov r0, 0f1.0
        mov r2, 0f1.0
        mov r3, 0f1.0
        mov r4, 0f1.0
        mov r5, 0f1.0
        mov r7, {{$.Params.expprec}}
    loop{{printf "%d" $y}}:
        multf r2, r1
        multf r3, r4
        addf r4, r5
        mov r6, r2
        divf r6, r3
        addf r0, r6
        dec r7
        jz r7,exit{{printf "%d" $y}}
        j loop{{printf "%d" $y}}
    exit{{printf "%d" $y}}:
    {{$z := atoi $.Params.pos}}
    {{if eq $y $z}}
        mov r9, r0
    %endsection

[Figure: neural network diagram with inputs X1 to X4, hidden-layer neuron H1, output-layer softmax neurons S1 and S2, and outputs Y1 and Y2, next to a plot of the softmax output $\sigma(\vec{z})_i$.]

### From idea to implementation

The starting point is high-level code: a NN model is trained with **TensorFlow** and exported in a standard format. This format is interpreted by **neuralbond**, which converts the nodes and weights of the network into a set of heterogeneous processors.






### A first test

Dataset info:

- **Dataset name**: Banknote Authentication
- **Description**: Dataset on the distinction between genuine and counterfeit banknotes. The data was extracted from images taken from genuine and fake banknote-like samples.
- **Number of features**: 4
- **Classification**: binary
- **Samples**: 1097

Neural network info:

- **Class**: fully connected multilayer perceptron
- **Layers**:
  - one hidden layer with 1 linear neuron
  - one output layer with 2 softmax neurons
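As a rough illustration of the topology described above (4 inputs, one linear hidden neuron, two softmax outputs), the forward pass can be sketched in plain Python. The weights, biases and input sample below are hypothetical placeholders, not the trained values used in the course, and the softmax uses the usual max-subtraction trick for numerical stability, which is a choice of this sketch.

```python
import math

def forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass of a 4-1-2 perceptron: linear hidden neuron, softmax output."""
    # Hidden layer: a single linear neuron (weighted sum, identity activation).
    h = sum(w * xi for w, xi in zip(hidden_w, x)) + hidden_b
    # Output layer: two neurons, each fed by the single hidden value.
    z = [w * h + b for w, b in zip(out_w, out_b)]
    # Softmax, shifted by max(z) for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical parameters and input sample (assumptions for illustration only).
hidden_w = [0.5, -0.25, 0.1, 0.3]
hidden_b = 0.0
out_w = [1.0, -1.0]
out_b = [0.0, 0.0]
sample = [1.0, 2.0, 3.0, 4.0]

probs = forward(sample, hidden_w, hidden_b, out_w, out_b)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Each of the three neurons in this sketch corresponds to one unit that neuralbond would turn into a processor; the softmax step is the expensive part, which is why it is the target of the optimization discussed later.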

Graphic representation:





To build a BondMachine with a trained Neural Network

Interact with the BondMachine via Jupyter

### Benchcore

Fortunately we have a custom design and an FPGA.

We can put the benchmark tool inside the accelerator.






## Inference evaluation

Evaluation metrics used:

- **Inference speed**: time taken to predict a sample, i.e. the time between the arrival of the input and the change of the output, measured with the **benchcore**;
- **Resource usage**: LUTs and registers in use;
- **Accuracy**: the average percentage of error on the probabilities.
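The accuracy metrics above can be sketched as follows. This is only one plausible reading: the error is taken here as the mean absolute difference between the software and hardware probabilities for one output, and the prediction agreement as the share of samples where both pick the same class. The function names and the sample values are hypothetical.

```python
def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

def mean_abs_error(reference, measured, column):
    """Mean absolute difference on one probability column."""
    diffs = [abs(r[column] - m[column]) for r, m in zip(reference, measured)]
    return sum(diffs) / len(diffs)

def agreement_percent(reference, measured):
    """Percentage of samples where both sides predict the same class."""
    same = sum(argmax(r) == argmax(m) for r, m in zip(reference, measured))
    return 100.0 * same / len(reference)

# Hypothetical (prob0, prob1) pairs, for illustration only.
software = [(0.70, 0.30), (0.55, 0.45), (0.40, 0.60)]
hardware = [(0.69, 0.31), (0.56, 0.44), (0.41, 0.59)]

print("err prob0:", mean_abs_error(software, hardware, 0))
print("err prob1:", mean_abs_error(software, hardware, 1))
print("agreement:", agreement_percent(software, hardware), "%")
```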



 $\sigma$ : 2875.94

Mean: 10268.45

Latency: 102.68 µs
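The mean and the latency above are consistent if the benchcore values are read as clock-cycle counts at the 100 MHz clock of the Zedboard setup. That unit is an assumption of this sketch, not something the slide states explicitly.

```python
CLOCK_HZ = 100e6  # 100 MHz, the clock listed for the Zedboard setup

def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a clock-cycle count to microseconds."""
    return cycles / clock_hz * 1e6

mean_cycles = 10268.45   # "Mean" above, assumed to be clock cycles
sigma_cycles = 2875.94   # "sigma" above, assumed to be clock cycles

print(f"latency = {cycles_to_us(mean_cycles):.2f} us")
print(f"spread  = {cycles_to_us(sigma_cycles):.2f} us")
```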

#### Resource usage

| resource | value | occupancy |
|----------|-------|-----------|
| regs     | 15122 | 28.42%    |
| luts     | 11192 | 10.51%    |





# A first example of optimization

### Remember the softmax function?

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}$$

$$e^{x} = \sum_{l=0}^{K} \frac{x^{l}}{l!}$$

The softmax template runs this series for `expprec` iterations:

    %section softmax .romtext iomode:sync
        entry _start    ; Entry point
    _start:
        mov r8, 0f0.0
    {{range $y := intRange "0" .Params.inputs}}
    {{printf "i2r r1,i%d\n" $y}}
        mov r0, 0f1.0
        mov r2, 0f1.0
        mov r3, 0f1.0
        mov r4, 0f1.0
        mov r5, 0f1.0
        mov r7, {{$.Params.expprec}}
    loop{{printf "%d" $y}}:
        multf r2, r1
        multf r3, r4
        addf r4, r5
        mov r6, r2
        divf r6, r3
        addf r0, r6
        dec r7
        jz r7,exit{{printf "%d" $y}}
        j loop{{printf "%d" $y}}
    exit{{printf "%d" $y}}:
    {{$z := atoi $.Params.pos}}
    {{if eq $y $z}}
        mov r9, r0
    %endsection





















| K  | Latency   | Err prob0  | Err prob1  | Pred |
|----|-----------|------------|------------|------|
| 1  | 9.23 µs   | 0.0990     | 0.0990     | 100% |
| 2  | 13.11 µs  | 0.0193     | 0.0193     | 100% |
| 3  | 17.50 µs  | 0.0053     | 0.0053     | 100% |
| 5  | 29.11 µs  | 3.1070E-05 | 3.1071E-05 | 100% |
| 8  | 39.13 µs  | 6.5562E-07 | 6.6098E-07 | 100% |
|    | 47.66 µs  | 1.6162E-07 | 1.6525E-07 | 100% |
|    | 63.12 µs  | 1.6470E-07 | 1.6652E-07 | 100% |
|    | 79.46 µs  | 1.6470E-07 | 1.6652E-07 | 100% |
| 20 | 102.68 µs | 1.6470E-07 | 1.6652E-07 | 100% |

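Decreasing the number of iterations alone reduces the inference time by about a factor of 10 (from 102.68 µs at K = 20 to 9.23 µs at K = 1), at the cost of a larger error on the probabilities.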
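The trade-off between K and accuracy can be reproduced in software with a minimal sketch of the same truncated series. It runs in Python double precision rather than on the hardware floating-point units, and the input values are hypothetical, so the errors it prints will not match the table; only the trend is comparable.

```python
import math

def exp_series(x, k):
    """Approximate e**x with the terms l = 0..k of its Taylor series.

    The variables mirror the registers of the assembly loop:
    power ~ r2 (x**l), fact ~ r3 (l!), counter ~ r4 (l), total ~ r0.
    """
    total, power, fact, counter = 1.0, 1.0, 1.0, 1.0
    for _ in range(k):
        power *= x              # multf r2, r1
        fact *= counter         # multf r3, r4
        counter += 1.0          # addf r4, r5
        total += power / fact   # divf r6, r3 ; addf r0, r6
    return total

def softmax_series(z, k):
    """Softmax built on the truncated exponential."""
    exps = [exp_series(v, k) for v in z]
    norm = sum(exps)
    return [e / norm for e in exps]

def softmax_exact(z):
    """Reference softmax using math.exp."""
    exps = [math.exp(v) for v in z]
    norm = sum(exps)
    return [e / norm for e in exps]

z = [0.4, -0.4]  # hypothetical pre-activations of the two output neurons
exact = softmax_exact(z)
for k in (1, 2, 3, 5, 8, 20):
    approx = softmax_series(z, k)
    err = max(abs(a - b) for a, b in zip(approx, exact))
    print(f"K={k:2d}  max abs error = {err:.3e}")
```

Each extra iteration costs one more pass through the loop body, so latency grows roughly linearly with K while the error shrinks much faster, which is why a small K is often enough.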

### Open the notebook

Another notebook is […] runs from different accelerators.

| Software prob0 | Software prob1 | Software class | BondMachine prob0 | BondMachine prob1 | BondMachine class |
|----------------|----------------|----------------|-------------------|-------------------|-------------------|
| 0.6895         | 0.3104         | 0              | 0.6895            | 0.3104            | 0                 |
| 0.5748         | 0.4251         | 0              | 0.5748            | 0.4251            | 0                 |
| 0.4009         | 0.5990         | 1              | 0.4009            | 0.5990            | 1                 |

The output of the BM corresponds to the software output.




# Floating point FloPoCo

**FloPoCo** is an open source software project that provides a toolchain for automatically generating floating-point arithmetic operators implemented in hardware.

Example: generating a pipelined floating-point adder with an 8-bit exponent and a 23-bit fraction for a 300 MHz target:

    ./flopoco pipeline=yes frequency=300 FPAdd wE=8 wF=23
    Final report:
    |---Entity RightShifter_24_by_max_26_F300_uid4
    |      Pipeline depth = 1
    |---Entity IntAdder_27_f300_uid8
    |      Not pipelined
    |---Entity LZCShifter_28_to_28_counting_32_F300_uid16
    |      Pipeline depth = 2
    |---Entity IntAdder_34_f300_uid20
    |      Not pipelined
    Entity FPAdd_8_23_F300_uid2
       Pipeline depth = 6
    Output file: flopoco.vhdl

Features:

- exponent size and mantissa size can take arbitrary values
- 0, $\infty$ and NaN are encoded in explicit exception bits
  - not as special exponent values
  - two more exponent values available in FloPoCo
  - hardware efficient

| 2   | 1 | $w_E$ | $w_F$ |
|-----|---|-------|-------|
| exn | S | E     | F     |
# Tests FloPoCo implementation

We've already seen the pros and cons of changing the numerical precision:

| Pros                          | Cons                      |
|-------------------------------|---------------------------|
| Reduced memory usage          | Reduced accuracy          |
| Increased computational speed | Increased rounding errors |
| Lower power consumption       | Limited range             |

How much computationally faster are the arithmetic operations implemented by FloPoCo?

How do latency, accuracy, occupancy and power consumption vary when changing the numerical precision and the number of terms $K$ in the series for the exponential?

# Tests and results with FloPoCo




| K (19-bit FloPoCo) | Latency  | Err prob0 | Err prob1  | | K (16-bit FloPoCo) | Latency  | Err prob0 | Err prob1 | Pred   |
|--------------------|----------|-----------|------------|-|--------------------|----------|-----------|-----------|--------|
| 1                  | 3.80 µs  | 0.1229    | 0.009      | | 1                  | 3.59 µs  | 1.3935    | 0.099     | 99.27% |
| 2                  | 5.04 µs  | 0.0193    | 0.0193     | | 2                  | 5.93 µs  | 0.0192    | 0.0191    | 100%   |
| 3                  | 6.44 µs  | 0.0054    | 0.0054     | | 3                  | 6.21 µs  | 0.0057    | 0.0057    | 100%   |
| 5                  | 9.21 µs  | 0.00024   | 0.00025    | | 5                  | 8.74 µs  | 0.00125   | 0.0019    | 100%   |
| 8                  | 13.33 µs | 0.00010   | 9.9151E-05 | | 8                  | 12.54 µs | 0.00125   | 0.0019    | 100%   |
| 10                 | 15.95 µs | 0.00010   | 9.9151E-05 | | 10                 | 15.04 µs | 0.0012    | 0.0019    | 100%   |
| 13                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 20.17 µs      | 0.00010   | 9.9151E-05 |     | 13                                                                                                                        | 19.32 µs | 0.0026    | 0.0025    | 99.63% |
| 16                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 23.70 µs      | 0.00010   | 9.9151E-05 |     | 16                                                                                                                        | 22.87 µs | 0.0037    | 1.8113    | 99.63% |
| 20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | 29.67 µs      | 0.00010   | 9.9151E-05 |     | 20                                                                                                                        | 27.91 µs | 0.0060    | 4.1385    | 98.54% |


# Tests and results with FloPoCo

16-bit FloPoCo

| K  | Latency  | Err prob0 | Err prob1 | Pred   |
|----|----------|-----------|-----------|--------|
| 1  | 3.59 µs  | 1.3935    | 0.099     | 99.27% |
| 2  | 5.93 µs  | 0.0192    | 0.0191    | 100%   |
| 3  | 6.21 µs  | 0.0057    | 0.0057    | 100%   |
| 5  | 8.74 µs  | 0.00125   | 0.0019    | 100%   |
| 8  | 12.54 µs | 0.00125   | 0.0019    | 100%   |
| 10 | 15.04 µs | 0.0012    | 0.0019    | 100%   |
| 13 | 19.32 µs | 0.0026    | 0.0025    | 99.63% |
| 16 | 22.87 µs | 0.0037    | 1.8113    | 99.63% |
| 20 | 27.91 µs | 0.0060    | 4.1385    | 98.54% |




Results with FloPoCo

How do latency, accuracy, occupancy and power consumption vary when the numerical precision changes?

| Bits | Luts  | Usage  |
|------|-------|--------|
| 11   | 4704  | 8.84%  |
| 16   | 7738  | 14.54% |
| 19   | 7202  | 13.54% |
| 32   | 14306 | 26.89% |











| Bits | Power   |
|------|---------|
| 11   | 0.096 W |
| 16   | 0.163 W |
| 19   | 0.198 W |
| 32   | 0.487 W |



### Linear quantization

Linear quantization is a widely used technique in signal processing. In neural networks in particular, it **reduces memory usage and computational complexity** by representing values with fewer bits, enabling **efficient deployment on resource-constrained devices** (but it may introduce some loss of accuracy).



**BMnumbers** translates the floating point number into the quantized equivalent using the data type lqs[s]t[t]

bmnumbers --show native -cast lqs16t1 -linear-data-range 1,ranges.txt "05<16>010010110" 0lq<16.1>13.73291015625

**Corrected** signed integer instructions are used in hardware.

Quantized networks can be **simulated** to check if the precision is acceptable.
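The idea can be illustrated with a minimal sketch of symmetric linear quantization. This is a generic textbook scheme, not necessarily the exact encoding behind the `lqs[s]t[t]` data type; the function names, the `max_abs` range parameter and the clamping policy are assumptions made for the example.

```python
def quantize(x, max_abs, bits=16):
    """Map a float in [-max_abs, max_abs] to a signed integer of `bits` bits.

    Assumption: symmetric range, round-to-nearest, values outside the range are clamped.
    """
    qmax = 2 ** (bits - 1) - 1          # largest representable signed integer
    scale = max_abs / qmax              # size of one quantization step
    q = round(x / scale)
    q = max(-qmax, min(qmax, q))        # clamp to the representable range
    return q, scale


def dequantize(q, scale):
    """Recover the approximate float value from the quantized integer."""
    return q * scale


if __name__ == "__main__":
    value = 13.73291015625
    q, scale = quantize(value, max_abs=16.0, bits=16)
    approx = dequantize(q, scale)
    print(f"integer code: {q}")
    print(f"reconstructed: {approx}")
    print(f"absolute error: {abs(approx - value)}")
```

Inside the valid range the reconstruction error is bounded by half a quantization step, which is the quantity a simulation of the quantized network would ultimately be checking against the required accuracy.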

### Quantization: tests, results and analysis

Linear quantization reduces memory usage and computational complexity by representing values with fewer bits, enabling efficient deployment on resource-constrained devices (but it may introduce some loss of accuracy).

FloPoCo

| Bits | Luts        | Regs      | Power  | Latency | Pred |
|------|-------------|-----------|--------|---------|------|
| 16   | 7738 (14%)  | 5487 (5%) | 0.163W | 6.21 µs | 100% |
| 32   | 14306 (26%) | 9264 (8%) | 0.487W | 6.84 µs | 100% |

Quantization

| Bits | Luts        | Regs      | Power  | Latency | Pred |
|------|-------------|-----------|--------|---------|------|
| 8    | 2013 (3%)   | 2054 (2%) | 0.024W | 1.60 µs | 91%  |
| 16   | 5259 (9%)   | 2774 (3%) | 0.087W | 1.60 µs | 99%  |
| 32   | 11823 (22%) | 4718 (5%) | 0.203W | 1.61 µs | 99%  |
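To make the comparison between the FloPoCo and quantized results concrete, the short script below computes the relative savings at equal bit width. The numbers are copied from the tables above; the energy figure is only a rough estimate that assumes the reported power is drawn for the whole duration of one inference, which the slides do not state.

```python
# 16- and 32-bit rows copied from the FloPoCo and quantization tables.
flopoco = {
    16: {"luts": 7738, "power_w": 0.163, "latency_us": 6.21},
    32: {"luts": 14306, "power_w": 0.487, "latency_us": 6.84},
}
quantized = {
    16: {"luts": 5259, "power_w": 0.087, "latency_us": 1.60},
    32: {"luts": 11823, "power_w": 0.203, "latency_us": 1.61},
}


def saving(before, after):
    """Relative reduction, as a fraction of the 'before' value."""
    return (before - after) / before


def energy_uj(row):
    """Rough energy per inference in microjoules (W * µs = µJ); an assumption, see above."""
    return row["power_w"] * row["latency_us"]


if __name__ == "__main__":
    for bits in (16, 32):
        f, q = flopoco[bits], quantized[bits]
        print(f"{bits} bit:")
        print(f"  LUT saving:     {saving(f['luts'], q['luts']):.1%}")
        print(f"  power saving:   {saving(f['power_w'], q['power_w']):.1%}")
        print(f"  latency saving: {saving(f['latency_us'], q['latency_us']):.1%}")
        print(f"  energy estimate: {energy_uj(f):.3f} µJ -> {energy_uj(q):.3f} µJ")
```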




To build a BondMachine with a trained Neural Network ...

… with floating point 16bit precision

Interact with the BondMachine via Jupyter



To build a BondMachine with a trained Neural Network ...

… with fixed point 16bit

Interact with the BondMachine via Jupyter

# The tools (neuralbond+basm) create a graph of relations among fragments of assembly

A fragment does not necessarily have to be mapped to a single CP:

- Fragments can be arbitrarily rearranged into CPs.
- The resulting firmwares are identical in terms of the computing outcome, but differ in occupancy and latency.
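The trade-off described above can be pictured with a toy model. It is purely illustrative and does not reproduce how neuralbond or basm work internally: the fragment names, the cycle costs and the rule "CPs run in parallel, fragments inside one CP run one after the other" are all assumptions chosen for the example.

```python
# Hypothetical fragments: name -> (function, cost in clock cycles).
fragments = {
    "neuron0": (lambda x: 2.0 * x, 4),
    "neuron1": (lambda x: x + 1.0, 4),
    "neuron2": (lambda x: -x, 4),
}


def run(mapping, x):
    """Evaluate all fragments under a mapping.

    `mapping` is a list of CPs, each CP being a list of fragment names.
    Returns (results, occupancy, latency) where occupancy is the number of CPs
    and latency is the cycle count of the slowest CP.
    """
    results = {}
    latency = 0
    for cp in mapping:
        cycles = 0
        for name in cp:
            fn, cost = fragments[name]
            results[name] = fn(x)
            cycles += cost              # fragments on the same CP are serialized
        latency = max(latency, cycles)  # CPs are assumed to run in parallel
    return results, len(mapping), latency


one_per_cp = [["neuron0"], ["neuron1"], ["neuron2"]]
collapsed = [["neuron0", "neuron1", "neuron2"]]

if __name__ == "__main__":
    for label, mapping in (("one fragment per CP", one_per_cp), ("collapsed", collapsed)):
        results, occupancy, latency = run(mapping, 3.0)
        print(label, results, f"CPs={occupancy}", f"cycles={latency}")
```

In this model both mappings produce the same results, while the collapsed one uses fewer CPs at the price of a longer latency.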








Prune a processor and find out the outcomes



Collapse processors and find out the outcomes



Copy a project directory and try pruning, collapsing, simulating and the assembly of the neurons





# Quantum Computing

Given the capabilities of the BondMachine in terms of parallelism, speed, and customizability of both the instruction set and the numerical precision, it is natural to ask whether the BondMachine could be used to simulate quantum computers.



A quantum computer simulator called bmqsim has been developed and is available within the BondMachine project.


# Quantum Circuit

The first ingredient for bmqsim is a quantum circuit: a sequence of quantum gates, represented by a sequence of matrices. The "program" is a .bmq file that contains code similar to QASM.



Independently of the backend, bmqsim translates the .bmq file into N matrices.
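What "a circuit as a sequence of matrices" means can be sketched in plain Python. The two-qubit circuit below (a Hadamard followed by a controlled-NOT) is a generic textbook example, not the content of any specific .bmq file, and the helper names `matvec` and `kron` are introduced only for this illustration.

```python
import math


def matvec(m, v):
    """Multiply a square matrix (list of rows) by a state vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def kron(a, b):
    """Kronecker product of two matrices, used to lift a 1-qubit gate to 2 qubits."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]                      # Hadamard gate
I2 = [[1, 0], [0, 1]]                      # identity on one qubit
CX = [[1, 0, 0, 0],                        # CNOT, control = first qubit
      [0, 1, 0, 0],
      [0, 0, 0, 1],
      [0, 0, 1, 0]]

# The circuit becomes N = 2 matrices acting on the full register.
circuit = [kron(H, I2), CX]

state = [1, 0, 0, 0]                       # register initialised to |00>
for gate in circuit:
    state = matvec(gate, state)

if __name__ == "__main__":
    print(state)                           # amplitudes of |00>, |01>, |10>, |11>
```

Applying the matrices in order to the initial state yields the final amplitudes, here an equal superposition of |00> and |11>.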


Backend: Hardcoded matrices sequence

This backend creates hardware that applies the sequence of matrices to each state of the quantum register.

For each matrix operation a dedicated processor is used. Within the processor, the matrix elements of all the gates are hardcoded.








```
Command > bmhelper create --project_name Example
Command > cd Example
Command > source /tools/Xilinx/Vitis/2023.2/settings64.sh
Command > cat <<EOF > program.bmq
%block code1 .sequential
        qbits   s,a,b
                s,a,b
                a,b
                s,a
                a,b
                s,b
%endblock
%meta bmdef global main:code1
EOF
Command > cat <<EOF > local.mk
WORKING_DIR=working_dir
CURRENT_DIR=\$(shell pwd)
SOURCE_QUANTUM=program.bmq
QUANTUM_APP=working_dir/circuit.c
QUANTUM_ARGS=-build-matrix-seq-hardcoded -hw-flavor seq_hardcoded_complex -app-flavor cpp_opencl_complex -build-app -app-file \$(QUANTUM_APP) -emit-bmapi-maps -bmapi-maps-file bmapi.json
BOARD=alveou55c
SHOWARGS=-dot-detail 5
SHOWRENDERER=dot -Txlib
VERILOG_OPTIONS=-comment-verilog -bcof-file \$(WORKING_DIR)/bondmachine.bcof
BMREQS=\$(WORKING_DIR)/requirements.json
HWOPTIMIZATIONS=onlydestregs,onlysrcregs
BASM_ARGS=-disable-dynamical-matching -bo \$(WORKING_DIR)/bondmachine.bcof -chooser-min-word-size -chooser-force-same-name -dump-requirements \$(WORKING_DIR)/requirements.json
HDL_REGRESSION=bondmachine.sv
BM_REGRESSION=bondmachine.json
PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1
MAPFILE=alveou55c_maps.json
include bmapi.mk
include deploy.mk
EOF
Command > cat <<EOF > bmapi.mk
USE_BMAPI=yes
BMAPI_LANGUAGE=python
BMAPI_FLAVOR=axist
BMAPI_FLAVOR_VERSION=basic
BMAPI_MAPFILE=bmapi.json
BMAPI_LIBOUTDIR=working_dir/bmapi
BMAPI_MODOUTDIR=working_dir/rtl_bondmachine
BMAPI_FRAMEWORK=pynq
BMAPI_GENERATE_EXAMPLE=notebook.ipynb
EOF
Command > cat <<EOF > deploy.mk
DEPLOY_TYPE=local
DEPLOY_PATH=/home/mirko/alveoruns/\$(PROJECT_NAME)
DEPLOY_CLONE=/home/mirko/alveoruns/template
DEPLOY_APP=working_dir/circuit.c
DEPLOY_OVERRIDE=true
DEPLOY_BITTYPE=xclbin
EOF
```

```
Command > bmhelper validate
Output >
[ OK ] Workflow detected: quantum.
[ OK ] Mandatory variable found SOURCE_QUANTUM
[ OK ] Mandatory variable found WORKING_DIR
[ OK ] Mandatory variable found MAPFILE
[ OK ] Optional variable found: SHOWARGS
[ OK ] Source file program.bmq found
[ OK ] Found target board: alveou55c
[ OK ] Project has been successfully validate.
Command > bmhelper apply
Output >
[ OK ] Workflow detected: quantum.
[ OK ] Mandatory variable found SOURCE_QUANTUM
[ OK ] Mandatory variable found WORKING_DIR
[ OK ] Mandatory variable found MAPFILE
[ OK ] Optional variable found: SHOWARGS
[ OK ] Source file program.bmq found
[ OK ] Found target board: alveou55c
[ OK ] Project has been successfully initialized.
Command > make bondmachine
[Project: Example] - [Working directory creation begin] - [Target: working_dir]
mkdir -p working_dir
[Project: Example] - [Working directory creation end]
[Project: Example] - [BondMachine generation begin] - [Target: working_dir/bondmachine_target]
bmqsim -save-basm working_dir/bondmachine.basm -build-matrix-seq-hardcoded -hw-flavor seq_hardcoded_complex -app-flavor cpp_opencl_complex -build-app -app-file working_dir/circuit.c -emit-bmapi-maps -bmapi-maps-file bmapi.json program.bmq ; basm -disable-dynamical-matching -bo working_dir/bondmachine.bcof -chooser-min-word-size -chooser-force-same-name -dump-requirements working_dir/requirements.json -o working_dir/bondmachine.json working_dir/bondmachine.basm
[Project: Example] - [BondMachine generation end]
```

```
Command > make hdl
[Project: Example] - [HDL generation begin] - [Target: working_dir/hdl_target]
bondmachine -bondmachine-file working_dir/bondmachine.json -create-verilog -verilog-mapfile alveou55c_maps.json -verilog-flavor alveou55c -use-bmapi -bmapi-flavor axist -bmapi-language python -bmapi-mapfile bmapi.json -bmapi-liboutdir working_dir/bmapi -bmapi-framework pynq -bmapi-flavor-version basic -bmapi-modoutdir working_dir/rtl_bondmachine -bmapi-generate-example notebook.ipynb -comment-verilog -bcof-file working_dir/bondmachine.bcof -bmrequirements-file working_dir/requirements.json -hw-optimizations onlydestregs,onlysrcregs
echo > working_dir/bondmachine.sv
for i in `ls *.v | sort -d` ; do cat $i >> working_dir/bondmachine.sv ; done
rm -f *.v
echo > working_dir/bondmachine.vhd
for i in `ls *.vhd | sort -d` ; do cat $i >> working_dir/bondmachine.vhd ; done
ls: cannot access '*.vhd': No such file or directory
rm -f *.vhd
[Project: Example] - [HDL generation end]
Command > make xclbin
```

```
INFO: [v++ 60-1306] Additional information associated with this v++ package can be found at:
    Reports: /tmp/tmp577nekug/Example/working_dir/rtl_bondmachine/_x/reports/package
    Log files: /tmp/tmp577nekug/Example/working_dir/rtl_bondmachine/_x/logs/package
Running Dispatch Server on port: 46409
INFO: [v++ 60-1548] Creating build summary session with primary output /tmp/tmp577nekug/Example/working_dir/rtl_bondmachine/build_dir.hw.xilinx_u55c_gen3x16_xdma_3_202210_1/bondmachine.xclbin.package_summary, at Wed Jun 5 20:23:31 2024
INFO: [v++ 60-1315] Creating rulecheck session with output '/tmp/tmp577nekug/Example/working_dir/rtl_bondmachine/_x/reports/package/v++_package_bondmachine_guidance.html', at Wed Jun 5 20:23:31 2024
INFO: [v++ 60-895] Target platform: /opt/xilinx/platforms/xilinx_u55c_gen3x16_xdma_3_202210_1/xilinx_u55c_gen3x16_xdma_3_202210_1.xpfm
INFO: [v++ 60-1578] This platform contains Xilinx Shell Archive '/opt/xilinx/platforms/xilinx_u55c_gen3x16_xdma_3_202210_1/hw/hw.xsa'
INFO: [v++ 74-78] Compiler Version string: 2023.2
INFO: [v++ 60-2256] Packaging for hardware
INFO: [v++ 60-2460] Successfully copied a temporary xclbin to the output xclbin: /tmp/tmp577nekug/Example/working_dir/rtl_bondmachine/./build_dir.hw.xilinx_u55c_gen3x16_xdma_3_202210_1/bondmachine.xclbin
INFO: [v++ 60-2343] Use the vitis_analyzer tool to visualize and navigate the relevant reports. Run the following command.
    vitis_analyzer /tmp/tmp577nekug/Example/working_dir/rtl_bondmachine/build_dir.hw.xilinx_u55c_gen3x16_xdma_3_202210_1/bondmachine.xclbin.package_summary
INFO: [v++ 60-791] Total elapsed time: 0h 0m 6s
INFO: [v++ 60-1653] Closing dispatch client.
[Project: Example] - [Vivado toolchain - xclbin creation end]
Command > make deploy_xclbin
[Project: Example] - [BondMachine deploy xclbin begin] - [Target: deploy_xclbin]
[Project: Example] - [BondMachine deploy local]
if [ -d /home/mirko/alveoruns/Example ]; then rm -rf /home/mirko/alveoruns/Example; fi
if [ -d /home/mirko/alveoruns/template ]; then cp -a /home/mirko/alveoruns/template /home/mirko/alveoruns/Example; fi
cp working_dir/rtl_bondmachine/build_dir.hw.xilinx_u55c_gen3x16_xdma_3_202210_1/bondmachine.xclbin /home/mirko/alveoruns/Example/firmware.xclbin
cp working_dir/circuit.c /home/mirko/alveoruns/Example/
[Project: Example] - [BondMachine deploy xclbin end]
Command > bmqsim -software-simulation program.bmq
[{"Vector":[{"Real":0.49999997,"Imag":0},{"Real":0,"Imag":0},{"Real":0.49999997,"Imag":0},{"Real":0,"Imag":0},{"Real":0.49999997,"Imag":0},{"Real":0,"Imag":0},{"Real":0.49999997,"Imag":0},{"Real":0,"Imag":0}]}]
Command > cd /home/mirko/alveoruns/proj_alveou55c_teleport/
Command > source /opt/xilinx/xrt/setup.sh
```

Output > (abridged)

The setup script prints the updated environment:

- PATH: the Vitis, Vivado, Vitis HLS and Model Composer 2023.2 tool directories under /tools/Xilinx, followed by /opt/xilinx/xrt/bin and the usual system and user paths
- LD\_LIBRARY\_PATH: /opt/xilinx/xrt/lib
- PYTHONPATH: /opt/xilinx/xrt/python
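Before building, it can help to confirm from a notebook or script that the XRT directories really ended up in the environment. The following is only a minimal sketch: the helper name `missing_xrt_entries` is hypothetical, and the expected directories are simply the ones visible in the output above (they may differ on other installations).

```python
import os

# Directories the XRT setup script is expected to add.
# Assumption: these match the installation shown in the slides.
EXPECTED = {
    "PATH": "/opt/xilinx/xrt/bin",
    "LD_LIBRARY_PATH": "/opt/xilinx/xrt/lib",
    "PYTHONPATH": "/opt/xilinx/xrt/python",
}


def missing_xrt_entries(env):
    """Return the variable names whose value does not list the expected XRT directory."""
    missing = []
    for name, directory in EXPECTED.items():
        # Each variable is a list of directories joined by os.pathsep (":" on Linux).
        entries = env.get(name, "").split(os.pathsep)
        if directory not in entries:
            missing.append(name)
    return missing


# An empty list means the three variables all contain the XRT directories.
print(missing_xrt_entries(dict(os.environ)))
```

If the list is not empty, sourcing `/opt/xilinx/xrt/setup.sh` again in the same shell that launches the notebook is the first thing to try.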

Command >

make

g++ -o circuit /home/mirko/Tests/Vitis\_Accel\_Examples/common/includes/xcl2/xcl2.cpp circuit.c -I/opt/xilinx/xrt/include -I/tools/Xilinx/Vivado/2023.2/include -Wall -O0 -g -std=c++1y -I/home/mirko/Tests/Vitis\_Accel\_Examples/common/includes/xcl2 -fmessage-length=0 -L/opt/xilinx/xrt/lib -pthread -lOpenCL -lrt -lstdc++

Command >

./circuit

Output >

Found Platform Name: Xilinx

INFO: Reading firmware.xclbin

Loading: 'firmware.xclbin'

Device[0]: program successful!

1.000000 0.000000 0.000000 0.000000




The BondMachine is a software ecosystem for the dynamic generation, from several kinds of high-level sources, of computer architectures that can be synthesized on FPGAs and used:

- as standalone devices,
- as clustered devices,
- as firmware for computing accelerators.




- CCR: first ideas (2015), poster (2016), talks (2017, 2022, 2023)
- InnovateFPGA 2018 Iron Award, Grand Final at Intel Campus (CA), USA
- Invited lectures: "Advanced Workshop on Modern FPGA Based Technology for Scientific Computing", ICTP 2019 and 2022
- Invited lectures: "NiPS Summer School 2019"
- Golab 2018 talk
- Several other talks and posters: ISGC 2019, SOSC 2022 and 2023, INFN ML Hackathon 2022
- Article published in Parallel Computing, Elsevier, 2022




[Slide image: "The BondMachine Toolkit: Enabling Machine Learning on FPGA", Mirko Mariotti (Department of Physics and Geology, University of Perugia; INFN Perugia), NiPS Summer School 2019 "Architectures and Algorithms for Energy-Efficient IoT and HPC Applications", 3-6 September 2019, Perugia]






# Fabrics

The HDL code for the BondMachine has been tested on these devices/systems:

- Digilent Basys3 - Xilinx Artix-7 - Vivado
- Kintex7 Evaluation Board - Vivado
- Digilent Zedboard and ebaz4205 - Xilinx Zynq 7020 - Vivado
- ZC702 - Xilinx Zynq 7020 - Vivado
- Alveo boards - Xilinx - Vivado/Vitis
- Linux - Iverilog
- ice40lp1k icefun icebreaker icesugarnano - Lattice - Icestorm
- Terasic De10nano - Intel Cyclone V - Quartus
- Arrow Max1000 - Intel Max10 - Quartus

Within the project, other firmware has been written or tested:

- Microchip ENC28J60 Ethernet interface controller.
- Microchip ENC424J600 10/100 Base-T Ethernet interface controller.
- ESP8266 Wi-Fi chip.



Parallel Computing - https://doi.org/10.1016/j.parco.2021.102873

#### Use cases

Two use cases in Physics experiments are currently being developed:

- Real time pulse shape analysis in neutron detectors
  - bringing the intelligence to the edge
- Test beam for space experiments
  - increasing testbed operations efficiency

And not only in Physics:

- Machine learning accelerators
  - Ultra low latency inference
- Edge computing
  - Power efficiency for IoT
- Heterogeneous computing
- Exotic HW/SW/OS architectures
  - Research in innovative OS design




How to build a BondMachine with a close interaction with the host machine

A shell-like BM application from Jupyter

# Conclusions

The BondMachine is a new kind of computing device made possible in practice only by the emergence of re-programmable hardware technologies such as FPGAs.

The result of this process is a computer architecture that is no longer a static constraint within which computing occurs: its creation becomes part of the computing process, gaining computing power and flexibility.

On top of this abstraction it is possible to create a full computing ecosystem, ranging from small interconnected IoT devices to Machine Learning accelerators or Quantum Computing.



- Complete the inclusion of Intel and Lattice FPGAs
- FPGA to FPGA communication
- First steps in the direction of a full OS



Different data types and operations, especially low and trans-precision

Support for different boards, especially data center accelerators

Comparison with GPUs



More datasets: test on other datasets with more features and multiclass classification

Neurons: increase the library of neurons to support other activation functions

**Evaluate results**: compare the results obtained with other technologies (CPU and GPU) in terms of inference speed and energy efficiency



Backends: support for different quantum backends

Symbolic backend: full integration of the symbolic backend

Optimization: include optimization techniques for quantum circuits

# Future Work

Machine Learning

- More tests and work on numerical precision
  - add more numeric types and try more numerical precisions
  - try more quantization techniques
  - improve fixed point precision
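To make the precision trade-off behind these items concrete, here is a minimal, illustrative sketch of signed fixed-point quantization in Python. It is not the BondMachine number format or toolchain: the helper names, the 16-bit word size and the sample weights are assumptions chosen only for the example.

```python
def to_fixed(x, frac_bits, total_bits=16):
    """Quantize x to a signed fixed-point integer code, saturating on overflow."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(x * scale)  # round to the nearest representable step
    return max(lowest, min(highest, code))


def from_fixed(code, frac_bits):
    """Convert a fixed-point integer code back to a float."""
    return code / (1 << frac_bits)


def max_abs_error(values, frac_bits, total_bits=16):
    """Largest round-trip error over a list of values."""
    return max(
        abs(v - from_fixed(to_fixed(v, frac_bits, total_bits), frac_bits))
        for v in values
    )


# Hypothetical weights, used only to show how the error shrinks with more fractional bits.
weights = [0.5, -0.3, 0.125, 0.9]
for frac_bits in (4, 8, 12):
    print(frac_bits, max_abs_error(weights, frac_bits))
```

As long as no value saturates, the round-trip error is bounded by half of the least significant step, that is 2^-(frac_bits+1). More fractional bits reduce the error, but in a fixed word size they also reduce the representable range.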

Consolidate the work done and improve portability

- extend the automation and finalize the implementation on Alveo
- make everything adaptable for FPGA clusters (BM is a multi-fpga system)
- support more boards to spread our solution
- test our solutions on ICSC resources
- New estimates on energy consumption
  - Move from software energy estimates to real energy measurements
- For the cloud service implementation...
  - leveraging the kserve extension also for use cases beyond inference
  - FPGA bookkeeping
  - Systematic measurements of performance at the various stages of the chain



Major rewrite of the compiler to include more data structures and use the new assembler (BASM)

Assembler improvements, fragment optimization and other advanced features

Improve the networking, including new kinds of interconnection firmware






- website: https://www.bondmachine.it
- code: https://github.com/BondMachineHQ
- parallel computing paper: link
- contact email: mirko.mariotti@unipg.it
