

### Energy-Efficient FPGA Solutions for Large-Scale FFTs and Non-Uniform FFTs

A Software-Hardware Co-Design Approach for Radio Interferometry

<u>Rubén Rodríguez Álvarez</u>, Denisa Constantinescu, Miguel Peón Quirós, Adrien Devresse, Hamza Chouh, Shreyam Krishna, Etienne Orliac, David Atienza

EPFL - Embedded Systems Laboratory, EcoCloud, SCITAS

ruben.rodriguezalvarez@epfl.ch, denisa.constantinescu@epfl.ch



## **SEAMS Project**



Pipelines Profiling & Specification

Single-Node SW/HW co-design

Multi-Node Scale-up

Integration and Testing

**On-Field Demonstrator** 



Scope

- **Energy-efficient** computing with domain-specific accelerators
- Multi-scale hardware-software co-design approach

#### **Group Members**

### France

- INSA Rennes: Jean-F. Nezan, Mickaël Dardaillon, Hugo Miomandre, Jacques Morin
- OCA: Shan Mignot, Alain Miniussi, Chiara Ferrari, André Ferrari
- OP: Damien Gratadour

### Switzerland

- EcoCloud: Miguel Peón Quirós, David Atienza
- ESL: Denisa Constantinescu, Rubén R. Álvarez, Basile Darne, David Atienza
- SCITAS: Adrien Devresse, Hamza Chouh, Etienne Orliac, Gilles Fourestey

#### Partners: EPFL, Laboratory of Astrophysics, MeerKAT, SKACH, SKAO





















## Synthesis (NUFFTs)







- Kashani et al. "HVOX: Scalable Interferometric Synthesis and Analysis of Spherical Sky Maps." (2023).
- Tolley et al. "BIPP: An efficient HPC implementation of the Bluebild algorithm for radio astronomy." (2023).
- Corda et al. "Near memory acceleration on high resolution radio astronomy imaging." MECO. IEEE, (2020).

### How do these algorithms map onto an FPGA?






### FINUFFT Synthesis





### 3D FFT takes 40%-90% of the computation








| Characteristics              | Agilex 7 M-Series Dev Kit                 | Alveo V80 Card          |
|------------------------------|-------------------------------------------|-------------------------|
| Internal memory              | 370Mb BRAM                                | 132Mb BRAM + 541Mb URAM |
| High Bandwidth Memory (HBM2e) | <b>32GB @ 1TB/s</b>                      | 32GB @ 810GB/s          |
| Compute Elements             | 3.9M LEs + 12.3K DSPs + 1.3M ALMs         | 2.6M LUTs + 10.8K DSPs  |
| Max Power (TDP)              | (2x) 240 Watts                            | 190 Watts               |
| Global Memory (DDR4/5)       | <b>64</b> GB                              | <b>32</b> GB            |
| Comms                        | 16x PCIe 5, CXL, GbE 116Gbps, fiber optic | 2x PCIe 5               |
| Technology                   | 7nm Intel                                 | 7nm TSMC                |
| Max Clock Freq               | 500MHz-1GHz                               | 600MHz-1GHz             |








## High-Level Synthesis (HLS) for FPGAs





### **Characteristics:**

- Mixed precision data types
- Parallel, pipelined, and serial execution
- Resource constraints
- Code breakdown
- Highly parametrizable

We teach HLS and co-design, and have used it to accelerate CNNs and genome alignment applications.

HLS is a good fit for changing SW, portable HW, and design explorations

| Characteristics            | HLS FPGA | CUDA GPU   |
|----------------------------|----------|------------|
| Programming support        | High     | High       |
| Productivity (design time) | Medium   | High       |
| Energy Efficiency          | High     | Low-Medium |
| Latency                    | Medium   | Low        |
| Scalability                | High     | High       |
| Flexibility                | High     | Limited    |












## FFT HW Design and Exploration

Design parameters explored:

- Number of parallel FFTs
- FFT stages
- Data format
- Transpose buffer
- Max FFT size









Consecutive transfers to memory take less time and energy
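The role of a transpose step in keeping memory accesses consecutive can be sketched in software. The following is a minimal pure-Python illustration, not the FPGA design itself: a 2D FFT is computed as 1D FFTs over rows, a transpose, 1D FFTs over the new rows, and a transpose back, so every 1D FFT reads one contiguous row. All function names (`fft1d`, `transpose`, `fft2_row_column`, `dft2_direct`) are hypothetical, and the recursive radix-2 FFT assumes power-of-two lengths.

```python
import cmath

def fft1d(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft1d(values[0::2])
    odd = fft1d(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def transpose(matrix):
    """Software stand-in for a transpose buffer: columns become rows."""
    return [list(row) for row in zip(*matrix)]

def fft2_row_column(matrix):
    """2D FFT as row FFTs, transpose, row FFTs, transpose back."""
    rows_done = [fft1d(row) for row in matrix]
    cols_as_rows = transpose(rows_done)
    cols_done = [fft1d(row) for row in cols_as_rows]
    return transpose(cols_done)

def dft2_direct(matrix):
    """Direct 2D DFT, used only as a slow reference for checking."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0j] * n_cols for _ in range(n_rows)]
    for u in range(n_rows):
        for v in range(n_cols):
            out[u][v] = sum(
                matrix[r][c]
                * cmath.exp(-2j * cmath.pi * (u * r / n_rows + v * c / n_cols))
                for r in range(n_rows)
                for c in range(n_cols)
            )
    return out

# Small 4x8 test grid with arbitrary complex values.
grid = [[complex(r + 1, c - r) for c in range(8)] for r in range(4)]
fast = fft2_row_column(grid)
reference = dft2_direct(grid)
max_error = max(
    abs(fast[u][v] - reference[u][v]) for u in range(4) for v in range(8)
)
print(f"max deviation from direct DFT: {max_error:.2e}")
```

The same decomposition extends to 3D by adding a third pass; the sketch only shows why a transpose lets each pass stream whole rows instead of striding through memory.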









## Precision in FINUFFT Synthesis (BIPP)



Sample data extracted from bipp execution, simulated with OSKAR for SKA-Low configuration



### Precision of Floating-Point Formats




- Sample Data
- Undesired Precision
- Requirement Range
- Valid Precision
- Real data
- half (FP16)
- float (FP32)
- Custom FP40
- Custom FP42
- double (FP64)
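How the relative precision of a floating-point format follows from its mantissa width can be sketched as below. This is an illustrative model, not the evaluation behind the plot: `round_to_mantissa` is a hypothetical helper that rounds a double to a given number of explicit fraction bits and ignores exponent range. The widths used for half, float, and double (10, 23, and 52 fraction bits) are the IEEE 754 values. The bit split of the custom FP40 and FP42 formats is not given here, so they are left out of the example; they would plug in with their own fraction widths.

```python
import math

def round_to_mantissa(value: float, fraction_bits: int) -> float:
    """Round a double to `fraction_bits` explicit fraction bits.

    Only the significand is shortened; exponent range (overflow, subnormals)
    is deliberately ignored in this sketch.
    """
    if value == 0.0:
        return 0.0
    # value = fraction * 2**exponent with 0.5 <= |fraction| < 1
    fraction, exponent = math.frexp(value)
    # +1 accounts for the implicit leading bit of the significand.
    scale = 2.0 ** (fraction_bits + 1)
    return math.ldexp(round(fraction * scale) / scale, exponent)

# IEEE 754 fraction widths; custom formats would add their own entries.
FRACTION_BITS = {"half (FP16)": 10, "float (FP32)": 23, "double (FP64)": 52}

sample = math.pi * 1.0e-3  # arbitrary sample value
for name, bits in FRACTION_BITS.items():
    rounded = round_to_mantissa(sample, bits)
    relative_error = abs(rounded - sample) / abs(sample)
    bound = 2.0 ** -(bits + 1)  # unit roundoff for round-to-nearest
    print(f"{name:14s} relative error {relative_error:.3e} (bound {bound:.3e})")
```

Each extra fraction bit halves the worst-case relative error, which is why intermediate widths between FP32 and FP64 can be enough when the required precision range falls between the two.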

### For 32x8196x8196:
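As a rough companion to this problem size, the memory footprint of the grid under each listed data format can be estimated with simple arithmetic. The sketch below assumes complex samples stored as two densely packed components of the given width, with no padding or alignment overhead; those storage assumptions and the helper name `grid_bytes` are illustrative and not taken from the slides.

```python
GRID_SHAPE = (32, 8196, 8196)  # grid size quoted in the heading above

# Total bits per real component, taken from the format names.
FORMAT_BITS = {
    "half (FP16)": 16,
    "float (FP32)": 32,
    "Custom FP40": 40,
    "Custom FP42": 42,
    "double (FP64)": 64,
}

def grid_bytes(shape, bits_per_component, components=2):
    """Bytes needed for a densely packed grid (assumption: no padding)."""
    elements = 1
    for extent in shape:
        elements *= extent
    total_bits = elements * components * bits_per_component
    return -(-total_bits // 8)  # ceiling division to whole bytes

for name, bits in FORMAT_BITS.items():
    size_gb = grid_bytes(GRID_SHAPE, bits) / 1e9
    print(f"{name:14s} {size_gb:6.2f} GB")
```

Comparing these figures with the HBM capacities in the board table gives a first idea of which formats let the whole grid stay in high-bandwidth memory.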






### Precision of Fixed-Point Formats
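The trade-off that fixed-point formats make between dynamic range and resolution can be sketched as follows. This is a minimal model under stated assumptions: a signed two's-complement format with one sign bit, round-to-nearest, and saturation on overflow. The helper `to_fixed_point` and the 7.8 bit split used in the example are hypothetical and do not describe the formats evaluated in the plot.

```python
def to_fixed_point(value: float, integer_bits: int, fraction_bits: int) -> float:
    """Quantize to signed fixed point with saturation.

    Layout assumed: 1 sign bit + integer_bits + fraction_bits.
    Returns the quantized value converted back to float.
    """
    scale = 1 << fraction_bits
    magnitude_bits = integer_bits + fraction_bits
    max_code = (1 << magnitude_bits) - 1
    min_code = -(1 << magnitude_bits)
    code = round(value * scale)               # round to nearest step
    code = max(min_code, min(max_code, code))  # saturate on overflow
    return code / scale

# Hypothetical 16-bit format: 1 sign bit, 7 integer bits, 8 fraction bits.
INTEGER_BITS, FRACTION_BITS = 7, 8
for sample in (1.0e-6, 0.3, 0.5, 1000.0):
    quantized = to_fixed_point(sample, INTEGER_BITS, FRACTION_BITS)
    print(f"{sample:12.6g} -> {quantized:12.8g} (abs error {abs(quantized - sample):.3g})")
```

Unlike floating point, the absolute step is constant: values far below one step vanish and values beyond the integer range saturate, so the dynamic range of the data at each stage determines how the bits must be split.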





## Conclusion & Follow Up



### Done

- Deploy flexible algorithms using FPGAs
- Accelerate kernels with an FPGA
- Explore the Design Space
- Share resources among different kernels

### **Ongoing Exploration**

- FPGAs reduce energy consumption
- FPGAs match (or even exceed) the performance of GPUs
- Custom precision data formats are beneficial
- Solve memory-bound workloads on FPGAs
- Reconfigure the FPGA at run-time for dynamic workloads

### **Inputs Needed**

- Other algorithms to accelerate (e.g., ML)
- Dynamic range of real data (at different stages)
- Precision & latency requirements for different use cases
- Precision metrics (e.g., SNR)
- Scalability of algorithms






# Thank you!

### Ruben

EPFL - Embedded Systems Laboratory

ruben.rodriguezalvarez@epfl.ch, denisa.constantinescu@epfl.ch



# Backup Slides





## Types of NUFFT Algorithms



| Method            | Spread                                                       | FFT                                   | Interpolation                                                |
|-------------------|--------------------------------------------------------------|---------------------------------------|--------------------------------------------------------------|
| NUFFT<sub>1</sub> | $N_{\rm vis} \lvert \log \epsilon \rvert^2$                  | $N_{\rm pix} \log N_{\rm pix}$        | $N_{\rm pix}$                                                |
| W-gridding        | $N_{\rm vis} \lvert \log \epsilon \rvert^3$                  | $N_{w'} N_{\rm pix} \log N_{\rm pix}$ | $N_{w'} N_{\rm pix}$                                         |
| NUFFT<sub>3</sub> | $N_{\rm vis} \lvert \log \epsilon \rvert^3 + N_{\rm mesh}$   | $N_{\rm mesh} \log N_{\rm mesh}$      | $N_{\rm pix} \lvert \log \epsilon \rvert^3 + N_{\rm pix}$    |
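The scaling terms in the table can be evaluated side by side to see which stage dominates for a given problem. The sketch below is only a relative comparison under stated assumptions: constant factors are dropped, the natural logarithm is used throughout, and the function name `nufft_cost_terms`, its parameters, and the example sizes are all hypothetical rather than measured values.

```python
import math

def nufft_cost_terms(method, n_vis, n_pix, eps, n_w=1, n_mesh=None):
    """Asymptotic cost terms per stage, without constant factors.

    n_w stands for the number of w-planes (N_w') and n_mesh for the
    intermediate mesh size; both are inputs the caller must supply.
    """
    log_eps = abs(math.log(eps))
    if method == "nufft1":
        return {
            "spread": n_vis * log_eps**2,
            "fft": n_pix * math.log(n_pix),
            "interpolation": n_pix,
        }
    if method == "w-gridding":
        return {
            "spread": n_vis * log_eps**3,
            "fft": n_w * n_pix * math.log(n_pix),
            "interpolation": n_w * n_pix,
        }
    if method == "nufft3":
        if n_mesh is None:
            raise ValueError("nufft3 needs n_mesh")
        return {
            "spread": n_vis * log_eps**3 + n_mesh,
            "fft": n_mesh * math.log(n_mesh),
            "interpolation": n_pix * log_eps**3 + n_pix,
        }
    raise ValueError(f"unknown method: {method}")

# Hypothetical problem sizes, chosen only to exercise the formulas.
example = dict(n_vis=1_000_000, n_pix=1024 * 1024, eps=1e-4)
for method, extra in (
    ("nufft1", {}),
    ("w-gridding", {"n_w": 16}),
    ("nufft3", {"n_mesh": 4 * 1024 * 1024}),
):
    terms = nufft_cost_terms(method, **example, **extra)
    total = sum(terms.values())
    shares = ", ".join(f"{k} {100 * v / total:.0f}%" for k, v in terms.items())
    print(f"{method:11s} {shares}")
```

Because constants are omitted, the printed shares indicate trends rather than runtimes; profiling the actual pipeline remains the reference.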



Kashani, Sepand, et al. "HVOX: Scalable Interferometric Synthesis and Analysis of Spherical Sky Maps." *arXiv preprint arXiv:2306.06007* (2023).