# **Fast W-Projection Gridding on GPUs**

John W. Romein

Stichting ASTRON (Netherlands Institute for Radio Astronomy) Dwingeloo, the Netherlands



















# **Two Problems**



# lots of FLOPS add to memory: slow!

grid (~4096x4096)



CALIM '11



# **Two Solutions**





CALIM '11





- GPU introduction
- W-projection gridding on GPUs
- performance results















### **GPUs**

- powerful compute device
- highly parallel

on for Scientific Research

- device memory
  - "limited" bandwidth



|                  | CPU <i>(E5620)</i> | GPU <i>(GTX 580)</i> | GPU / CPU |
|------------------|--------------------|----------------------|-----------|
| cores            | 4                  | 512                  | 128       |
| threads          | 8                  | 16,384               | 2,048     |
| vector length    | 4                  |                      |           |
| GFLOPS           | 76.8               | 1,581                | 20.6      |
| memory BW (GB/s) | 25.6               | 192.4                | 7.52      |
| TDP (W)          | 80                 | 244                  | 3.05      |
|                  |                    |                      |           |
| WO C             | ALIM '11           | July 25-29, 2011     | 7 AST R   |

# **GPU Compute Model**

#### model:

- move data CPU → GPU
- run kernel on GPU
- move result GPU → CPU
- PCIe often bottleneck

### overlap computations and communication







# **GPU Features**

core hierarchy: 16 multi-processors (SMs) of 32 cores

- SMs independent
- cores in SM cooperate
  - SIMD
  - coalescing
  - □ latency hiding:  $\leq$  32 threads/core

#### textures

efficient 2D/3D caching

**CALIM** '11

interpolation (indexed by floating point number)







# **GPU Memories**



**CALIM** '11

July 25-29, 2011

**AST**(RON

10



# **GPU Programming**

CPU code in C/C++
GPU code in CUDA or OpenCL







# **GPU Languages**

### OpenCL

- Nvidia, AMD, ...
- CPU side: C horrible, C++ very pleasant

CUDA

- Nvidia only
- better support for latest GPU features
- 2%~20% faster
- matured more







# **CUDA Example**

```
device float array[1024];
 global void zero_array()
  array[threadIdx.x] = 0;
}
int main()
{
  zero_array<<<1, 1024>>>();
  return 0;
```



CALIM '11



# **Back To Gridding**







# **W-Projection Gridding**



NW

**CALIM** '11

# Where Is The Data?





conv. matrices: texture

grid: device memory

□ vis. + (u,v,w): shared memory



CALIM '11



### **Placement Movement**





### (u,v,w) changes <u>slowly</u>

grid locality

per baseline:



CALIM '11



# **Use Locality**





- reduce #memory accesses
- X: one thread
- accumulate additions in register
- until conv. matrix slides off



CALIM '11



### But How ???



1 thread / grid point



- which visibilities contribute?
- severe load imbalance



CALIM '11







#### conceptual blocks of conv. matrix size



CALIM '11







# 1 thread monitors all X at any time: conv. matrix covers <u>1</u> X!!!



CALIM '11







- thread computes current:
   X grid point
  - X conv. matrix entry



CALIM '11





July 25-29, 2011

23 AST(RON

#### (atomically) adds data if moved to another X

**CALIM** '11







CALIM '11



# (Dis)Advantages





CALIM '11



# **Work Distribution**

- baselines: spread over SMs
- times: threads in SM
- frequencies: threads in SM
- polarizations: single thread







## **Performance Measurements**







# **Performance Tests Setup**

| #stations         | 44        |  |
|-------------------|-----------|--|
| #channels         | 16        |  |
| integration time  | 10 s      |  |
| observation time  | 6 h       |  |
| conv. matrix size | ≤ 128x128 |  |
| oversampling      | 8x8       |  |
| #W-planes         | 128       |  |
| grid size         | 4096x4096 |  |

 $\Box$  (*u*,*v*,*w*) from real observation (6 hour)







# **CUDA Performance**



**CALIM** '11





### **#Threads**



#### 128x128 conv. matrix



#### CALIM '11



# **OpenCL Performance**

31

July 25-29, 2011

AS

#### language bit restrictive

- no 1D textures
- no atomic add -> use atomic cmpxchg
- Nvidia GTX 580
  - 18% slower than CUDA
  - multi-GPU/host-threads issues
- AMD HD 6970
  - limited grid size (2048 x 2048)
  - 13-163x slower than GTX 580!

**CALIM** '11

atomic ops slow



# **Multi-GPU Scaling**

### eight Nvidia GTX 580s





- 131,072 threads!
- scales perfectly
- 296x faster than dual CPU



**CALIM** '11



# **Green Computing**



28x more energy efficient than dual CPU

**CALIM '11** 

AST(RON

33



### **Comparison With Other GPU Gridders**

#### van Amesfoort et. al. [CF'09]

- private grid per block -> very small grids
- 3.5~6.5 x (compensated for faster hardware)
- □ MWA gridder (Edgar et. al. [CPC'11])
  - search visibilities that potentially add to grid point
     6.1~8.0 x

July 25-29, 2011

34

- Humphreys & Cornwell [SKA memo 132, '11]
  - adds directly to grid in memory

**CALIM** '11

□ 8.5~10.3 x



# **Future Work**

#### work in progress

- performance counters
- use hardware interpolation instead of oversampling/W-planes
- LOFAR gridder
  - combine with A-projection







# Conclusions

- efficient GPU gridding algorithm
  - minimize memory accesses
- CUDA more mature than OpenCL
- 6~10x faster than other gridders
- 37x faster than dual CPU
  - scales perfectly on 8 GPUs
  - energy efficient





