4 CUDA Fluids Implementation
The main changes needed to port the CPU implementation to CUDA are described in this section. First, we present alternatives to the CPU interpolation and to the fftw library. Second, we explain how to access an OpenGL vertex buffer from a kernel. After that, we need to allocate memory on the GPU device that the kernels will use to compute the simulation step. Finally, we have to decide which thread layout we will use.
4.1 Library Alternatives
4.1.1 cuFFT
cuFFT is a CUDA library that provides the same functionality that fftw provides. cuFFT was designed to replicate the original fftw design; even the memory layout can be set to an FFTW-compatible mode, as the code below shows.
// TODO: update kernels to use the new unpadded memory layout for perf
// rather than the old FFTW-compatible layout
cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_FFTW_PADDING);
cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_FFTW_PADDING);
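The plan handles planr2c and planc2r used above have to be created before they can be configured. The following is a minimal sketch of that creation, assuming a square DIM x DIM grid like the one used throughout the simulation:
// create the forward (real-to-complex) and inverse (complex-to-real) plans
cufftHandle planr2c, planc2r;
cufftPlan2d(&planr2c, DIM, DIM, CUFFT_R2C);
cufftPlan2d(&planc2r, DIM, DIM, CUFFT_C2R);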
This has one drawback: the layout will not follow the high-performance design we previously thought could be achieved with cudaMallocPitch. With cuFFT we can easily transform between real and complex data:
// real to complex (forward FFT)
cufftExecR2C(planr2c, (cufftReal *)vx, (cufftComplex *)vx);
cufftExecR2C(planr2c, (cufftReal *)vy, (cufftComplex *)vy);
// complex to real (inverse FFT)
cufftExecC2R(planc2r, (cufftComplex *)vx, (cufftReal *)vx);
cufftExecC2R(planc2r, (cufftComplex *)vy, (cufftReal *)vy);
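Like fftw, cuFFT performs unnormalized transforms, so a forward FFT followed by an inverse FFT scales every sample by the number of transformed elements. A minimal, hypothetical sketch of the normalization that has to be applied somewhere after the inverse transform (v stands for a velocity sample read back from vx or vy):
// undo the scaling introduced by the forward + inverse FFT round trip
float scale = 1.0f / ((float)DIM * DIM);
v.x *= scale;
v.y *= scale;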
4.1.2 Textures
When accessing data stored in a texture through floating-point indices, the returned values are implicitly interpolated if the texture uses a linear filter. For this reason, we will bind a texture (with a linear filter) to dvfield and read dvfield values through the texture instead of interpolating them manually. It is important to note that, because of how textures are designed to work with pixels (each sample is positioned at the exact center of its corresponding pixel), addressing the texture at, for example, (0.5, 0.5) does not interpolate between the vectors stored at (0,0), (0,1), (1,0) and (1,1); it simply returns the velocity stored at the sample positioned at (0,0). In other words, we need to add 0.5 to both indices to get the interpolated value we were looking for. The manual interpolation of the CPU version, originally performed in the advection step, can be changed to:
// first of all, update texture with dvfield velocity values
updateTexture(...);
// note how 0.5 is added to each calculated array index (gtidx and fi)
ploc.x = (gtidx + 0.5f) - (dt * vterm.x * dx);
ploc.y = (fi + 0.5f) - (dt * vterm.y * dy);
// vterm will hold the interpolated result
vterm = tex2D(texref, ploc.x, ploc.y);
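For tex2D to return interpolated values, the texture must be configured with a linear filter before it is sampled. The following is a minimal sketch of that setup, assuming the legacy texture reference API that matches the tex2D call above and a cudaArray that updateTexture copies dvfield into (the names texref, array and setupTexture are assumptions):
// texture reference used by the advection kernel
texture<float2, 2> texref;
static cudaArray *array = NULL;

void setupTexture(int dimx, int dimy)
{
    // enable hardware bilinear interpolation when sampling the texture
    texref.filterMode = cudaFilterModeLinear;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    cudaMallocArray(&array, &desc, dimx, dimy);
    cudaBindTextureToArray(texref, array);
}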
4.2 Memory Allocation
Kernels need access to the velocity field memory, and kernels executed on a GPU device only have access to device memory. In the CPU version, we allocated the velocity field in hvfield with malloc. In the GPU version, we need to allocate the velocity field (dvfield, the device velocity field) on the GPU device with a CUDA function. For best performance, we allocate dvfield with cudaMallocPitch.
// Allocate and initialize device data
cudaMallocPitch((void **)&dvfield, &tPitch, sizeof(cData)*DIM, DIM);
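cudaMallocPitch may pad each row of dvfield so that rows start at properly aligned addresses, which is where the performance benefit comes from. Inside a kernel, rows therefore have to be located through the returned pitch (tPitch) rather than the logical width. A minimal sketch of that indexing, with fj and gtidx as hypothetical row and column indices:
// locate row fj in the pitched allocation, then index column gtidx
cData *row = (cData *)((char *)dvfield + fj * tPitch);
cData v = row[gtidx];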
The FFTW-compatible memory layout requires the temporary complex fields to be allocated with an ordinary cudaMalloc. We will index these arrays the same way we did in the CPU version, using the pdx variable for the calculated padding.
// Temporary complex velocity field data
cudaMalloc((void **)&vxfield, sizeof(cData) * PDS);
cudaMalloc((void **)&vyfield, sizeof(cData) * PDS);
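Indexing these arrays then works exactly as in the CPU version: the row stride is the padded width pdx instead of DIM. A minimal sketch, with gtidx and fj as hypothetical column and row indices computed by a kernel:
// row-major index into the FFTW-style padded layout
int fp = fj * pdx + gtidx;
cData vx_sample = vxfield[fp]; // complex x-velocity at grid cell (gtidx, fj)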
4.3 Vertex Buffer Mapping
CUDA provides handy functions to access the OpenGL vertex buffer object (vbo), making it possible to map it to a (device) pointer just as we did in the CPU version.
struct cudaGraphicsResource *cuda_vbo_resource; // handle to the registered vbo
cData *p;          // mapped vbo pointer
size_t num_bytes;  // size of the mapped region, filled in by CUDA
cudaGraphicsMapResources(1, &cuda_vbo_resource, 0);
cudaGraphicsResourceGetMappedPointer((void **)&p,
                                     &num_bytes, cuda_vbo_resource);
// advect particles using the already mapped pointer
advectParticles_k<<<grid, tids>>>(p, ...);
// don't forget to unmap!
cudaGraphicsUnmapResources(1, &cuda_vbo_resource, 0);
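Before the resource can be mapped at all, the OpenGL buffer object has to be registered with CUDA once, typically right after it is created. A minimal sketch, assuming vbo is the GLuint returned by glGenBuffers:
// register the OpenGL vertex buffer object with CUDA (done once at startup)
cudaGraphicsGLRegisterBuffer(&cuda_vbo_resource, vbo, cudaGraphicsMapFlagsNone);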