Computer lessons


The nvcc compiler translates host code (the main, controlling code) and device code (GPU code) from files with the .cu extension into object files suitable for assembling the final program or library in any development environment, for example NetBeans.

The CUDA architecture uses a grid memory model, cluster-based thread modeling, and SIMD instructions. It is applicable not only to high-performance graphics computing, but also to a variety of scientific computing on nVidia video cards. Scientists and researchers use CUDA widely in fields including astrophysics, computational biology and chemistry, fluid dynamics modeling, electromagnetic interactions, computed tomography, seismic analysis, and more. CUDA can interoperate with applications that use OpenGL and Direct3D. CUDA is cross-platform software for operating systems such as Linux, Mac OS X and Windows.

On March 22, 2010, nVidia released CUDA Toolkit 3.0, which contained support for OpenCL.

Equipment

The CUDA platform first appeared on the market with the release of the eighth generation NVIDIA G80 chip and became present in all subsequent series of graphics chips that are used in the GeForce, Quadro and NVidia Tesla accelerator families.

The first series of hardware to support the CUDA SDK, the G8x, had a 32-bit single-precision vector processor and used the CUDA SDK as its API (CUDA supports the C double type, but on this hardware its precision is reduced to 32-bit floating point). The later GT200 processors added support for 64-bit precision (in the SFUs only), but its performance is significantly worse than for 32-bit precision (because there are only two such units per streaming multiprocessor, versus eight scalar processors). The GPU provides hardware multithreading, which makes it possible to use all of the GPU's resources. This opens up the prospect of shifting the work of a physics accelerator onto the graphics accelerator (an example implementation is nVidia PhysX), as well as broad possibilities for using graphics hardware for complex non-graphical calculations, for example in computational biology and other branches of science.

Advantages

Compared to the traditional approach of organizing general-purpose computing through graphics APIs, the CUDA architecture offers a number of advantages in this area.

Restrictions

  • Functions executed on the device originally did not support recursion (CUDA Toolkit 3.1 added support for pointers and recursion) and are subject to some other limitations

Supported GPUs and graphics accelerators

The list of devices from equipment manufacturer Nvidia with declared full support for CUDA technology is provided on the official Nvidia website: CUDA-Enabled GPU Products (English).

In fact, the following devices on the PC hardware market currently support CUDA technology:

Specification version, GPUs, and video cards:
  • 1.0 (G80, G92, G92b, G94, G94b): GeForce 8800GTX/Ultra, 9400GT, 9600GT, 9800GT, Tesla C/D/S870, FX4/5600, 360M, GT 420
  • 1.1 (G86, G84, G98, G96, G96b, G94, G94b, G92, G92b): GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600 GSO, 9800GTX/GX2, GTS 250, GT 120/30/40, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
  • 1.2 (GT218, GT216, GT215): GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
  • 1.3 (GT200, GT200b): GeForce GTX 260, GTX 275, GTX 280, GTX 285, GTX 295, Tesla C/M1060, S1070, Quadro CX, FX 3/4/5800
  • 2.0 (GF100, GF110): GeForce (GF100) GTX 465, GTX 470, GTX 480, Tesla C2050, C2070, S/M2050/70, Quadro Plex 7000, Quadro 4000, 5000, 6000; GeForce (GF110) GTX 560 Ti 448, GTX 570, GTX 580
  • 2.1 (GF104, GF114, GF116, GF108, GF106): GeForce 610M, GT 430, GT 440, GTS 450, GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti, 500M, Quadro 600, 2000
  • 3.0 (GK104, GK106, GK107): GeForce GTX 690, GTX 680, GTX 670, GTX 660 Ti, GTX 660, GTX 650 Ti, GTX 650, GT 640, GTX 680MX, GTX 680M, GTX 675MX, GTX 670MX, GTX 660M, GT 650M, GT 645M, GT 640M
  • 3.5 (GK110)
Nvidia GeForce for desktop computers
GeForce GTX 590
GeForce GTX 580
GeForce GTX 570
GeForce GTX 560 Ti
GeForce GTX 560
GeForce GTX 550 Ti
GeForce GTX 520
GeForce GTX 480
GeForce GTX 470
GeForce GTX 465
GeForce GTX 460
GeForce GTS 450
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GT 240
GeForce GT 220
GeForce 210
GeForce GTS 150
GeForce GT 130
GeForce GT 120
GeForce G100
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 8800 GTS 512
GeForce 8800 GT
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8500 GT
GeForce 8400GS
Nvidia GeForce for mobile computers
GeForce GTX 580M
GeForce GTX 570M
GeForce GTX 560M
GeForce GT 555M
GeForce GT 540M
GeForce GT 525M
GeForce GT 520M
GeForce GTX 485M
GeForce GTX 480M
GeForce GTX 470M
GeForce GTX 460M
GeForce GT 445M
GeForce GT 435M
GeForce GT 425M
GeForce GT 420M
GeForce GT 415M
GeForce GTX 285M
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 360M
GeForce GTS 350M
GeForce GTS 160M
GeForce GTS 150M
GeForce GT 335M
GeForce GT 330M
GeForce GT 325M
GeForce GT 240M
GeForce GT 130M
GeForce G210M
GeForce G110M
GeForce G105M
GeForce 310M
GeForce 305M
GeForce 9800M GTX
GeForce 9800M GT
GeForce 9800M GTS
GeForce 9700M GTS
GeForce 9700M GT
GeForce 9650MGS
GeForce 9600M GT
GeForce 9600MGS
GeForce 9500MGS
GeForce 9500M G
GeForce 9300MGS
GeForce 9300M G
GeForce 9200MGS
GeForce 9100M G
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600MGS
GeForce 8400M GT
GeForce 8400MGS
Nvidia Tesla *
Tesla C2050/C2070
Tesla M2050/M2070/M2090
Tesla S2050
Tesla S1070
Tesla M1060
Tesla C1060
Tesla C870
Tesla D870
Tesla S870
Nvidia Quadro for desktop computers
Quadro 6000
Quadro 5000
Quadro 4000
Quadro 2000
Quadro 600
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3700
Quadro FX 1700
Quadro FX 570
Quadro FX 470
Quadro FX 380 Low Profile
Quadro FX 370
Quadro FX 370 Low Profile
Quadro CX
Quadro NVS 450
Quadro NVS 420
Quadro NVS 290
Quadro Plex 2100 D4
Quadro Plex 2200 D2
Quadro Plex 2100 S4
Quadro Plex 1000 Model IV
Nvidia Quadro for mobile computing
Quadro 5010M
Quadro 5000M
Quadro 4000M
Quadro 3000M
Quadro 2000M
Quadro 1000M
Quadro FX 3800M
Quadro FX 3700M
Quadro FX 3600M
Quadro FX 2800M
Quadro FX 2700M
Quadro FX 1800M
Quadro FX 1700M
Quadro FX 1600M
Quadro FX 880M
Quadro FX 770M
Quadro FX 570M
Quadro FX 380M
Quadro FX 370M
Quadro FX 360M
Quadro NVS 5100M
Quadro NVS 4200M
Quadro NVS 3100M
Quadro NVS 2100M
Quadro NVS 320M
Quadro NVS 160M
Quadro NVS 150M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M
  • Models Tesla C1060, Tesla S1070, Tesla C2050/C2070, Tesla M2050/M2070, Tesla S2050 allow GPU calculations with double precision.

Features and Specifications of Various Versions

Feature support by compute capability (features not listed are supported for all compute capabilities 1.0, 1.1, 1.2, 1.3, 2.x):
  • Integer atomic functions operating on 32-bit words in global memory, and atomicExch() operating on 32-bit floating-point values in global memory: not supported on 1.0, supported from 1.1.
  • Integer atomic functions operating on 32-bit words in shared memory, atomicExch() operating on 32-bit floating-point values in shared memory, integer atomic functions operating on 64-bit words in global memory, and warp vote functions: supported from 1.2.
  • Double-precision floating-point operations: supported from 1.3.
  • Atomic functions operating on 64-bit integer values in shared memory, floating-point atomic addition operating on 32-bit words in global and shared memory, _ballot(), _threadfence_system(), _syncthreads_count(), _syncthreads_and(), _syncthreads_or(), surface functions, and a 3D grid of thread blocks: supported from 2.x.
Technical specifications by compute capability (version):
  • Maximum dimensionality of a grid of thread blocks: 2 (1.x), 3 (2.x)
  • Maximum x-, y-, or z-dimension of a grid of thread blocks: 65535
  • Maximum dimensionality of a thread block: 3
  • Maximum x- or y-dimension of a block: 512 (1.x), 1024 (2.x)
  • Maximum z-dimension of a block: 64
  • Maximum number of threads per block: 512 (1.x), 1024 (2.x)
  • Warp size: 32
  • Maximum number of resident blocks per multiprocessor: 8
  • Maximum number of resident warps per multiprocessor: 24 (1.0–1.1), 32 (1.2–1.3), 48 (2.x)
  • Maximum number of resident threads per multiprocessor: 768 (1.0–1.1), 1024 (1.2–1.3), 1536 (2.x)
  • Number of 32-bit registers per multiprocessor: 8 K (1.0–1.1), 16 K (1.2–1.3), 32 K (2.x)
  • Maximum amount of shared memory per multiprocessor: 16 KB (1.x), 48 KB (2.x)
  • Number of shared memory banks: 16 (1.x), 32 (2.x)
  • Amount of local memory per thread: 16 KB (1.x), 512 KB (2.x)
  • Constant memory size: 64 KB
  • Cache working set per multiprocessor for constant memory: 8 KB
  • Cache working set per multiprocessor for texture memory: device dependent, between 6 KB and 8 KB
  • Maximum width for a 1D texture reference bound to a CUDA array: 8192 (1.x), 32768 (2.x)
  • Maximum width for a 1D texture reference bound to linear memory: 2^27
  • Maximum width and number of layers for a 1D layered texture reference: 8192 x 512 (1.x), 16384 x 2048 (2.x)
  • Maximum width and height for a 2D texture reference bound to linear memory or a CUDA array: 65536 x 32768 (1.x), 65536 x 65535 (2.x)
  • Maximum width, height, and number of layers for a 2D layered texture reference: 8192 x 8192 x 512 (1.x), 16384 x 16384 x 2048 (2.x)
  • Maximum width, height, and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048 x 2048 x 2048
  • Maximum number of textures that can be bound to a kernel: 128
  • Maximum width for a 1D surface reference bound to a CUDA array: not supported (1.x), 8192 (2.x)
  • Maximum width and height for a 2D surface reference bound to a CUDA array: 8192 x 8192
  • Maximum number of surfaces that can be bound to a kernel: 8
  • Maximum number of instructions per kernel: 2 million

Example

cudaArray* cu_array;
texture<float, 2> tex;

// Allocate array
cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &description, width, height);

// Copy image data to array
cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim, 0>>>(d_odata, height, width);
cudaUnbindTexture(tex);

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);
    odata[y * width + x] = c;
}

import pycuda.driver as drv
import numpy

drv.init()
dev = drv.Device(0)
ctx = dev.make_context()

mod = drv.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)
dest = numpy.zeros_like(a)
multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

print dest - a*b

CUDA as a subject in universities

As of December 2009, the CUDA software model was taught in 269 universities around the world. In Russia, training courses on CUDA are given at St. Petersburg Polytechnic University, P. G. Demidov Yaroslavl State University, Moscow, Nizhny Novgorod, St. Petersburg, Tver, Kazan, Novosibirsk (including Novosibirsk State Technical University), Omsk and Perm State Universities, the Dubna International University of Nature, Society and Man, Ivanovo State Power Engineering University, Belgorod State University, Bauman Moscow State Technical University, the D. Mendeleev Russian University of Chemical Technology, and the Interregional Supercomputer Center of the RAS. In addition, in December 2009 it was announced that the first Russian scientific and educational center "Parallel Computing", located in the city of Dubna, had begun operating; its tasks include training and consulting on solving complex computing problems on GPUs.

In Ukraine, courses on CUDA are taught at the Kiev Institute of System Analysis.

Links

Official resources

  • CUDA Zone (Russian) - official CUDA website
  • CUDA GPU Computing (English) - official web forums dedicated to CUDA computing

Unofficial resources

Tom's Hardware
  • Dmitry Chekanov. nVidia CUDA: computing on a video card or the death of the CPU? . Tom's Hardware (June 22, 2008). Archived
  • Dmitry Chekanov. nVidia CUDA: Benchmarking GPU Applications for the Mass Market. Tom's Hardware (May 19, 2009). Archived from the original on March 4, 2012. Retrieved May 19, 2009.
iXBT.com
  • Alexey Berillo. NVIDIA CUDA - non-graphical computing on GPUs. Part 1 . iXBT.com (September 23, 2008). Archived from the original on March 4, 2012. Retrieved January 20, 2009.
  • Alexey Berillo. NVIDIA CUDA - non-graphical computing on GPUs. Part 2 . iXBT.com (October 22, 2008). - Examples of implementation of NVIDIA CUDA. Archived from the original on March 4, 2012. Retrieved January 20, 2009.
Other resources
  • Boreskov Alexey Viktorovich. CUDA Basics (January 20, 2009). Archived from the original on March 4, 2012. Retrieved January 20, 2009.
  • Vladimir Frolov. Introduction to CUDA technology. Online magazine “Computer Graphics and Multimedia” (December 19, 2008). Archived from the original on March 4, 2012. Retrieved October 28, 2009.
  • Igor Oskolkov. NVIDIA CUDA is an affordable ticket to the world of big computing. Computerra (April 30, 2009). Retrieved May 3, 2009.
  • Vladimir Frolov. Introduction to CUDA Technology (August 1, 2009). Archived from the original on March 4, 2012. Retrieved April 3, 2010.
  • GPGPU.ru. Using video cards for computing
  • Parallel Computing Center


I will talk about the key points of the CUDA compiler, the CUDA runtime API, and in conclusion, I will give an example of using CUDA for simple mathematical calculations.

Let's get started.

GPU computing model:

Let's look at the GPU computing model in more detail.

When using a GPU, you can use a grid of the required size and configure the blocks to suit the needs of your task.

CUDA and C language:

The CUDA technology itself (nvcc.exe compiler) introduces a number of additional extensions for the C language that are necessary for writing code for the GPU:
  1. Function specifiers, which indicate how and from where functions will be executed.
  2. Variable specifiers that serve to indicate the type of GPU memory used.
  3. GPU kernel launch specifiers.
  4. Built-in variables for identifying threads, blocks and other parameters when executing code in the GPU kernel.
  5. Additional variable types.
As mentioned, function specifiers determine how and from where functions will be called. There are a total of 3 such specifiers in CUDA:
  • __host__- executed on the CPU, called from the CPU (in principle, it doesn’t have to be specified).
  • __global__- executed on the GPU, called from the CPU.
  • __device__- executed on the GPU, called from the GPU.
Kernel startup specifiers are used to describe the number of blocks, threads, and memory you want to allocate when running on a GPU. The kernel startup syntax is as follows:

myKernelFunc<<<gridSize, blockSize, sharedMemSize, cudaStream>>>(float* param1, float* param2), where

  • gridSize– dimension of the block grid (dim3) allocated for calculations,
  • blockSize– size of the block (dim3) allocated for calculations,
  • sharedMemSize– the amount of additional memory allocated when the kernel starts,
  • cudaStream– the cudaStream_t variable, which specifies the stream in which the call will be made.
And of course myKernelFunc itself is a kernel function (__global__ specifier). Some variables can be omitted when calling the kernel, such as sharedMemSize and cudaStream.
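
To make the syntax concrete, here is a minimal sketch (the kernel name, its parameters, and the device pointers devParam1/devParam2 are made up for illustration) of a __global__ kernel and two equivalent ways of launching it, with and without the optional arguments:

// Kernel: executed on the GPU, called from the CPU
__global__ void myKernelFunc(float* param1, float* param2)
{
    int idx = threadIdx.x;      // position of the thread inside its block
    param1[idx] += param2[idx];
}

// Host-side launch (inside a __host__ function such as main):
dim3 gridSize(1, 1, 1);    // one block in the grid
dim3 blockSize(256, 1, 1); // 256 threads in the block

// Full form: grid, block, extra shared memory in bytes, stream
myKernelFunc<<<gridSize, blockSize, 0, 0>>>(devParam1, devParam2);

// Short form: sharedMemSize and cudaStream are omitted
myKernelFunc<<<gridSize, blockSize>>>(devParam1, devParam2);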

It is also worth mentioning the built-in variables:

  • gridDim– grid dimension, type dim3. Allows you to find out the size of the grid allocated during the current kernel call.
  • blockDim– block dimension, also of type dim3. Allows you to find out the size of the block allocated during the current kernel call.
  • blockIdx– index of the current block in GPU calculations, type uint3.
  • threadIdx– index of the current thread in GPU calculations, type uint3.
  • warpSize– the size of the warp, it is of type int (I haven’t tried to use it myself yet).
By the way, gridDim and blockDim are the very variables that we pass when launching the GPU kernel, although inside the kernel they are read-only.
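
For example, a common idiom (not specific to this article) is to combine these variables inside a kernel to obtain a thread's global index along the x-axis:

__global__ void exampleKernel(float* data, int n)
{
    // global index = offset of the block plus the position inside the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)   // the grid may contain more threads than there are elements
        data[i] *= 2.0f;
}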

Additional types of variables and their specifiers will be discussed directly in examples of working with memory.

CUDA host API:

Before you start using CUDA directly for computing, you need to familiarize yourself with the so-called CUDA host API, which is the link between the CPU and GPU. The CUDA host API can in turn be divided into a low-level API called the CUDA driver API, which provides access to the CUDA user-mode driver, and a high-level API, the CUDA runtime API. In my examples I will use the CUDA runtime API.

The CUDA runtime API includes the following groups of functions:

  • Device Management – functions for general GPU management (obtaining information about GPU capabilities, switching between GPUs when operating in SLI mode, etc.); a short sketch follows this list.
  • Thread Management – thread management.
  • Stream Management – management of CUDA streams.
  • Event Management – functions for creating and managing events.
  • Execution Control – functions for launching and executing CUDA kernels.
  • Memory Management – GPU memory management functions.
  • Texture Reference Manager – working with texture references through CUDA.
  • OpenGL Interoperability – functions for interoperating with the OpenGL API.
  • Direct3D 9 Interoperability – functions for interoperating with the Direct3D 9 API.
  • Direct3D 10 Interoperability – functions for interoperating with the Direct3D 10 API.
  • Error Handling – error handling functions.
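
As a small illustration of the Device Management group, here is a sketch (assuming a CUDA-capable GPU is present; error checking omitted for brevity) that queries the properties of every visible device:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);        // Device Management: how many CUDA GPUs are visible
    for (int i = 0; i < deviceCount; ++i)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);   // fill a structure describing the GPU's capabilities
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}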

We understand the work of the GPU:

As was said, the thread is the direct performer of the calculations. How, then, does parallelization of calculations between threads occur? Let's consider the operation of a single block.

Task. It is required to calculate the sum of two vectors with dimension N elements.

We know the maximum size of our block: 512*512*64 threads. Since our vector is one-dimensional, for now we will limit ourselves to using the x-dimension of our block, that is, we will use only one strip of threads from the block (Fig. 3).

Note that the x-dimension of our block is 512, which means that at one time we can add vectors of length N <= 512 elements. For more massive computations, however, you can use more blocks and multidimensional arrays. I also noticed one interesting detail: some of you may have thought that a single block can use 512*512*64 = 16,777,216 threads; naturally this is not the case: in general this product cannot exceed 512 (at least on my video card).

In the program itself, you must complete the following steps:

  1. Get data for calculations.
  2. Copy this data to GPU memory.
  3. Perform calculations on the GPU via a kernel function.
  4. Copy calculated data from GPU memory to RAM.
  5. View results.
  6. Free up used resources.
Let's move on to writing the code:

First of all, we’ll write a kernel function that will perform vector addition:

// Function of adding two vectors
__global__ void addVector(float * left, float * right, float * result)
{
//Get the id of the current thread.
int idx = threadIdx.x;

//Calculate the result.
result[idx] = left[idx] + right[idx];
}


This way, parallelization will be done automatically when the kernel starts. This function also uses the built-in threadIdx variable and its x field, which allows you to set the correspondence between the calculation of a vector element and the thread in the block. We calculate each element of the vector in a separate thread.

We write the code that is responsible for points 1 and 2 in the program:

#define SIZE 512
__host__ int main()
{
//Allocate memory for vectors
float * vec1 = new float[SIZE];
float * vec2 = new float[SIZE];
float * vec3 = new float[SIZE];

//Initialize the values ​​of the vectors
for (int i = 0; i < SIZE; i++)
{
vec1[i] = i;
vec2[i] = i;
}

//Pointers to video card memory
float * devVec1;
float * devVec2;
float * devVec3;

//Allocate memory for vectors on the video card
cudaMalloc((void **)&devVec1, sizeof (float ) * SIZE);
cudaMalloc((void **)&devVec2, sizeof (float ) * SIZE);
cudaMalloc((void **)&devVec3, sizeof (float ) * SIZE);

//Copy data to video card memory
cudaMemcpy(devVec1, vec1, sizeof (float ) * SIZE, cudaMemcpyHostToDevice);
cudaMemcpy(devVec2, vec2, sizeof (float ) * SIZE, cudaMemcpyHostToDevice);

}



To allocate memory on the video card, use the function cudaMalloc, which has the following prototype:
cudaError_t cudaMalloc(void** devPtr, size_t count), where

  1. devPtr– a pointer into which the address of the allocated memory is written,
  2. count– size of allocated memory in bytes.
Returns:
  1. cudaSuccess– upon successful memory allocation
  2. cudaErrorMemoryAllocation– in case of memory allocation error
To copy data to the video card memory, cudaMemcpy is used, which has the following prototype:
cudaError_t cudaMemcpy(void* dst, const void* src ,size_t count, enum cudaMemcpyKind kind), where
  1. dst– a pointer containing the address of the copying destination,
  2. src– a pointer containing the address of the copy source,
  3. count– size of the copied resource in bytes,
  4. cudaMemcpyKind– an enumeration indicating the copying direction (can be cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyHostToHost, cudaMemcpyDeviceToDevice).
Returns:
  1. cudaSuccess – if copying is successful
  2. cudaErrorInvalidValue – invalid argument parameters (for example, copy size is negative)
  3. cudaErrorInvalidDevicePointer – invalid memory pointer in the video card
  4. cudaErrorInvalidMemcpyDirection – incorrect direction (for example, the source and destination of the copy are mixed up)
Now we move on to directly calling the kernel for calculations on the GPU.

dim3 gridSize = dim3(1, 1, 1); //Size of grid used
dim3 blockSize = dim3(SIZE, 1, 1); //Size of block used


addVector<<<gridSize, blockSize>>>(devVec1, devVec2, devVec3);



In our case, it is not necessary to determine the size of the grid and block, since we are using only one block and one dimension in the block, so the code above can be written:
addVector<<<1, SIZE>>>(devVec1, devVec2, devVec3);



Now all we have to do is copy the calculation result from video memory to the host memory. But kernel functions have a feature - asynchronous execution, that is, if after calling the kernel the next section of code starts working, this does not mean that the GPU has completed the calculations. To complete a given kernel function, it is necessary to use synchronization tools, such as events. Therefore, before copying the results to the host, we synchronize the GPU threads via event.

Code after calling the kernel:

//Make a kernel function call
addVector<<<gridSize, blockSize>>>(devVec1, devVec2, devVec3);

//Event handle
cudaEvent_t syncEvent;

cudaEventCreate(&syncEvent); //Create event
cudaEventRecord(syncEvent, 0); //Write event
cudaEventSynchronize(syncEvent); //Synchronize event

//Only now we get the calculation result
cudaMemcpy(vec3, devVec3, sizeof (float ) * SIZE, cudaMemcpyDeviceToHost);



Let's take a closer look at the functions from the Event Management API.

Event is created using the function cudaEventCreate, the prototype of which looks like:
cudaError_t cudaEventCreate(cudaEvent_t* event), where

  1. *event– pointer for recording the event handle.
Returns:
  1. cudaSuccess – if successful
  2. cudaErrorMemoryAllocation – memory allocation error
An event is recorded using the function cudaEventRecord, the prototype of which looks like:
cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream), where
  1. event– handle of the event being written,
  2. stream – the stream in which we record the event (in our case this is the default zero stream).
Returns:
  1. cudaSuccess – if successful
  2. cudaErrorInvalidValue – invalid value
  3. cudaErrorInitializationError – initialization error
  4. cudaErrorPriorLaunchFailure – error during the previous asynchronous launch of the function
Event synchronization is performed by the cudaEventSynchronize function. This function waits for all GPU threads to finish working and for the specified event to pass, and only then gives control to the calling program. The function prototype looks like:
cudaError_t cudaEventSynchronize(cudaEvent_t event), where
  1. event– handle of the event, the passage of which is expected.
Returns:
  1. cudaSuccess – if successful
  2. cudaErrorInitializationError – initialization error
  3. cudaErrorPriorLaunchFailure – error during the previous asynchronous launch of the function
  4. cudaErrorInvalidValue – invalid value
  5. cudaErrorInvalidResourceHandle – invalid event handle
You can understand how cudaEventSynchronize works from the following diagram:

In Figure 4, the block “Waiting for the Event to Pass” is a call to the cudaEventSynchronize function.

Well, in conclusion, we display the result on the screen and clean the allocated resources.

//Calculation results
for (int i = 0; i< SIZE; i++)
{
printf("Element #%i: %.1f\n" , i , vec3[i]);
}

//
// Release resources
//

cudaEventDestroy(syncEvent);

cudaFree(devVec1);
cudaFree(devVec2);
cudaFree(devVec3);

delete[] vec1; vec1 = 0;
delete[] vec2; vec2 = 0;
delete[] vec3; vec3 = 0;



I think there is no need to describe the resource release functions. Perhaps we can remind you that they also return cudaError_t values ​​if there is a need to check their operation.

Conclusion

I hope this material will help you understand how the GPU functions. I described the most important points that you need to know to work with CUDA. Try to write the addition of two matrices yourself, but do not forget about the hardware limitations of the video card.

P.S.: It didn’t turn out very briefly. I hope I didn't bore you. If you need all the source code, I can send it by email.
P.P.S.: Ask questions.


In the development of modern processors there is a tendency toward a gradual increase in the number of cores, which increases their parallel computing capabilities. However, GPUs that significantly outperform CPUs in this respect have long been available, and some companies have already taken these GPU capabilities into account. The first attempts to use graphics accelerators for general-purpose (non-graphics) computing were made as far back as the late 1990s, but only the emergence of shaders provided the impetus for a completely new technology, and in 2003 the concept of GPGPU (general-purpose computation on graphics processing units) appeared. An important role in the development of this initiative was played by BrookGPU, a special extension of the C language. Before the advent of BrookGPU, programmers could work with GPUs only through the Direct3D or OpenGL APIs. Brook allowed developers to work in a familiar environment, while the compiler itself, using special libraries, implemented the low-level interaction with the GPU.

Such progress could not help but attract the attention of the leaders of this industry - AMD and NVIDIA, who began developing their own software platforms for non-graphical computing on their video cards. No one knows better than GPU developers all the nuances and features of their products, which allows these same companies to optimize the software package for specific hardware solutions as efficiently as possible. Currently, NVIDIA is developing the CUDA (Compute Unified Device Architecture) platform; AMD calls a similar technology CTM (Close To Metal) or AMD Stream Computing. We will look at some of the capabilities of CUDA and evaluate in practice the computing capabilities of the G92 graphics chip of the GeForce 8800 GT video card.

But first, let's look at some of the nuances of performing calculations on GPUs. Their main advantage is that the graphics chip is designed from the outset to execute many threads, whereas each core of a conventional CPU executes a stream of sequential instructions. Any modern GPU is a multiprocessor consisting of several computing clusters with many ALUs in each. The most powerful modern chip, the GT200, consists of 10 such clusters, each with 24 stream processors. The tested GeForce 8800 GT video card, based on the G92 chip, has seven large computing units with 16 stream processors each. CPUs use SIMD SSE units for vector calculations (single instruction, multiple data: one instruction is executed on multiple data), which requires packing the data into 4-component vectors. The GPU handles threads scalarly: one instruction is applied across several threads (SIMT, single instruction multiple threads). This spares developers from packing data into vectors and allows arbitrary branching in threads. Each GPU compute unit has direct memory access, and the video memory bandwidth is higher thanks to the use of several separate memory controllers (the top-end GT200 has 8 channels of 64 bits each) and high operating frequencies.

In general, in certain tasks when working with large amounts of data, GPUs are much faster than CPUs. Below you see an illustration of this statement:


The chart shows the dynamics of CPU and GPU performance growth since 2003. NVIDIA likes to cite this data as advertising in its documents, but they are only theoretical calculations and in reality the gap, of course, may turn out to be much smaller.

But be that as it may, there is a huge potential of GPUs that can be used, and which requires a specific approach to software development. All this is implemented in the CUDA hardware and software environment, which consists of several software levels - the high-level CUDA Runtime API and the low-level CUDA Driver API.


CUDA uses the standard C language for programming, which is one of its main advantages for developers. Out of the box, CUDA includes BLAS (basic linear algebra subprograms) and FFT (fast Fourier transform) libraries. CUDA can also interoperate with the OpenGL or DirectX graphics APIs, allows development at a low level, and is characterized by an optimized distribution of data streams between the CPU and GPU. CUDA calculations are performed simultaneously with graphics ones, unlike the similar AMD platform, where a special virtual machine is launched for calculations on the GPU. But such "cohabitation" is also fraught with errors if a large load is created by the graphics API while CUDA is running, since graphical operations still have higher priority. The platform is compatible with 32- and 64-bit versions of Windows XP, Windows Vista, Mac OS X and various Linux distributions. The platform is open, and on the NVIDIA website, in addition to special drivers for the video card, you can download the CUDA Toolkit and CUDA Developer SDK software packages, which include a compiler, debugger, standard libraries and documentation.

As for the practical implementation of CUDA, for a long time this technology was used only for highly specialized mathematical calculations in the field of particle physics, astrophysics, medicine or forecasting changes in the financial market, etc. But this technology is gradually becoming closer to ordinary users, in particular, special plug-ins for Photoshop are appearing that can use the computing power of the GPU. On a special page you can study the entire list of programs that use the capabilities of NVIDIA CUDA.

As a practical test of the new technology on the MSI NX8800GT-T2D256E-OC video card, we will use the TMPGEnc program. This product is commercial (the full version costs $100), but for MSI video cards it comes as a bonus in a trial version for a period of 30 days. You can download this version from the developer’s website, but to install TMPGEnc 4.0 XPress MSI Special Edition you need the original disk with drivers from the MSI card - without it the program will not be installed.

To display the most complete information about computing capabilities in CUDA and compare them with other video adapters, you can use the special CUDA-Z utility. This is the information it gives about our GeForce 8800GT video card:




Compared to the reference models, our sample operates at higher frequencies: the raster domain is 63 MHz above nominal, the shader units are 174 MHz faster, and the memory is 100 MHz faster.

We will compare the conversion speed of the same HD video when calculating only using the CPU and with additional activation of CUDA in the TMPGEnc program on the following configuration:

  • Processor: Pentium Dual-Core E5200 2.5 GHz;
  • Motherboard: Gigabyte P35-S3;
  • Memory: 2x1GB GoodRam PC6400 (5-5-5-18-2T)
  • Video card: MSI NX8800GT-T2D256E-OC;
  • Hard drive: 320GB WD3200AAKS;
  • Power supply: CoolerMaster eXtreme Power 500-PCAP;
  • Operating system: Windows XP SP2;
  • TMPGEnc 4.0 XPress 4.6.3.268;
  • Video card drivers: ForceWare 180.60.
For the tests, the processor was overclocked to 3 GHz (in the 11.5x261 MHz configuration) and to 4 GHz (11.5x348 MHz), with a RAM frequency of 835 MHz in both cases. The test video was in Full HD 1920x1080 resolution, one minute and twenty seconds long. To create additional load, a noise reduction filter was turned on with its settings left at default.


Encoding was carried out using the DivX 6.8.4 codec. In the quality settings of this codec, all values ​​are left at default, multithreading is enabled.


Multithreading support in TMPGEnc is initially enabled in the CPU/GPU settings tab. CUDA is also activated in the same section.


As you can see from the above screenshot, filter processing using CUDA is enabled, but the hardware video decoder is not enabled. The program documentation warns that activating the last parameter increases the file processing time.

Based on the results of the tests, the following data was obtained:


At 4 GHz with CUDA enabled, we only gained a couple of seconds (or 2%), which isn't particularly impressive. But at a lower frequency, the increase from activating this technology allows you to save about 13% of time, which will be quite noticeable when processing large files. But still the results are not as impressive as expected.

The TMPGEnc program has a CPU and CUDA load indicator; in this test configuration, it showed the CPU load at about 20%, and the graphics core at the remaining 80%. As a result, we have the same 100% as when converting without CUDA, and there may not be a time difference at all (but it still exists). The small memory capacity of 256 MB is also not a limiting factor. Judging by the readings from RivaTuner, no more than 154 MB of video memory was used during operation.



Conclusions

The TMPGEnc program is one of those that is introducing CUDA technology to the masses. Using the GPU in this program speeds up video processing and significantly relieves the central processor, allowing the user to comfortably do other tasks at the same time. In our specific example, the GeForce 8800GT 256MB video card only slightly improved the timings when converting video on an overclocked Pentium Dual-Core E5200 processor. But it is clearly visible that as the frequency decreases, the gain from activating CUDA increases; on weaker processors the gain from using it will be much greater. Given this dependence, it is quite logical to assume that as the load increases (for example, with a very large number of additional video filters), the system with CUDA will show an even larger difference in total encoding time. Also, do not forget that the G92 is not the most powerful chip at the moment, and more modern video cards will provide significantly higher performance in such applications. However, while the application is running, the GPU is not fully loaded and, probably, the load distribution depends on each specific configuration, namely on the processor/video card combination, which ultimately can give a larger (or smaller) percentage gain from activating CUDA. In any case, for those who work with large volumes of video data, this technology will still allow them to save a significant amount of time.

True, CUDA has not yet gained widespread popularity; the quality of software working with this technology requires improvement. In the TMPGEnc 4.0 XPress program we reviewed, this technology did not always work. The same video could be re-encoded several times, and then suddenly, the next time it was launched, the CUDA load was already 0%. And this phenomenon was completely random on completely different operating systems. Also, the program in question refused to use CUDA when encoding into the XviD format, but there were no problems with the popular DivX codec.

As a result, so far CUDA technology can significantly increase the performance of personal computers only in certain tasks. But the scope of application of such technology will expand, and the process of increasing the number of cores in conventional processors indicates an increase in the demand for parallel multi-threaded computing in modern software applications. It’s not for nothing that recently all industry leaders have become obsessed with the idea of ​​combining CPU and GPU within one unified architecture (just remember the much-advertised AMD Fusion). Perhaps CUDA is one of the stages in the process of this unification.



However, searching for the combination "CUDA scan" turned up only two articles, neither of which had anything to do with the scan algorithm on the GPU itself, even though it is one of the most basic algorithms. Therefore, inspired by the course I had just watched on Udacity, Intro to Parallel Programming, I decided to write a more complete series of articles about CUDA. I will say right away that the series will be based on that course, and if you have the time it will be much more useful to take it. The following articles are currently planned:
Part 1: Introduction.
Part 2: GPU Hardware and Parallel Communication Patterns.
Part 3: Fundamental GPU algorithms: reduce, scan and histogram.
Part 4: Fundamental GPU algorithms: compact, segmented scan, sort. Practical application of some algorithms.
Part 5: Optimizing GPU programs.
Part 6: Examples of parallelization of sequential algorithms.
Part 7: Additional topics in parallel programming, dynamic parallelism.

Latency vs throughput

The first question that everyone should ask before using a GPU to solve their problems is what purposes is a GPU good for, and when should it be used? To answer, you need to define 2 concepts:
Latency – the time spent executing a single instruction/operation.
Throughput – the number of instructions/operations completed per unit of time.
A simple example: we have a passenger car with a speed of 90 km/h and a capacity of 4 people, and a bus with a speed of 60 km/h and a capacity of 20 people. If we take moving 1 person over 1 kilometer as an operation, then the latency of the car is 3600/90 = 40 s (that is how long it takes 1 person to cover 1 kilometer), and its throughput is 4/40 = 0.1 operations/second; the latency of the bus is 3600/60 = 60 s, and its throughput is 20/60 = 0.33 operations/second.
So, the CPU is the car, the GPU is the bus: it has high latency but also high throughput. If for your task the latency of each specific operation is not as important as the number of these operations per second, it is worth considering the use of a GPU.
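
The same arithmetic, written out as a tiny sketch with the numbers from the car/bus example above:

#include <cstdio>

int main()
{
    // Operation = moving one person over one kilometer.
    // Car: 90 km/h, 4 seats; bus: 60 km/h, 20 seats.
    double carLatency = 3600.0 / 90.0;           // 40 s per kilometer
    double busLatency = 3600.0 / 60.0;           // 60 s per kilometer
    double carThroughput = 4.0 / carLatency;     // 0.1 operations per second
    double busThroughput = 20.0 / busLatency;    // ~0.33 operations per second
    printf("car: latency %.0f s, throughput %.2f op/s\n", carLatency, carThroughput);
    printf("bus: latency %.0f s, throughput %.2f op/s\n", busLatency, busThroughput);
    return 0;
}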

Basic concepts and terms of CUDA

So, let's understand the CUDA terminology:

  • Device- GPU. Acts as a “slave” - it does only what the CPU tells it to do.
  • Host- CPU. Performs a management role - runs tasks on the device, allocates memory on the device, moves memory to/from the device. And yes, using CUDA assumes that both the device and the host have their own separate memory.
  • Kernel- a task launched by the host on the device.
When using CUDA, you simply write code in your favorite programming language (a number of languages besides C and C++ are supported), and the CUDA compiler will then generate code separately for the host and separately for the device. A small caveat: the code for the device must be written in C, with some "CUDA extensions".

Main stages of a CUDA program

  1. The host allocates the required amount of memory on the device.
  2. The host copies data from its memory to the device's memory.
  3. The host starts execution of certain kernels on the device.
  4. The device runs kernels.
  5. The host copies the results from the device memory to its memory.
Naturally, for the most efficient use of the GPU, it is necessary that the ratio of the time spent on cores to the time spent on memory allocation and data movement be as large as possible.
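
Schematically, with the CUDA runtime API these steps look roughly like this (a sketch only: someKernel, the buffer names and the size n are illustrative, and error checking is omitted; complete working examples follow later in the article):

const int n = 1024;
float h_in[1024], h_out[1024];                       // 1. input data prepared on the host

float *d_in, *d_out;
cudaMalloc(&d_in,  n * sizeof(float));               // the host allocates device memory
cudaMalloc(&d_out, n * sizeof(float));
cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // 2. host -> device

someKernel<<<n / 256, 256>>>(d_in, d_out, n);        // 3-4. the host launches the kernel, the device runs it

cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost); // 5. results back to the host
cudaFree(d_in);
cudaFree(d_out);                                     // finally, free the device memory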

Cores

Let's take a closer look at the process of writing kernel code and launching it. An important principle: kernels are written as (almost) ordinary sequential programs; you will not see thread creation and launching inside the kernel code itself. Instead, to organize parallel computation, the GPU runs a large number of copies of the same kernel in different threads; more precisely, you yourself say how many threads to run. And yes, returning to the question of GPU efficiency: the more threads you run (provided they all do useful work), the better.
Kernel code differs from regular sequential code in the following ways:
  1. Inside the kernels, you have the opportunity to find out the “identifier” or, more simply, the position of the thread that is currently running - using this position we ensure that the same kernel will work with different data depending on the thread in which it is running. By the way, this organization of parallel computing is called SIMD (Single Instruction Multiple Data) - when several processors simultaneously perform the same operation but on different data.
  2. In some cases, kernel code needs to use different synchronization methods.
How do we set the number of threads in which the kernel will be launched? Because the GPU is still Graphics Processing Unit, then this naturally influenced the CUDA model, namely the way the number of threads is set:
  • First, the dimensions of the so-called grid are set, in 3D coordinates: grid_x, grid_y, grid_z. As a result, the grid will consist of grid_x*grid_y*grid_z blocks.
  • Then the block dimensions are set in 3D coordinates: block_x, block_y, block_z. As a result, the block will consist of block_x*block_y*block_z threads. In total, we have grid_x*grid_y*grid_z*block_x*block_y*block_z threads. An important note: the maximum number of threads in one block is limited and depends on the GPU model; typical values are 512 (older models) and 1024 (newer models).
  • Inside the kernel, the variables threadIdx and blockIdx with fields x, y, z are available; they contain the 3D coordinates of the thread within the block and of the block within the grid, respectively. The variables blockDim and gridDim with the same fields are also available, holding the block and grid sizes, respectively.
As you can see, this method of launching threads is really suitable for processing 2D and 3D images: for example, if you need to process each pixel of a 2D or 3D image in a certain way, then after choosing the block sizes (depending on the size of the image, the processing method and the GPU model), the grid sizes are chosen such that the entire image is covered, possibly in excess - if the image dimensions are not evenly divided by the block dimensions.
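
In practice this "covering" of the image is just a ceiling division when the grid size is computed; a small sketch (the kernel name processImage and the 16x16 block size are arbitrary choices for illustration):

dim3 blockSize(16, 16, 1);
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,   // round up so the whole image is covered
              (height + blockSize.y - 1) / blockSize.y,
              1);
processImage<<<gridSize, blockSize>>>(d_image, width, height);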

We are writing the first program on CUDA

Enough theory, time to write code. Instructions for installing and configuring CUDA for different OS - docs.nvidia.com/cuda/index.html. Also, for ease of working with image files, we will use OpenCV, and to compare the performance of CPU and GPU - OpenMP.
Let's set a fairly simple task: converting a color image to grayscale. The brightness of a pixel pix in the grayscale image is calculated by the formula: Y = 0.299*pix.R + 0.587*pix.G + 0.114*pix.B.
First, let's write the skeleton of the program:

main.cpp

#include <iostream>
#include <cstring>
#include <chrono>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include "openMP.hpp"
#include "CUDA_wrappers.hpp"
#include "common/image_helpers.hpp"

using namespace cv;
using namespace std;

int main(int argc, char** argv)
{
  using namespace std::chrono;

  if (argc != 2)
  {
    cout << "Usage: convert_to_grayscale imagefile" << endl;
    return -1;
  }

  Mat image, imageGray;
  uchar4 *imageArray;
  unsigned char *imageGrayArray;
  prepareImagePointers(argv[1], image, &imageArray, imageGray, &imageGrayArray, CV_8UC1);
  int numRows = image.rows, numCols = image.cols;

  auto start = system_clock::now();
  RGBtoGrayscaleOpenMP(imageArray, imageGrayArray, numRows, numCols);
  auto duration = duration_cast<milliseconds>(system_clock::now() - start);
  cout << "OpenMP time (ms): " << duration.count() << endl;

  memset(imageGrayArray, 0, sizeof(unsigned char)*numRows*numCols);

  RGBtoGrayscaleCUDA(imageArray, imageGrayArray, numRows, numCols);

  return 0;
}


Everything here is quite obvious: we read the image file, prepare pointers to the color and grayscale images, run the OpenMP version and the CUDA version, and measure the time. The prepareImagePointers function looks like this:

prepareImagePointers

template <typename T1, typename T2>
void prepareImagePointers(const char * const inputImageFileName,
                          cv::Mat& inputImage, T1** inputImageArray,
                          cv::Mat& outputImage, T2** outputImageArray,
                          const int outputImageType)
{
  using namespace std;
  using namespace cv;

  inputImage = imread(inputImageFileName, IMREAD_COLOR);
  if (inputImage.empty())
  {
    cerr << "Couldn't open input file." << endl;
    exit(1);
  }

  //allocate memory for the output
  outputImage.create(inputImage.rows, inputImage.cols, outputImageType);

  cvtColor(inputImage, inputImage, cv::COLOR_BGR2BGRA);

  *inputImageArray = (T1*)inputImage.ptr(0);
  *outputImageArray = (T2*)outputImage.ptr(0);
}


I used a little trick here: the fact is that we do very little work per image pixel, so with the CUDA version the above-mentioned problem arises of the ratio of time spent on useful operations to time spent on memory allocation and data copying, and as a result the total time of the CUDA version would be larger than the OpenMP version, whereas we want to show that CUDA is faster :) Therefore, for CUDA only the time spent on the actual image conversion will be measured, without taking memory operations into account. In my defense, I will say that for a large class of tasks the useful time will still dominate, and CUDA will be faster even when memory operations are taken into account.
Next we will write the code for the OpenMP option:

openMP.hpp

#include <omp.h>
#include <cuda_runtime.h>

void RGBtoGrayscaleOpenMP(uchar4 *imageArray, unsigned char *imageGrayArray, int numRows, int numCols)
{
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < numRows; ++i)
  {
    for (int j = 0; j < numCols; ++j)
    {
      const uchar4 pixel = imageArray[i*numCols+j];
      imageGrayArray[i*numCols+j] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
    }
  }
}


Everything is pretty straightforward - we just added a directive omp parallel for to single-threaded code - this is the beauty and power of OpenMP. I tried to play with the parameter schedule, but it only turned out worse than without it.
Finally, we move on to CUDA. Here we will write in more detail. First you need to allocate memory for the input data, move it from the CPU to the GPU and allocate memory for the output data:

Hidden text

void RGBtoGrayscaleCUDA(const uchar4 * const h_imageRGBA, unsigned char* const h_imageGray, size_t numRows, size_t numCols)
{
  uchar4 *d_imageRGBA;
  unsigned char *d_imageGray;
  const size_t numPixels = numRows * numCols;

  cudaSetDevice(0);
  checkCudaErrors(cudaGetLastError());

  //allocate memory on the device for both input and output
  checkCudaErrors(cudaMalloc(&d_imageRGBA, sizeof(uchar4) * numPixels));
  checkCudaErrors(cudaMalloc(&d_imageGray, sizeof(unsigned char) * numPixels));

  //copy the input array to the GPU
  checkCudaErrors(cudaMemcpy(d_imageRGBA, h_imageRGBA, sizeof(uchar4) * numPixels, cudaMemcpyHostToDevice));


Note the standard naming convention for variables in CUDA: data on the CPU starts with h_ (host), data on the GPU with d_ (device). checkCudaErrors is a macro taken from the Udacity github repository of the course. It looks like this:

Hidden text

#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

#define checkCudaErrors(val) check((val), #val, __FILE__, __LINE__)

template <typename T>
void check(T err, const char* const func, const char* const file, const int line)
{
  if (err != cudaSuccess)
  {
    std::cerr << "CUDA error at: " << file << ":" << line << std::endl;
    std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
    exit(1);
  }
}


cudaMalloc is the analogue of malloc for the GPU; cudaMemcpy is the analogue of memcpy and has an additional parameter in the form of an enum that indicates the type of copy: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice.
Next, you need to set the dimensions of the grid and block and call the kernel, not forgetting to measure the time:

Hidden text

dim3 blockSize;
dim3 gridSize;
int threadNum;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

threadNum = 1024;
blockSize = dim3(threadNum, 1, 1);
gridSize = dim3(numCols/threadNum+1, numRows, 1);

cudaEventRecord(start);
rgba_to_grayscale_simple<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaDeviceSynchronize();
checkCudaErrors(cudaGetLastError());

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
std::cout << "CUDA time simple (ms): " << milliseconds << std::endl;


Pay attention to the kernel call format: kernel_name<<<gridSize, blockSize>>>(arguments). The kernel code itself is also not very complicated:

rgba_to_grayscale_simple

__global__ void rgba_to_grayscale_simple(const uchar4* const d_imageRGBA,
                                         unsigned char* const d_imageGray,
                                         int numRows, int numCols)
{
  int y = blockDim.y*blockIdx.y + threadIdx.y;
  int x = blockDim.x*blockIdx.x + threadIdx.x;
  if (x >= numCols || y >= numRows)
    return;
  const int offset = y*numCols + x;
  const uchar4 pixel = d_imageRGBA[offset];
  d_imageGray[offset] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
}


Here we calculate the y and x coordinates of the pixel being processed using the previously described variables threadIdx, blockIdx and blockDim, and perform the conversion. Note the check if (x >= numCols || y >= numRows): since the image dimensions will not necessarily divide evenly into the block sizes, some blocks may "stick out" beyond the image, so this check is necessary. Also, the kernel function must be marked with the __global__ specifier.
The last step is to copy the result back from the GPU to the CPU and free the allocated memory:

Hidden text

checkCudaErrors(cudaMemcpy(h_imageGray, d_imageGray, sizeof(unsigned char) * numPixels, cudaMemcpyDeviceToHost));
cudaFree(d_imageGray);
cudaFree(d_imageRGBA);


By the way, CUDA allows you to use a C++ compiler for host code - so you can easily write wrappers to automatically free memory.
So, let's run it and measure (the size of the input image is 10,109 × 4,542 pixels):
OpenMP time (ms): 45
CUDA time simple (ms): 43.1941
The configuration of the machine on which the tests were carried out:

Hidden text

Processor: Intel® Core(TM) i7-3615QM CPU @ 2.30GHz.
GPU: NVIDIA GeForce GT 650M, 1024 MB, 900 MHz.
RAM: DDR3, 2x4GB, 1600 MHz.
OS: OS X 10.9.5.
Compiler: g++ (GCC) 4.9.2 20141029.
CUDA compiler: Cuda compilation tools, release 6.0, V6.0.1.
Supported OpenMP version: OpenMP 4.0.


It turned out somewhat unimpressive :) But the problem is still the same: too little work is done per pixel. We launch thousands of threads, each of which finishes almost instantly. With a CPU this problem does not arise: OpenMP launches a relatively small number of threads (8 in my case) and divides the work between them evenly, so the processor cores are busy almost 100%, whereas with the GPU we are, in effect, not using all of its power. The solution is pretty obvious: process several pixels per thread in the kernel. The new, optimized kernel will look like this:

rgba_to_grayscale_optimized

#define WARP_SIZE 32

__global__ void rgba_to_grayscale_optimized(const uchar4* const d_imageRGBA,
                                            unsigned char* const d_imageGray,
                                            int numRows, int numCols,
                                            int elemsPerThread)
{
  int y = blockDim.y*blockIdx.y + threadIdx.y;
  int x = blockDim.x*blockIdx.x + threadIdx.x;
  if (y >= numRows)
    return;
  const int loop_start = (x/WARP_SIZE * WARP_SIZE)*(elemsPerThread-1)+x;
  for (int i=loop_start, j=0; j<elemsPerThread && i<numCols; i+=WARP_SIZE, ++j)
  {
    const uchar4 pixel = d_imageRGBA[y*numCols+i];
    d_imageGray[y*numCols+i] = 0.299f*pixel.x + 0.587f*pixel.y + 0.114f*pixel.z;
  }
}


It is not as simple as the previous kernel. Now each thread processes elemsPerThread pixels, and not consecutive ones, but with a distance of WARP_SIZE between them. What WARP_SIZE is, why it equals 32, and why it pays to process pixels this way will be discussed in more detail in the following parts; for now I will only say that this gives us more efficient work with memory. Since each thread now processes elemsPerThread pixels with a stride of WARP_SIZE between them, the x-coordinate of the first pixel for a given thread, based on its position in the block, is calculated with a slightly more complicated formula than before.
This kernel is launched as follows:

Hidden text

threadNum = 128;
const int elemsPerThread = 16;

blockSize = dim3(threadNum, 1, 1);
gridSize = dim3(numCols / (threadNum*elemsPerThread) + 1, numRows, 1);

cudaEventRecord(start);
rgba_to_grayscale_optimized<<<gridSize, blockSize>>>(d_imageRGBA, d_imageGray, numRows, numCols, elemsPerThread);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaDeviceSynchronize();
checkCudaErrors(cudaGetLastError());

milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
std::cout << "CUDA time optimized (ms): " << milliseconds << std::endl;


The number of blocks by x-coordinate is now calculated as numCols / (threadNum*elemsPerThread) + 1 instead of numCols/threadNum + 1. Otherwise everything remains the same.
Let's launch:
OpenMP time (ms): 44
CUDA time simple (ms): 53.1625
CUDA time optimized (ms): 15.9273
We received a speed increase of 2.76 times (again, not taking into account the time for memory operations) - for such a simple problem this is quite good. Yes, yes, this task is too simple - the CPU copes with it quite well. As can be seen from the second test, a simple GPU implementation may even be inferior in speed to a CPU implementation.
That's all for today, in the next part we will look at GPU hardware and basic parallel communication patterns.
All source code is available on bitbucket.


Devices for turning personal computers into small supercomputers have been known for quite some time. Back in the 80s of the last century, so-called transputers were offered on the market, which were inserted into the then common ISA expansion slots. At first, their performance in relevant tasks was impressive, but then the growth in the speed of universal processors accelerated, they strengthened their position in parallel computing, and there was no point in transputers. Although similar devices still exist today - these are a variety of specialized accelerators. But often the scope of their application is narrow and such accelerators are not particularly widespread.

But recently, the baton of parallel computing has passed to the mass market, one way or another connected with 3D games. General-purpose devices with multi-core processors for parallel vector calculations used in 3D graphics achieve high peak performance that general-purpose processors cannot match. Of course, maximum speed is achieved only in a number of convenient tasks and has some limitations, but such devices have already begun to be used quite widely in areas for which they were not originally intended. An excellent example of such a parallel processor is the Cell processor, developed by the Sony-Toshiba-IBM alliance and used in the Sony PlayStation 3 game console, as well as all modern video cards from market leaders - Nvidia and AMD.

We will not touch on Cell today, although it appeared earlier and is a general-purpose processor with additional vector capabilities. For 3D video accelerators, the first technologies for non-graphical general-purpose computing, GPGPU (General-Purpose computation on GPUs), appeared a few years ago. After all, modern video chips contain hundreds of mathematical execution units, and this power can be used to significantly accelerate many computationally intensive applications. Current generations of GPUs have a fairly flexible architecture, which, together with high-level programming languages and hardware-software architectures like the one discussed in this article, reveals these capabilities and makes them much more accessible.

The creation of GPGPU was prompted by the emergence of fairly fast and flexible shader programs that modern video chips can execute. The developers decided to have GPUs calculate not only images in 3D applications, but also take part in other parallel calculations. In GPGPU, the graphics APIs OpenGL and Direct3D were used for this: data was transferred to the video chip in the form of textures, and computation programs were loaded in the form of shaders. The disadvantages of this method are the relatively high complexity of programming, the low speed of data exchange between the CPU and GPU, and other limitations, which we will discuss later.

GPU computing has developed and is developing very quickly. And subsequently, two major video chip manufacturers, Nvidia and AMD, developed and announced corresponding platforms called CUDA (Compute Unified Device Architecture) and CTM (Close To Metal or AMD Stream Computing), respectively. Unlike previous GPU programming models, these were designed with direct access to the hardware capabilities of the video cards. The platforms are not compatible with each other, CUDA is an extension of the C programming language, and CTM is a virtual machine that executes assembly code. But both platforms eliminated some of the important limitations of previous GPGPU models that used a traditional graphics pipeline and the corresponding Direct3D or OpenGL interfaces.

Of course, open standards using OpenGL seem to be the most portable and universal, allowing the same code to be used across video chips from different manufacturers. But such methods have a lot of disadvantages; they are much less flexible and not so convenient to use. In addition, they do not allow the use of specific capabilities of certain video cards, such as fast shared (common) memory present in modern computing processors.

That is why Nvidia released the CUDA platform, a C-like programming language with its own compiler and libraries for GPU computing. Of course, writing optimal code for video chips is not at all so simple and this task requires a lot of manual work, but CUDA reveals all the possibilities and gives the programmer more control over the hardware capabilities of the GPU. It is important that Nvidia CUDA support is available in the G8x, G9x and GT2xx chips used in Geforce 8, 9 and 200 series video cards, which are very widespread. The final version of CUDA 2.0 has now been released, which brings some new features, such as support for double precision calculations. CUDA is available on 32-bit and 64-bit Linux, Windows and MacOS X operating systems.

Difference between CPU and GPU in parallel calculations

The increase in the clock frequencies of universal processors has run up against physical limitations and high power consumption, so their performance is increasingly being raised by placing several cores on one chip. The processors currently sold contain only up to four cores (further growth will not be rapid), and they are designed for general-purpose applications using MIMD (multiple instruction, multiple data) processing. Each core operates separately from the others, executing different instructions for different processes.

Specialized vector capabilities (SSE2 and SSE3) for four-component (single-precision floating-point) and two-component (double-precision) vectors appeared in general-purpose processors primarily because of the increased demands of graphics applications. That is why the use of GPUs is more profitable for certain tasks: they were built for such workloads from the start.

For example, in Nvidia video chips the main unit is a multiprocessor with eight to ten cores and hundreds of ALUs in total, several thousand registers and a small amount of shared memory. In addition, the video card contains fast global memory with access to it by all multiprocessors, local memory in each multiprocessor, and special memory for constants.

Most importantly, these multiple multiprocessor cores in the GPU are SIMD (single instruction stream, multiple data stream) cores. And these cores execute the same instructions simultaneously, a style of programming that is common for graphics algorithms and many scientific tasks, but requires specific programming. But this approach allows you to increase the number of execution units due to their simplification.

So, let's list the main differences between CPU and GPU architectures. CPU cores are designed to execute a single stream of sequential instructions at maximum performance, while GPUs are designed to quickly execute large numbers of parallel instruction streams. General purpose processors are optimized to achieve high performance from a single instruction stream, processing both integer and floating point numbers. In this case, access to memory is random.

CPU developers try to execute as many instructions as possible in parallel to increase performance. To achieve this, starting with Intel Pentium processors, superscalar execution appeared, ensuring the execution of two instructions per clock cycle, and the Pentium Pro was distinguished by out-of-order execution of instructions. But the parallel execution of a sequential stream of instructions has certain basic limitations and increasing the number of execution units cannot achieve a multiple increase in speed.

Video chips have a simple and parallel operation from the start. The video chip takes a group of polygons as an input, carries out all the necessary operations, and produces pixels as an output. Processing of polygons and pixels is independent; they can be processed in parallel, separately from each other. Therefore, due to the inherently parallel organization of work, the GPU uses a large number of execution units, which are easy to load, in contrast to the sequential stream of instructions for the CPU. In addition, modern GPUs can also execute more than one instruction per clock cycle (dual issue). Thus, the Tesla architecture, under certain conditions, launches MAD+MUL or MAD+SFU operations simultaneously.

The GPU also differs from the CPU in its memory access patterns. In the GPU they are coherent and easily predictable: if a texture texel is read from memory, then after a while the time will come for neighboring texels. Writing works the same way: a pixel is written to the framebuffer, and a few clock cycles later the one located next to it will be written. Therefore, the memory organization differs from that used in the CPU. The video chip, unlike universal processors, simply does not need a large cache memory, and textures require only a few (up to 128-256 in current GPUs) kilobytes.

And the work with memory itself is somewhat different for GPUs and CPUs. So, not all central processors have built-in memory controllers, and all GPUs usually have several controllers, up to eight 64-bit channels in the Nvidia GT200 chip. In addition, video cards use faster memory, and as a result, video chips have access to significantly greater memory bandwidth, which is also very important for parallel calculations that operate with huge data streams.

In universal processors, large numbers of transistors and much of the chip area are spent on instruction buffers, hardware branch prediction, and huge amounts of on-chip cache memory. All these hardware blocks are needed to speed up the execution of a few instruction streams. Video chips instead spend transistors on arrays of execution units, flow control units, small shared memories, and memory controllers for several channels. This does not speed up the execution of individual threads; it lets the chip handle several thousand threads executed simultaneously and requiring high memory bandwidth.

About the differences in caching. General purpose CPUs use cache memory to increase performance by reducing memory access latency, while GPUs use cache or shared memory to increase bandwidth. CPUs reduce memory access latency by using large caches and code branch prediction. These hardware pieces take up most of the chip's area and consume a lot of power. Video chips bypass the problem of memory access delays by executing thousands of threads simultaneously - while one of the threads is waiting for data from memory, the video chip can perform calculations by another thread without waiting or delays.

There are also many differences in multithreading support. The CPU executes 1-2 threads of calculations per processor core, and video chips can support up to 1024 threads per multiprocessor, of which there are several on the chip. And if switching from one thread to another costs hundreds of clock cycles for the CPU, then the GPU switches several threads in one clock cycle.

In addition, CPUs use SIMD (single instruction over multiple data) units for vector calculations, and GPUs use SIMT (single instruction multiple threads) for scalar thread processing. SIMT does not require the developer to convert data into vectors, and allows arbitrary branching in streams.

In short, we can say that, unlike modern universal CPUs, video chips are designed for parallel calculations with a large number of arithmetic operations. A significantly larger share of GPU transistors works for its intended purpose, processing data arrays, rather than controlling the execution (flow control) of a few sequential computational threads. This is usually illustrated by a diagram of how much chip area various logic occupies on the CPU and on the GPU.

As a result, the basis for effectively using the power of the GPU in scientific and other non-graphical calculations is the parallelization of algorithms across hundreds of execution units available on video chips. For example, many molecular modeling applications are well suited for calculations on video chips; they require large computing power and are therefore convenient for parallel computing. And the use of multiple GPUs provides even more computing power to solve such problems.

Performing calculations on the GPU shows excellent results in algorithms that use parallel data processing. That is, when the same sequence of mathematical operations is applied to a large amount of data. In this case, the best results are achieved if the ratio of the number of arithmetic instructions to the number of memory accesses is sufficiently large. This places less demands on flow control, and the high density of mathematics and large volume of data eliminates the need for large caches, as on the CPU.

As a result of all the differences described above, the theoretical performance of video chips significantly exceeds that of the CPU. Nvidia likes to cite a graph of CPU and GPU performance growth over the past few years.

Naturally, these figures should be taken with a grain of salt. After all, on a CPU it is much easier in practice to reach the theoretical figures, and the numbers are given for single precision in the case of the GPU and for double precision in the case of the CPU. In any case, single precision is enough for some parallel tasks, and the difference in speed between universal and graphics processors is very large, so the game is worth the candle.

First attempts to use GPU calculations

They have been trying to use video chips in parallel mathematical calculations for quite some time. The very first attempts at such applications were extremely primitive and were limited to the use of certain hardware functions, such as rasterization and Z-buffering. But in the current century, with the advent of shaders, matrix calculations began to be accelerated. In 2003, at SIGGRAPH, a separate section was allocated for GPU computing, and it was called GPGPU (General-Purpose computation on GPU).

The best known is BrookGPU, a compiler for the Brook stream programming language designed to perform non-graphical computations on the GPU. Before its appearance, developers using the capabilities of video chips for calculations chose one of two common APIs: Direct3D or OpenGL. This seriously limited the use of GPUs, because 3D graphics use shaders and textures, which parallel programming specialists are not required to know about; they work with threads and cores. Brook helped make their task easier. These streaming extensions to the C language, developed at Stanford University, hid the 3D API from programmers and presented the video chip as a parallel coprocessor. The compiler processed a .br file with C++ code and extensions, producing code linked to a library with DirectX, OpenGL, or x86 support.

Naturally, Brook had many shortcomings, which we will discuss in more detail later. But even its mere appearance caused a significant surge of attention from Nvidia and ATI to the initiative of computing on GPUs, since the development of these capabilities seriously changed the market going forward, opening up a whole new sector of it: parallel computers based on video chips.

Subsequently, some researchers from the Brook project joined Nvidia's development team to introduce a hardware-software parallel computing strategy and open up a new market segment. The main advantage of this Nvidia initiative is that the developers know all the capabilities of their GPUs down to the last detail, there is no need to use the graphics API, and you can work with the hardware directly through the driver. The result of this team's efforts was Nvidia CUDA (Compute Unified Device Architecture), a new software and hardware architecture for parallel computing on Nvidia GPUs, which is the subject of this article.

Areas of application of parallel calculations on GPU

To understand the benefits of transferring calculations to video chips, here are the average figures obtained by researchers around the world. On average, when transferring calculations to the GPU, many tasks achieve speedups of 5-30 times compared to fast general-purpose processors. The largest numbers (on the order of 100x speedup or even more!) are achieved with code that is not very suitable for calculations using SSE blocks, but is quite convenient for GPUs.

These are just some examples of speedups for synthetic code on the GPU versus SSE vectorized code on the CPU (according to Nvidia):

  • Fluorescence microscopy: 12x;
  • Molecular dynamics (non-bonded force calc): 8-16x;
  • Electrostatics (direct and multilevel Coulomb summation): 40-120x and 7x.

And there is a table that Nvidia is very fond of showing at all its presentations, which we will dwell on in more detail in the second part of the article, dedicated to specific examples of practical applications of CUDA computing.

As you can see, the numbers are very attractive, the 100-150-fold increases are especially impressive. In the next article on CUDA, we will look at some of these numbers in detail. Now let's list the main applications in which GPU computing is currently used: analysis and processing of images and signals, physics simulation, computational mathematics, computational biology, financial calculations, databases, dynamics of gases and liquids, cryptography, adaptive radiation therapy, astronomy, processing sound, bioinformatics, biological simulations, computer vision, data mining, digital cinema and television, electromagnetic simulations, geographic information systems, military applications, mine planning, molecular dynamics, magnetic resonance imaging (MRI), neural networks, oceanographic research, particle physics, protein folding simulation, quantum chemistry, ray tracing, visualization, radar, reservoir simulation, artificial intelligence, satellite data analysis, seismic exploration, surgery, ultrasound, video conferencing.

Details about many of these applications can be found on the Nvidia website. As you can see, the list is quite long, but that's not all! It can be continued, and we can safely assume that in the future other areas of application of parallel calculations on video chips will be found that we are not yet aware of.

Nvidia CUDA Features

CUDA technology is an Nvidia hardware-software computing architecture based on an extension of the C language, which makes it possible to access the instruction set of a graphics accelerator and manage its memory when organizing parallel computations. CUDA helps implement algorithms executable on eighth-generation and later GeForce video accelerators (Geforce 8, Geforce 9, Geforce 200 series), as well as on Quadro and Tesla.

Although the complexity of programming GPUs with CUDA is quite high, it is lower than with earlier GPGPU solutions. Such programs require dividing the application between several multiprocessors, similar to MPI programming, but without partitioning the data, which is stored in common video memory. And since CUDA programming for each multiprocessor is similar to OpenMP programming, it requires a good understanding of the memory organization. But, of course, the complexity of developing for and porting to CUDA greatly depends on the application.

The developer kit contains many code examples and is well documented. The learning process will take about two to four weeks for those already familiar with OpenMP and MPI. The API is based on the extended C language, and to translate code from this language, the CUDA SDK includes the nvcc command line compiler, based on the open source Open64 compiler.

Let's list the main characteristics of CUDA:

  • unified software and hardware solution for parallel computing on Nvidia video chips;
  • a wide range of supported solutions, from mobile to multi-chip;
  • standard C programming language;
  • standard numerical analysis libraries FFT (Fast Fourier Transform) and BLAS (Linear Algebra);
  • optimized data exchange between CPU and GPU;
  • interaction with graphics APIs OpenGL and DirectX;
  • support for 32- and 64-bit operating systems: Windows XP, Windows Vista, Linux and MacOS X;
  • Possibility of development at a low level.

Regarding operating system support, it should be added that all major Linux distributions are officially supported (Red Hat Enterprise Linux 3.x/4.x/5.x, SUSE Linux 10.x), but, judging by data from enthusiasts, CUDA works great on other builds: Fedora Core, Ubuntu, Gentoo, etc.

The CUDA development environment (CUDA Toolkit) includes:

  • nvcc compiler;
  • FFT and BLAS libraries;
  • profiler;
  • gdb debugger for GPU;
  • CUDA runtime driver included with standard Nvidia drivers;
  • programming manual;
  • CUDA Developer SDK (source code, utilities and documentation).

The source code examples include: parallel bitonic sort, matrix transposition, parallel prefix summation of large arrays, image convolution, discrete wavelet transform, examples of interaction with OpenGL and Direct3D, use of the CUBLAS and CUFFT libraries, option price calculation (Black-Scholes formula, binomial model, Monte Carlo method), the Mersenne Twister parallel random number generator, histogram calculation of a large array, noise reduction, and a Sobel filter (edge detection).

Advantages and Limitations of CUDA

From a programmer's perspective, a graphics pipeline is a collection of processing stages. The geometry block generates the triangles, and the rasterization block generates the pixels displayed on the monitor. The traditional GPGPU programming model is built on top of this pipeline.

To transfer calculations to the GPU within this model, a special approach is needed. Even element-wise addition of two vectors requires drawing a figure on the screen or to an off-screen buffer. The figure is rasterized, and the color of each pixel is calculated using a given program (a pixel shader). The program reads the input data from textures for each pixel, adds them, and writes the result to the output buffer. And all these numerous operations are needed for something that is written as a single statement in a regular programming language!

Therefore, the use of GPGPU for general-purpose computing is limited by too high a learning curve for developers. And there are plenty of other restrictions, because a pixel shader is just a formula for the dependence of the final color of a pixel on its coordinates, and the language of pixel shaders is a language for writing these formulas with a C-like syntax. Early GPGPU methods are a neat trick that lets you use the power of the GPU, but without any of the convenience. The data there is represented by images (textures), and the algorithm by the rasterization process. Of particular note is the very specific model of memory and execution.

Nvidia's software and hardware architecture for GPU computing differs from previous GPGPU models in that it allows you to write programs for the GPU in real C with standard syntax, pointers, and a minimum of extensions needed to access the computing resources of video chips. CUDA does not depend on graphics APIs and has features designed specifically for general-purpose computing.

Advantages of CUDA over the traditional approach to GPGPU computing:

  • The CUDA application programming interface is based on the standard C programming language with extensions, which simplifies the process of learning and implementing the CUDA architecture;
  • CUDA provides access to 16 KB of thread-shared memory per multiprocessor, which can be used to organize a cache with higher bandwidth than texture fetches;
  • more efficient data transfer between system and video memory;
  • no need for graphical APIs with redundancy and overhead;
  • linear memory addressing, gather and scatter, ability to write to arbitrary addresses;
  • hardware support for integer and bit operations.

Main limitations of CUDA:

  • lack of recursion support for executable functions;
  • minimum block width of 32 threads;
  • closed CUDA architecture owned by Nvidia.

The weaknesses of programming with previous GPGPU methods are that these methods do not use vertex shader execution units in previous non-unified architectures, data is stored in textures and output to an off-screen buffer, and multi-pass algorithms use pixel shader units. GPGPU limitations can include: insufficient use of hardware capabilities, memory bandwidth limitations, lack of scatter operation (gather only), mandatory use of the graphics API.

The main advantages of CUDA over previous GPGPU methods stem from the fact that the architecture is designed to make efficient use of non-graphics computing on the GPU and uses the C programming language without requiring algorithms to be ported to a graphics pipeline concept-friendly form. CUDA offers a new path to GPU computing that does not use graphics APIs, offering random memory access (scatter or gather). This architecture does not have the disadvantages of GPGPU and uses all execution units, and also expands capabilities due to integer mathematics and bit shift operations.

Additionally, CUDA opens up hardware capabilities not available from graphics APIs, such as shared memory. This is a small memory (16 kilobytes per multiprocessor) that thread blocks have access to. It allows the most frequently used data to be cached and can provide higher speed than texture fetches for this task, which in turn reduces the sensitivity of many parallel algorithms to memory bandwidth. For example, this is useful for linear algebra, the fast Fourier transform, and image processing filters.

Memory access is also more convenient in CUDA. Code using the graphics APIs outputs data as 32 single-precision floating-point values (RGBA values into eight render targets simultaneously) into predefined areas, while CUDA supports scatter writing: an unlimited number of writes to any address. Such advantages make it possible to execute on the GPU some algorithms that cannot be efficiently implemented using GPGPU methods based on graphics APIs.

Also, graphics APIs necessarily store data in textures, which requires preliminary packaging of large arrays into textures, which complicates the algorithm and forces the use of special addressing. And CUDA allows you to read data at any address. Another advantage of CUDA is the optimized data exchange between the CPU and GPU. And for developers who want low-level access (for example, when writing another programming language), CUDA offers low-level assembly language programming capabilities.

History of CUDA development

CUDA development was announced along with the G80 chip in November 2006, and the release of a public beta version of the CUDA SDK took place in February 2007. Version 1.0 was released in June 2007 to launch Tesla solutions based on the G80 chip and intended for the high-performance computing market. Then, at the end of the year, the beta version of CUDA 1.1 was released, which, despite the slight increase in the version number, introduced quite a lot of new things.

What's new in CUDA 1.1 is the inclusion of CUDA functionality in regular Nvidia video drivers. This meant that in the requirements for any CUDA program it was enough to indicate a video card of the Geforce 8 series and higher, as well as a minimum driver version of 169.xx. This is very important for developers; if these conditions are met, CUDA programs will work for any user. Also added were asynchronous execution along with data copying (only for G84, G86, G92 and higher chips), asynchronous data transfer to video memory, atomic memory access operations, support for 64-bit versions of Windows, and the ability to operate multi-chip CUDA in SLI mode.

At the moment, the current version is CUDA 2.0, released along with the Geforce GTX 200 line and intended for solutions based on the GT200; its beta version appeared back in the spring of 2008. The second version added support for double-precision calculations (hardware support only on the GT200), final support for Windows Vista (32- and 64-bit versions) and Mac OS X, debugging and profiling tools, support for 3D textures, and optimized data transfer.

As for double-precision calculations, their speed on the current hardware generation is several times lower than single precision. The reason is architectural: rather than reusing the FP32 blocks to obtain results at a quarter of the speed, Nvidia decided to add dedicated computing units for FP64 support, and in the GT200 there are ten times fewer of them than FP32 blocks (one double-precision unit per multiprocessor).

In reality, performance may be even lower, since the architecture is optimized for 32-bit reads from memory and registers; in addition, double precision is not needed in graphics applications, and in the GT200 it was implemented more so that it would simply be available. Modern quad-core processors show not much lower real performance here. But even being 10 times slower than single precision, such support is useful for mixed-precision schemes. One common technique is to first obtain approximate results in single precision, and then refine them in double precision. Now this can be done directly on the video card, without sending intermediate data to the CPU.

Another useful feature of CUDA 2.0 has nothing to do with the GPU, oddly enough. Simply, you can now compile CUDA code into highly efficient multi-threaded SSE code for fast execution on the CPU. That is, now this feature is suitable not only for debugging, but also for real use on systems without an Nvidia video card. After all, the use of CUDA in regular code is hampered by the fact that Nvidia video cards, although the most popular among dedicated video solutions, are not available in all systems. And before version 2.0, in such cases it would have been necessary to create two different codes: for CUDA and separately for the CPU. And now you can run any CUDA program on a CPU with high efficiency, albeit at a lower speed than on video chips.

Solutions supporting Nvidia CUDA

All CUDA-enabled graphics cards can help accelerate most demanding tasks, from audio and video processing to medicine and scientific research. The only real limitation is that many CUDA programs require a minimum of 256 megabytes of video memory, and this is one of the most important technical characteristics for CUDA applications.

The latest list of CUDA-supporting products can be found on the official Nvidia website. At the time of writing, CUDA calculations were supported by all products of the GeForce 200, GeForce 9 and GeForce 8 series, including mobile products starting with the GeForce 8400M, as well as the GeForce 8100, 8200 and 8300 chipsets. Modern Quadro products and all Teslas also support CUDA: S1070, C1060, C870, D870 and S870.

We especially note that along with the new Geforce GTX 260 and 280 video cards, corresponding solutions for high-performance computing were announced: Tesla C1060 and S1070 (shown in the photo above), which will be available for purchase this fall. They use the same GPU - GT200, in the C1060 there is one, in the S1070 there are four. But, unlike gaming solutions, they use four gigabytes of memory per chip. The only downside is that the memory frequency and bandwidth are lower than those of gaming cards, providing 102 GB/s per chip.

Composition of Nvidia CUDA

CUDA includes two APIs: high-level (CUDA Runtime API) and low-level (CUDA Driver API), although it is impossible to use both at the same time in one program; you must use one or the other. The high-level one works “on top” of the low-level one, all runtime calls are translated into simple instructions processed by the low-level Driver API. But even a “high-level” API assumes knowledge about the design and operation of Nvidia video chips; there is no too high level of abstraction there.

There is another level, even higher - two libraries:

CUBLAS: a CUDA version of BLAS (Basic Linear Algebra Subprograms), designed for computing linear algebra problems and using direct access to GPU resources;

CUFFT: a CUDA version of the Fast Fourier Transform library for calculating the fast Fourier transform, widely used in signal processing. The following transformation types are supported: complex-complex (C2C), real-complex (R2C), and complex-real (C2R).

Let's take a closer look at these libraries. CUBLAS is a set of standard linear algebra algorithms translated to CUDA; currently only a certain subset of basic CUBLAS functions is supported. The library is very easy to use: you create matrix and vector objects in video card memory, fill them with data, call the required CUBLAS functions, and load the results from video memory back to system memory. CUBLAS contains special functions for creating and destroying objects in GPU memory, as well as for reading and writing data to this memory. Supported BLAS functions: levels 1, 2 and 3 for real numbers, and level 1 plus CGEMM for complex numbers. Level 1 is vector-vector operations, level 2 is vector-matrix operations, level 3 is matrix-matrix operations.
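The workflow just described looks roughly like this in code. A minimal sketch assuming the legacy CUBLAS interface from cublas.h of that era (the vector contents and the SAXPY call are merely illustrative; error handling is omitted):

// Legacy CUBLAS usage sketch: create objects in GPU memory, fill them,
// call a BLAS routine, read the result back.
#include <cublas.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;
    float x[1024], y[1024];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    cublasInit();                                     // initialize the library

    float *d_x, *d_y;
    cublasAlloc(n, sizeof(float), (void **)&d_x);     // create vector objects in video memory
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetVector(n, sizeof(float), x, 1, d_x, 1);  // fill them with data
    cublasSetVector(n, sizeof(float), y, 1, d_y, 1);

    cublasSaxpy(n, 2.0f, d_x, 1, d_y, 1);             // level-1 BLAS: y = 2*x + y

    cublasGetVector(n, sizeof(float), d_y, 1, y, 1);  // load results back to system memory
    printf("y[0] = %f\n", y[0]);                      // expected 4.0

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}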

CUFFT is the CUDA version of the fast Fourier transform, widely used and very important in signal analysis, filtering, and so on. CUFFT provides a simple interface for efficiently computing FFTs on Nvidia GPUs without the need to develop your own GPU FFT. The CUDA variant of FFT supports 1D, 2D and 3D transforms of complex and real data, as well as batch execution of multiple 1D transforms in parallel; the sizes of 2D and 3D transforms are limited to a certain range, while 1D transforms of up to 8 million elements are supported.
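A minimal sketch of a single 1D complex-to-complex transform through CUFFT (the function name and data preparation are illustrative; error checks are omitted):

// In-place forward FFT of n complex samples already prepared on the host.
#include <cufft.h>
#include <cuda_runtime.h>

void forward_fft(cufftComplex *h_signal, int n)
{
    cufftComplex *d_signal;
    cudaMalloc((void **)&d_signal, sizeof(cufftComplex) * n);
    cudaMemcpy(d_signal, h_signal, sizeof(cufftComplex) * n, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);                    // one 1D C2C transform
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT

    cudaMemcpy(h_signal, d_signal, sizeof(cufftComplex) * n, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d_signal);
}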

Basics of creating programs on CUDA

To understand the further text, you should know the basic architectural features of Nvidia video chips. The GPU consists of several texture processing clusters (Texture Processing Cluster). Each cluster consists of a large texture fetch block and two or three streaming multiprocessors, each of which in turn consists of eight computing units and two super function units (SFU). All instructions are executed according to the SIMD principle: one instruction is applied to all threads in a warp (a term from the textile industry; in CUDA it is a group of 32 threads, the minimum unit of work scheduled on the multiprocessors). This method of execution is called SIMT (single instruction, multiple threads).

Each multiprocessor has certain resources. There is a special shared memory of 16 kilobytes per multiprocessor. But it is not a cache, since the programmer can use it for any need, like the Local Store in the SPUs of Cell processors. This shared memory allows information to be exchanged between threads of the same block. It is important that all threads of one block are always executed by the same multiprocessor. Threads from different blocks cannot exchange data, and this limitation must be kept in mind. Shared memory access is fast, except when several threads access the same memory bank. Multiprocessors can also access video memory, but with higher latencies and worse throughput. To speed up access and reduce the frequency of video memory accesses, each multiprocessor has 8 kilobytes of cache for constants and texture data.

Each multiprocessor has 8192 to 16384 registers (for G8x/G9x and GT2xx, respectively), shared among all threads of all blocks executed on it. The maximum number of blocks per multiprocessor for G8x/G9x is eight, and the number of warps is 24 (768 threads per multiprocessor). In total, top-end video cards of the Geforce 8 and 9 series can process up to 12288 threads at a time. The GeForce GTX 280, based on the GT200, offers up to 1024 threads per multiprocessor; it has 10 clusters of three multiprocessors each, processing up to 30720 threads. Knowing these limitations lets you optimize algorithms for the available resources.

The first step when migrating an existing application to CUDA is to profile it and identify areas of code that are bottlenecks that are slowing down the work. If among such sections there are those suitable for fast parallel execution, these functions are transferred to CUDA C extensions for execution on the GPU. The program is compiled using an Nvidia-supplied compiler that generates code for both the CPU and GPU. When a program is executed, the central processor executes its portions of the code, and the GPU executes the CUDA code with the most heavily parallel calculations. This part dedicated to the GPU is called the kernel. The kernel defines the operations that will be performed on the data.

The video chip receives the kernel and creates copies of it for each data element. These copies are called threads. A thread contains an instruction counter, registers, and state. For large volumes of data, such as image processing, millions of threads are launched. Threads are executed in groups of 32 called warps. Warps are assigned for execution to specific streaming multiprocessors. Each multiprocessor consists of eight cores (stream processors) that execute one MAD instruction per clock cycle. Executing one 32-thread warp takes four clock cycles of the multiprocessor (we are talking about the shader domain frequency, which is 1.5 GHz and higher).
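To make the kernel concept concrete, here is a minimal sketch (the kernel name and the operation it performs are illustrative): each thread computes its global index from the built-in threadIdx, blockIdx and blockDim variables and processes exactly one array element.

// Each copy of the kernel (thread) handles one array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against the last, partially filled block
        data[i] *= factor;
}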

The multiprocessor is not a traditional multi-core processor; it is a heavily multithreaded one, supporting up to 32 warps at a time. Each clock cycle, the hardware selects which warp to execute and switches from one to another without losing any cycles. If we draw an analogy with a central processor, this is similar to the simultaneous execution of 32 programs with a switch between them every clock cycle and no context-switch penalty. Real CPU cores support the simultaneous execution of only one or two threads and switch to others with a delay of hundreds of cycles.

CUDA programming model

Let us repeat that CUDA uses a parallel computing model, when each of the SIMD processors executes the same instruction on different data elements in parallel. The GPU is a computing device, a coprocessor (device) for the central processor (host), which has its own memory and processes a large number of threads in parallel. The kernel is a function for the GPU, executed by threads (an analogy from 3D graphics - a shader).

We said above that a video chip differs from a CPU in that it can process tens of thousands of threads simultaneously, which is typical for graphics, which parallelizes well. Each thread is scalar and does not require packing data into 4-component vectors, which is more convenient for most tasks. The number of logical threads and thread blocks exceeds the number of physical execution devices, which provides good scalability across the company's entire range of solutions.

The CUDA programming model involves grouping threads. Threads are organized into thread blocks, which communicate with each other through shared memory and synchronization points, and the blocks in turn form a grid. The program (kernel) runs over a grid of thread blocks; one grid is executed at a time. Each block can be one-, two- or three-dimensional in shape and can consist of up to 512 threads on current hardware.

Within a block, threads are executed in groups of 32 called warps; this is the minimum amount of data that the multiprocessors can process. And since this is not always convenient, CUDA allows you to work with blocks containing from 64 to 512 threads.

Grouping blocks into grids allows you to get away from these restrictions and apply the kernel to more threads in a single call. This also helps with scaling. If the GPU does not have enough resources, it will execute blocks sequentially; otherwise blocks can be executed in parallel, which is important for the optimal distribution of work across video chips of different levels, from mobile and integrated ones upward.
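As an illustration of how blocks and grids are chosen, here is a sketch of a two-dimensional launch configuration for an image of numCols by numRows pixels (the kernel and the 16x16 block size are merely illustrative):

// Invert an 8-bit grayscale image: one thread per pixel, a 2D grid of 2D blocks.
__global__ void invert(unsigned char *img, int numRows, int numCols)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < numCols && y < numRows)
        img[y * numCols + x] = 255 - img[y * numCols + x];
}

void launch_invert(unsigned char *d_img, int numRows, int numCols)
{
    dim3 block(16, 16, 1);                            // 256 threads per block
    dim3 grid((numCols + block.x - 1) / block.x,      // enough blocks to cover the whole image
              (numRows + block.y - 1) / block.y, 1);
    invert<<<grid, block>>>(d_img, numRows, numCols); // one grid of blocks per kernel call
}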

CUDA memory model

The memory model in CUDA is distinguished by byte addressing and support for both gather and scatter. A fairly large number of registers is available per stream processor: up to 1024. Access to them is very fast, and they can store 32-bit integers or floating-point numbers.

Each thread has access to the following memory types:

Global memory: the largest amount of memory, available to all multiprocessors on the video chip; its size ranges from 256 megabytes to 1.5 gigabytes on current solutions (and up to 4 GB on Tesla). It has high bandwidth, more than 100 gigabytes/s for top Nvidia solutions, but very high latencies of several hundred cycles. It is not cached; it supports generic load and store instructions and ordinary memory pointers.

Local memory: a small amount of memory that only a single thread has access to. It is relatively slow, just like global memory.

Shared memory: a 16-kilobyte (on video chips of the current architecture) memory block shared by all stream processors in a multiprocessor. This memory is very fast, like registers. It allows threads to communicate, is directly controlled by the developer, and has low latency. Advantages of shared memory: use as a programmer-controlled first-level cache, reduced latency for the execution units (ALUs) when accessing data, and fewer accesses to global memory.

Constant memory: a 64-kilobyte memory area (the same for current GPUs), read-only for all multiprocessors. It is cached at 8 kilobytes per multiprocessor. It is quite slow: a delay of several hundred cycles when the needed data is not in the cache.

Texture memory: a memory block that can be read by all multiprocessors. Data is fetched through the texture blocks of the video chip, so linear interpolation of the data is possible at no extra cost. 8 kilobytes are cached per multiprocessor. It is as slow as global memory: hundreds of cycles of latency when the data is not in the cache.

Naturally, global, local, texture and constant memory are physically the same memory, namely the local video memory of the video card. Their differences lie in the caching algorithms and access models. The CPU can update and read only global, constant, and texture memory.
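To show how these memory types appear in CUDA C, here is an illustrative set of declarations (the names are hypothetical; the texture would still need to be bound with cudaBindTexture before launch, and the kernel assumes blocks of no more than 256 threads):

// Where data lives is expressed through qualifiers on declarations.
__device__   float g_data[256];       // global memory: visible to all multiprocessors, not cached
__constant__ float c_coeffs[16];      // constant memory: read-only on the GPU, cached per multiprocessor
texture<float, 1> t_input;            // texture memory: read through texture fetches, cached

__global__ void memory_demo(float *out)
{
    __shared__ float tile[256];       // shared memory: one copy per thread block, as fast as registers
    int i = threadIdx.x;              // local variables normally live in registers
    tile[i] = tex1Dfetch(t_input, i) * c_coeffs[0] + g_data[i];
    __syncthreads();                  // make the shared data visible to the whole block
    out[i] = tile[i];
}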

From what was written above, it is clear that CUDA assumes a special approach to development, not quite the same as that adopted in programs for the CPU. You need to keep the different types of memory in mind: local and global memory are not cached, and the latency of accessing them is much higher than that of register memory, since they are physically located in separate memory chips.

A typical, though not obligatory, problem-solving pattern (a code sketch follows the list):

  • the task is divided into subtasks;
  • input data is divided into blocks that fit into shared memory;
  • each block is processed by a block of threads;
  • the subblock is loaded into shared memory from global memory;
  • appropriate calculations are carried out on the data in shared memory;
  • the results are copied from shared memory back to global memory.
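
A minimal sketch of this pattern (the names, tile size and the arithmetic performed are illustrative): each block stages its sub-block of the input in shared memory, computes on it, and writes the result back to global memory.

#define TILE 256

// One thread block processes one TILE-sized sub-block of the input array.
__global__ void process_blocks(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];                            // load the sub-block from global to shared memory
    __syncthreads();

    if (i < n)
        tile[threadIdx.x] = tile[threadIdx.x] * 2.0f + 1.0f;  // compute on the data in shared memory
    __syncthreads();

    if (i < n)
        out[i] = tile[threadIdx.x];                           // copy the result back to global memory
}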

Programming environment

CUDA includes runtime libraries:

  • a common part that provides built-in vector types and subsets of RTL calls supported on CPU and GPU;
  • CPU component for controlling one or more GPUs;
  • A GPU component that provides GPU-specific functionality.

The main process of a CUDA application runs on the universal processor (host); it launches multiple copies of kernel processes on the video card. The code for the CPU does the following: initializes the GPU, allocates memory on the video card and in the system, copies constants to the video card memory, launches several copies of kernel processes on the video card, copies the result from video memory, frees the memory, and exits.

As an example for understanding, here is the CPU code for vector addition presented in CUDA:
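The original listing did not survive the transfer of the article, so here is a reconstruction in the same spirit rather than the author's code: the host allocates memory on the video card, copies the input data, launches copies of the kernel, copies the result back and frees the memory.

// Element-wise addition of two vectors: the kernel plus the controlling CPU code.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void vector_add_on_gpu(const float *a, const float *b, float *c, int n)
{
    size_t size = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);                     // allocate memory on the video card
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);    // copy input data to video memory
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);       // launch copies of the kernel

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);    // copy the result from video memory
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);         // free the memory
}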

The functions executed by the video chip have the following limitations: there is no recursion, no static variables inside the functions and no variable number of arguments. Two types of memory management are supported: linear memory with access via 32-bit pointers, and CUDA arrays with access only through texture fetch functions.

CUDA programs can interact with graphics APIs: to render data generated in the program, to read rendering results and process them using CUDA tools (for example, when implementing post-processing filters). To do this, graphics API resources can be mapped (obtaining the resource address) into the CUDA global memory space. The following graphics API resource types are supported: buffer objects (PBO/VBO) in OpenGL, and vertex buffers and textures (2D, 3D and cube maps) in Direct3D 9.
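A sketch of the OpenGL case using the interop functions of that era from cuda_gl_interop.h (the kernel and buffer contents are illustrative; the buffer object is assumed to have been created beforehand with the usual glGenBuffers/glBufferData calls):

#include <GL/gl.h>
#include <cuda_gl_interop.h>

// A hypothetical kernel that writes vertex positions into the mapped buffer.
__global__ void fill_vertices(float4 *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = make_float4(i * 0.01f, 0.0f, 0.0f, 1.0f);
}

void update_vbo_with_cuda(GLuint vbo, int n)
{
    cudaGLRegisterBufferObject(vbo);                  // make the buffer visible to CUDA
    float4 *d_ptr;
    cudaGLMapBufferObject((void **)&d_ptr, vbo);      // obtain a device pointer to it
    fill_vertices<<<(n + 255) / 256, 256>>>(d_ptr, n);
    cudaGLUnmapBufferObject(vbo);                     // hand the buffer back to OpenGL for rendering
    cudaGLUnregisterBufferObject(vbo);
}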

Stages of compiling a CUDA application:

CUDA C source code files are compiled using the NVCC program, which is a wrapper over other tools and calls them: cudacc, g++, cl, etc. NVCC generates: CPU code, which is compiled along with the rest of the application written in pure C, and PTX object code for the video chip. Executable files with CUDA code necessarily require the CUDA runtime library (cudart) and CUDA core library (cuda).
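In practice the build boils down to something like this (a sketch with illustrative file names; only standard nvcc and g++ options are used):

nvcc -O2 -c kernels.cu -o kernels.o            # host + device code from the .cu file
g++  -O2 -c main.cpp   -o main.o               # ordinary host code
g++  main.o kernels.o -o app -lcudart -lcuda   # link against the CUDA runtime and core libraries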

Optimization of programs on CUDA

Naturally, it is impossible to consider serious optimization issues in CUDA programming within the framework of a review article. Therefore, we’ll just briefly talk about the basic things. To effectively use the capabilities of CUDA, you need to forget about the usual methods of writing programs for the CPU, and use those algorithms that are well parallelized across thousands of threads. It is also important to find the optimal place for storing data (registers, shared memory, etc.), minimize data transfer between the CPU and GPU, and use buffering.

In general, when optimizing a CUDA program, you should try to achieve the optimal balance between size and number of blocks. More threads per block will reduce the impact of memory latency, but will also reduce the number of registers available. In addition, a block of 512 threads is inefficient; Nvidia itself recommends using blocks of 128 or 256 threads as a compromise to achieve optimal latencies and the number of registers.

Among the main points of optimization of CUDA programs: the most active use of shared memory, since it is much faster than the global video memory of the video card; Reads and writes from global memory should be coalesced whenever possible. To do this, you need to use special data types to read and write 32/64/128 bits of data at once in one operation. If reads are difficult to combine, you can try using texture fetches.
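For example, reading and writing 128 bits per thread through the built-in float4 type (a sketch; the kernel itself is illustrative):

// Each thread reads and writes 128 bits in a single transaction; consecutive threads
// touch consecutive float4 elements, so the accesses coalesce.
__global__ void scale4(const float4 *in, float4 *out, float k, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];
        v.x *= k; v.y *= k; v.z *= k; v.w *= k;
        out[i] = v;
    }
}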

Conclusions

The software and hardware architecture presented by Nvidia for calculations on CUDA video chips is well suited for solving a wide range of highly parallel tasks. CUDA runs on a wide range of Nvidia GPUs, and improves the GPU programming model by greatly simplifying it and adding a large number of features such as shared memory, thread synchronization, double precision calculations and integer operations.

CUDA is a technology available to every software developer; it can be used by any programmer who knows the C language. You just have to get used to the different programming paradigm inherent in parallel computing. But if the algorithm is, in principle, well parallelized, then the study and time spent on programming in CUDA will return many times over.

It is likely that due to the widespread use of video cards in the world, the development of parallel computing on GPUs will greatly impact the high-performance computing industry. These possibilities have already aroused great interest in scientific circles, and not only in them. After all, potential opportunities to accelerate algorithms that can be easily parallelized (on available hardware, which is no less important) by tens of times are not so common.

General-purpose processors develop rather slowly and do not show such leaps in performance. Essentially, even if it sounds grandiose, anyone in need of fast computing can now have an inexpensive personal supercomputer on their desk, sometimes without even investing extra money, since Nvidia video cards are so widespread. Not to mention the gains in efficiency in terms of GFLOPS/$ and GFLOPS/W that GPU manufacturers love so much.

The future of many computing clearly lies in parallel algorithms, and almost all new solutions and initiatives are directed in this direction. So far, however, the development of new paradigms is at the initial stage; you have to manually create threads and schedule memory access, which complicates the tasks compared to conventional programming. But CUDA technology has taken a step in the right direction and it clearly looks like a successful solution, especially if Nvidia manages to convince as many developers as possible of its benefits and prospects.

But, of course, GPUs will not replace CPUs; in their current form they are not intended for that. While video chips are gradually moving toward the CPU, becoming more and more universal (single- and double-precision floating-point calculations, integer calculations), CPUs are becoming more and more "parallel", acquiring many cores and multi-threading technologies, not to mention the appearance of SIMD blocks and heterogeneous processor projects. Most likely, the GPU and CPU will simply merge in the future. It is known that many companies, including Intel and AMD, are working on such projects. And it does not matter whether GPUs are absorbed by CPUs, or vice versa.

In the article, we mainly talked about the benefits of CUDA. But there is also a fly in the ointment. One of the few disadvantages of CUDA is its poor portability. This architecture only works on video chips from this company, and not on all of them, but starting with the Geforce 8 and 9 series and the corresponding Quadro and Tesla. Yes, there are a lot of such solutions in the world; Nvidia cites a figure of 90 million CUDA-compatible video chips. This is just great, but competitors offer their own solutions that are different from CUDA. So, AMD has Stream Computing, Intel will have Ct in the future.

Which technology will win, become widespread and outlive the others, only time will tell. But CUDA has a good chance, since compared to Stream Computing, for example, it offers a more mature and easy-to-use programming environment based on the regular C language. Perhaps a third party will help settle the matter by offering some kind of universal solution. For example, in the next DirectX update, version 11, Microsoft has promised compute shaders, which could become an average solution of sorts that suits everyone, or almost everyone.

Judging by preliminary data, this new type of shader borrows a lot from the CUDA model. And by programming in this environment now, you can get immediate benefits and the necessary skills for the future. From a high-performance computing perspective, DirectX also has the distinct disadvantage of poor portability, as the API is limited to the Windows platform. However, another standard is being developed - the open multi-platform initiative OpenCL, which is supported by most companies, including Nvidia, AMD, Intel, IBM and many others.

Don't forget that the next article on CUDA will explore specific practical applications of scientific and other non-graphical computing performed by developers from different parts of our planet using Nvidia CUDA.