Parallel Computing using the GPU – Tutorial 2, Functions

Now that we have our first application running, it's time to write more specific CUDA applications. In this tutorial, we will see how functions work, and how to decide whether a function should run on the CPU (the host) or the GPU (the device).

Running a function on the host
To run a function on the host, we simply do what we usually do: create a function and call it from anywhere in our program.

Let's try this out. First of all, start your favorite text editor and type the following code:

#include <stdio.h>

void hostFunction()
{
    printf( "writing from hostFunction()!\n" );
}

int main( void )
{
    printf( "Starting application!\n" );
    hostFunction();
    return 0;
}

Now, save the file as "fHost.cu" and compile it as we did in the previous tutorial by typing the following command:
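Assuming nvcc is on your path, as set up in the previous tutorial, the command looks something like this:

nvcc fHost.cu -o fHost.exe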

The application will now compile and create an EXE file named fHost.exe. If you run the example, you should see something like the following (hopefully expected) output:
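Starting application!
writing from hostFunction()!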

This application simply calls the function hostFunction(), does what it should do, and exits.

Running a function on the device
A function that will run on the device is often called a "kernel". A kernel has some limits on what you are allowed to do in it, like calling host functions (so printf is not allowed here). Let's take a look at an example:

#include <stdio.h>

__global__ void kernelFunction()
{
}

int main( void )
{
    printf( "Starting application!\n" );
    kernelFunction<<<1,1>>>();
    return 0;
}

Now, this is more interesting! First of all, you will probably notice the strange-looking __global__ qualifier. It's not very hard: all it does is say "Hey, this function will run on the device". Basically, main() will run on the host, and kernelFunction() will run on the device.

The next thing you will notice is that the call to kernelFunction() doesn't look quite normal. The angle brackets with the two parameters influence how the device will run and handle this function, but we will cover this more closely soon.

Now, compile this code and run it. Congratulations, you've just made your first kernel call! Let's make this more advanced.

Passing parameters to a kernel
To make it more interesting, it's time to make the kernel DO something, like multiplying x with y. First of all, a kernel cannot return a value using the "return" keyword. You will need to store the result in memory. But wait, what memory? The device has its own memory on the GPU, and the host is using another memory in a galaxy far, far away! How do we solve this problem? Luckily, CUDA has some helper functions for this. Let's just look at the code and work from that.

#include <stdio.h>

__global__ void kernelFunction( int x, int y, int *r)
{
    *r=x*y;
}

int main( void )
{
    printf( "Starting application!\n" );
    int result;
    int *device_result;
    cudaMalloc((void**)&device_result, sizeof(int));
    kernelFunction<<<1,1>>>(5,4, device_result);
    cudaMemcpy( &result,
                device_result,
                sizeof(int),
                cudaMemcpyDeviceToHost );
    printf( "5 * 4 = %d", result );
    cudaFree(device_result);
    return 0;
}

Diving into the code
The kernelFunction should not look very strange: all it does is multiply x with y and store the result in r, a memory location on the GPU.

But we are introduced to three new functions: cudaMalloc, cudaMemcpy and cudaFree. These functions are used to allocate memory on the device, and to copy data to/from the host or another device location.

cudaMalloc((void**)&device_result, sizeof(int));
This works much like the malloc we are used to, except that it allocates space for an integer, device_result, in device memory.
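Like most CUDA runtime calls, cudaMalloc returns an error code of type cudaError_t. If you want to be careful, you can check it; here is a minimal sketch of the same allocation with error checking (not required for this tutorial):

cudaError_t err = cudaMalloc( (void**)&device_result, sizeof(int) );
if( err != cudaSuccess )
{
    // the allocation failed; print the reason and bail out
    printf( "cudaMalloc failed: %s\n", cudaGetErrorString( err ) );
    return 1;
}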

cudaMemcpy( &result,device_result,sizeof(int),cudaMemcpyDeviceToHost );
This copies the content of device_result from device memory and stores it in result. The last parameter, cudaMemcpyDeviceToHost, tells the function to copy the content of a memory location on the device to a memory location on the host.
You can also do the reverse and copy data from the host to the device by using cudaMemcpyHostToDevice instead of cudaMemcpyDeviceToHost. Likewise, you can copy data from one location on the device to another with cudaMemcpyDeviceToDevice. If you want to copy data from one location on the host to another location on the host, just use the normal memcpy function.
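As a small illustration (the variable names here are just examples, not part of the program above), copying a value from the host to the device looks like this:

int host_value = 5;
int *device_value;
cudaMalloc( (void**)&device_value, sizeof(int) );
cudaMemcpy( device_value,    // destination: memory on the device
            &host_value,     // source: memory on the host
            sizeof(int),
            cudaMemcpyHostToDevice );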

It's important not to mix these up, as the compiler won't catch the mistake, and that can make debugging really hard!

cudaFree(device_result);

The last call simply frees the allocated memory on the device; remember to do this!

Moving on: this application allocates an integer on the device and calls the kernel, which stores the result of x*y in that memory. Once the kernel is done, we copy the content from the device to the host and print it.

Now, compile and run this application. It will multiply 5 with 4, store the result, 20, on the device, copy it to the host, and the host will print this out.
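The output should look something like this:

Starting application!
5 * 4 = 20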

That's it, you have now done your first calculation on the GPU! Wasn't too hard, was it?

See you in the next tutorial!


3 Responses to Parallel Computing using the GPU – Tutorial 2, Functions

  1. ronaldpan says:

    I got the wrong output: 5*4=4274747

  2. digitalerr0r says:

    It looks like you haven't updated your GPU drivers, or that your GPU doesn't support CUDA. Try updating your drivers and see if it helps.

  3. By writing __global__ we told it that it will run on a device, but where did we tell it that the device is the GPU?
