In this post I’m going to talk about my recent forays into OpenCL development on my laptop. I’ll end up showing how to put together a parallel reduction sum kernel in OpenCL.

My laptop has Intel integrated graphics, specifically the Intel® HD Graphics 520 (Skylake GT2). (In the case of integrated graphics, I’m not exactly sure how to refer to it, as it is not a discrete graphics card like you might find in a desktop workstation. Nonetheless, for the purposes of this post, I’m going to refer to my laptop’s graphics “card”.) I’ve always wondered whether I could use it for general purpose GPU programming. It turns out that I can, with a little messing around.

Given that my card isn’t made by Nvidia, I can’t use CUDA. However, most contemporary Intel integrated graphics cards do support OpenCL. OpenCL is an open standard that fills a similar role to CUDA, but it supports “heterogeneous hardware”: CPUs, GPUs, and other accelerators. This means that in principle, I don’t even need to have a graphics card to run OpenCL applications.

In the past, I’ve tried to get my Intel card to play with OpenCL, but I couldn’t figure it out. That was sometime last year when I was running Fedora. Now that I’m running Ubuntu, installing OpenCL drivers for my graphics card is as simple as apt-get install intel-opencl-icd (taken from here).
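To sanity check the install, a minimal C++ program that asks OpenCL for a platform and a device works well. This is my own sketch (error checking omitted for brevity); if the driver is set up, it should print the name of the Intel device:

// list_devices.cpp — compile with: g++ list_devices.cpp -lOpenCL
#include <cstdio>

#define CL_TARGET_OPENCL_VERSION 220
#include <CL/cl.h>

int main () {
  cl_platform_id platform_id = NULL;
  cl_uint ret_num_platforms = 0;
  clGetPlatformIDs(1, &platform_id, &ret_num_platforms);

  cl_device_id device_id = NULL;
  cl_uint ret_num_devices = 0;
  clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 1, &device_id, &ret_num_devices);

  // Print the name of the first device on the first platform.
  char name[256];
  clGetDeviceInfo(device_id, CL_DEVICE_NAME, sizeof(name), name, NULL);
  printf("platforms: %u, first device: %s\n", ret_num_platforms, name);
  return 0;
}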

OpenCL applications require far more boilerplate than CUDA applications. Managing memory and calling kernels is relatively transparent in CUDA, while OpenCL requires a much more “hands on” approach. Just to give you some idea, here’s an example of a simple CUDA program that adds the elements of two arrays together, saving the result in a third:

// add_vectors.cu
#include <cstdio>
#include <cstdlib>
#include <vector>

#include <cuda_runtime.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}


__global__ void add_vectors (
  const float* vec0,
  const float* vec1,
  float* res,
  const int size
)
{
  const int idx = threadIdx.x + blockIdx.x*blockDim.x;
  if (idx >= size) {
    return;
  }
  res[idx] = vec0[idx] + vec1[idx];
}


int main () {
  int size = 1024;
  const int max_thread_size = 1024;

  std::vector<float> vec0_h (size, 2.0);
  std::vector<float> vec1_h (size, 2.0);
  std::vector<float> res_h (size);

  float* vec0;
  float* vec1;
  float* res;

  gpuErrchk(cudaMalloc((void**) &vec0, sizeof(float)*size));
  gpuErrchk(cudaMemcpy(vec0, vec0_h.data(), sizeof(float)*size, cudaMemcpyHostToDevice));
  gpuErrchk(cudaMalloc((void**) &vec1, sizeof(float)*size));
  gpuErrchk(cudaMemcpy(vec1, vec1_h.data(), sizeof(float)*size, cudaMemcpyHostToDevice));
  gpuErrchk(cudaMalloc((void**) &res, sizeof(float)*size));

  int grid_size = 1;
  int block_size = size;

  if (block_size > max_thread_size) {
    block_size = max_thread_size;
    grid_size = (size + max_thread_size - 1) / max_thread_size; // round up for sizes that aren't a multiple
  }

  add_vectors<<<grid_size, block_size>>>(vec0, vec1, res, size);
  gpuErrchk(cudaGetLastError());

  gpuErrchk(cudaMemcpy(res_h.data(), res, sizeof(float)*size, cudaMemcpyDeviceToHost));

  gpuErrchk(cudaFree(vec0));
  gpuErrchk(cudaFree(vec1));
  gpuErrchk(cudaFree(res));
}

Pretty straightforward. Allocate some device memory, copy our host-allocated memory to it, call the kernel, and then copy the results out of device memory. Now let’s see the equivalent OpenCL program:

// add_vectors.cpp
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
#include <sstream>
#include <fstream>
#include <vector>

#define CL_TARGET_OPENCL_VERSION 220
#include <CL/cl.h>

std::string getErrorString(cl_int error)
{
  switch(error){
    // run-time and JIT compiler errors
    case 0: return "CL_SUCCESS";
    case -1: return "CL_DEVICE_NOT_FOUND";
    case -2: return "CL_DEVICE_NOT_AVAILABLE";
    case -3: return "CL_COMPILER_NOT_AVAILABLE";
    case -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5: return "CL_OUT_OF_RESOURCES";
    case -6: return "CL_OUT_OF_HOST_MEMORY";
    case -7: return "CL_PROFILING_INFO_NOT_AVAILABLE";
    case -8: return "CL_MEM_COPY_OVERLAP";
    case -9: return "CL_IMAGE_FORMAT_MISMATCH";
    case -10: return "CL_IMAGE_FORMAT_NOT_SUPPORTED";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -12: return "CL_MAP_FAILURE";
    case -13: return "CL_MISALIGNED_SUB_BUFFER_OFFSET";
    case -14: return "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST";
    case -15: return "CL_COMPILE_PROGRAM_FAILURE";
    case -16: return "CL_LINKER_NOT_AVAILABLE";
    case -17: return "CL_LINK_PROGRAM_FAILURE";
    case -18: return "CL_DEVICE_PARTITION_FAILED";
    case -19: return "CL_KERNEL_ARG_INFO_NOT_AVAILABLE";

    // compile-time errors
    case -30: return "CL_INVALID_VALUE";
    case -31: return "CL_INVALID_DEVICE_TYPE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    case -35: return "CL_INVALID_QUEUE_PROPERTIES";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -37: return "CL_INVALID_HOST_PTR";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -39: return "CL_INVALID_IMAGE_FORMAT_DESCRIPTOR";
    case -40: return "CL_INVALID_IMAGE_SIZE";
    case -41: return "CL_INVALID_SAMPLER";
    case -42: return "CL_INVALID_BINARY";
    case -43: return "CL_INVALID_BUILD_OPTIONS";
    case -44: return "CL_INVALID_PROGRAM";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    case -46: return "CL_INVALID_KERNEL_NAME";
    case -47: return "CL_INVALID_KERNEL_DEFINITION";
    case -48: return "CL_INVALID_KERNEL";
    case -49: return "CL_INVALID_ARG_INDEX";
    case -50: return "CL_INVALID_ARG_VALUE";
    case -51: return "CL_INVALID_ARG_SIZE";
    case -52: return "CL_INVALID_KERNEL_ARGS";
    case -53: return "CL_INVALID_WORK_DIMENSION";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    case -55: return "CL_INVALID_WORK_ITEM_SIZE";
    case -56: return "CL_INVALID_GLOBAL_OFFSET";
    case -57: return "CL_INVALID_EVENT_WAIT_LIST";
    case -58: return "CL_INVALID_EVENT";
    case -59: return "CL_INVALID_OPERATION";
    case -60: return "CL_INVALID_GL_OBJECT";
    case -61: return "CL_INVALID_BUFFER_SIZE";
    case -62: return "CL_INVALID_MIP_LEVEL";
    case -63: return "CL_INVALID_GLOBAL_WORK_SIZE";
    case -64: return "CL_INVALID_PROPERTY";
    case -65: return "CL_INVALID_IMAGE_DESCRIPTOR";
    case -66: return "CL_INVALID_COMPILER_OPTIONS";
    case -67: return "CL_INVALID_LINKER_OPTIONS";
    case -68: return "CL_INVALID_DEVICE_PARTITION_COUNT";

    // extension errors
    case -1000: return "CL_INVALID_GL_SHAREGROUP_REFERENCE_KHR";
    case -1001: return "CL_PLATFORM_NOT_FOUND_KHR";
    case -1002: return "CL_INVALID_D3D10_DEVICE_KHR";
    case -1003: return "CL_INVALID_D3D10_RESOURCE_KHR";
    case -1004: return "CL_D3D10_RESOURCE_ALREADY_ACQUIRED_KHR";
    case -1005: return "CL_D3D10_RESOURCE_NOT_ACQUIRED_KHR";
    default: return "Unknown OpenCL error";
    }
}

#define clErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cl_int error, const char *file, int line, bool abort=true) {
   if (error != CL_SUCCESS) {
    fprintf(stderr, "GPUassert: %s %s %d\n", getErrorString(error).c_str(), file, line);
    if (abort) {
      exit(error);
    }
  }
}


int main(int argc, char** argv) {
  int size = 1024;

  std::vector<float> vec0_h(size, 1.0);
  std::vector<float> vec1_h(size, 2.0);
  std::vector<float> res_h(size);

  std::string source_str = R"==(
  __kernel void add_vectors(
    __global const float *vec0,
    __global const float *vec1,
    __global float *res,
    const int size
  )
  {
    int idx = get_global_id(0);
    if (idx >= size) {
      return;
    }
    res[idx] = vec0[idx] + vec1[idx];
  })==";


  // Get platform and device information
  cl_platform_id platform_id = NULL;
  cl_device_id device_id = NULL;
  cl_uint ret_num_devices;
  cl_uint ret_num_platforms;
  cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
  clErrchk(ret);
  ret = clGetDeviceIDs(
    platform_id, CL_DEVICE_TYPE_DEFAULT, 1,
    &device_id, &ret_num_devices
  );
  clErrchk(ret);
  // Create an OpenCL context
  cl_context context = clCreateContext(NULL, 1, &device_id, NULL, NULL, &ret);
  clErrchk(ret);

  // Create a command queue
  cl_command_queue command_queue = clCreateCommandQueueWithProperties(context, device_id, 0, &ret);
  clErrchk(ret);

  // Create a program from the kernel source
  const char* source_c_str = source_str.c_str();
  const size_t source_size = source_str.size();
  cl_program program = clCreateProgramWithSource(
    context, 1,
    (const char **) &source_c_str,
    (const size_t *) &source_size,
    &ret
  );
  clErrchk(ret);

  // Create memory buffers on the device for each vector
  cl_mem vec0_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY, size*sizeof(float), NULL, &ret);
  clErrchk(ret);
  cl_mem vec1_mem_obj = clCreateBuffer(context, CL_MEM_READ_ONLY, size*sizeof(float), NULL, &ret);
  clErrchk(ret);
  cl_mem res_mem_obj = clCreateBuffer(context, CL_MEM_WRITE_ONLY, size*sizeof(float), NULL, &ret);
  clErrchk(ret);

  // Copy vec0_h and vec1_h to their respective device buffers
  clErrchk(
    clEnqueueWriteBuffer(
      command_queue, vec0_mem_obj, CL_TRUE, 0,
      size*sizeof(float), vec0_h.data(), 0, NULL, NULL));
  clErrchk(
    clEnqueueWriteBuffer(
      command_queue, vec1_mem_obj, CL_TRUE, 0,
      size*sizeof(float), vec1_h.data(), 0, NULL, NULL));

  // Build the program
  clErrchk(clBuildProgram(program, 1, &device_id, NULL, NULL, NULL));

  // Create the OpenCL kernel
  cl_kernel kernel = clCreateKernel(program, "add_vectors", &ret);
  clErrchk(ret);

  // Set the arguments of the kernel
  clErrchk(clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&vec0_mem_obj));
  clErrchk(clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&vec1_mem_obj));
  clErrchk(clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&res_mem_obj));
  clErrchk(clSetKernelArg(kernel, 3, sizeof(const int), (void *)&size));

  // Execute the OpenCL kernel on the list
  const size_t max_local_size = 256; // we can query to figure this out.


  size_t global_size = size; // total number of work items
  size_t local_size = size;  // work items per work group (capped below)
  if (local_size > max_local_size) {
    local_size = max_local_size;
  }

  clErrchk(clEnqueueNDRangeKernel(
    command_queue, kernel, 1, NULL,
    &global_size, &local_size,
    0, NULL, NULL
  ));

  // Read the result buffer on the device back into res_h
  clErrchk(clEnqueueReadBuffer(
    command_queue, res_mem_obj, CL_TRUE, 0,
    size*sizeof(float), res_h.data(), 0, NULL, NULL
  ));

  // Clean up
  clErrchk(clFlush(command_queue));
  clErrchk(clFinish(command_queue));
  clErrchk(clReleaseKernel(kernel));
  clErrchk(clReleaseProgram(program));
  clErrchk(clReleaseMemObject(vec0_mem_obj));
  clErrchk(clReleaseMemObject(vec1_mem_obj));
  clErrchk(clReleaseMemObject(res_mem_obj));
  clErrchk(clReleaseCommandQueue(command_queue));
  clErrchk(clReleaseContext(context));
  return 0;
}

Like I said, there is a lot more boilerplate here. We have to create an OpenCL context and a command queue, and then build our program (which is stored as a string literal no less). Once our program is built, we get the kernel we want, and manually set its arguments before finally calling it. Managing memory is definitely not as clear as cudaMalloc and cudaMemcpy. Perhaps most annoying is the fact that there is no builtin equivalent to cudaGetErrorString, hence the lengthy getErrorString function. Let me know if I should do a walkthrough of this code!

Now, the Khronos Group has a C++ OpenCL runtime API, but for the life of me I could not get it to work on my system. In my N-body project, I’ve put together some code that abstracts away a lot of this boilerplate, allowing for CUDA-like convenience when calling OpenCL kernels.
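To give a flavor of what that kind of abstraction can look like, here’s an illustrative sketch of a thin RAII wrapper around cl_mem (this is not the actual code from my N-body project; the names and error handling are simplified):

#define CL_TARGET_OPENCL_VERSION 220
#include <CL/cl.h>

#include <cstddef>
#include <stdexcept>

// Owns a device buffer and pairs it with a queue for blocking copies.
class DeviceBuffer {
public:
  DeviceBuffer (cl_context context, cl_command_queue queue, size_t bytes)
    : queue_(queue), bytes_(bytes)
  {
    cl_int ret;
    mem_ = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &ret);
    if (ret != CL_SUCCESS) throw std::runtime_error("clCreateBuffer failed");
  }
  ~DeviceBuffer () { clReleaseMemObject(mem_); }

  void write (const void* src) { // host -> device
    clEnqueueWriteBuffer(queue_, mem_, CL_TRUE, 0, bytes_, src, 0, NULL, NULL);
  }
  void read (void* dst) { // device -> host
    clEnqueueReadBuffer(queue_, mem_, CL_TRUE, 0, bytes_, dst, 0, NULL, NULL);
  }

  cl_mem handle () const { return mem_; }

private:
  cl_command_queue queue_;
  cl_mem mem_;
  size_t bytes_;
};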

The fact that we’re storing our program as a string literal has important implications for how we approach OpenCL development. We cannot #include OpenCL kernels the way we can with CUDA kernels. On the other hand, because the kernel is compiled at runtime, we can store it in a separate file and read and compile its contents when the program runs (I’ll sketch that below, after the #include example). For toy examples this is fine, even sort of desirable: we don’t have to recompile the host program every time we change the kernel. Things become more complicated if we want to bundle OpenCL kernels with library code like we do with CUDA applications. The best solution I’ve come up with is to do something like the following:

#include <iostream>
#include <string>

const std::string add_vectors_str =
#include "add_vectors.cl"
;

int main () {
  std::cerr << add_vectors_str << std::endl;
}

In add_vectors.cl, we’ve wrapped up our kernel in C++11 raw string literal delimiters:

R"==(
__kernel void add_vectors(
  __global const float *vec0,
  __global const float *vec1,
  __global float *res,
  const int size
)
{
  int idx = get_global_id(0);
  if (idx > size) {
    return;
  }
  res[idx] = vec0[idx] + vec1[idx];
}
)=="

From the vector addition examples I showed, CUDA and OpenCL kernel syntax is pretty similar. At first glance, the biggest difference between the two is the way you lay out threads. In CUDA we specify a grid of blocks; when we write kernel<<<2, 64>>>(...); we’re saying we want a grid of 2 blocks with 64 threads in each block. In OpenCL we specify the total number of threads we want, and the number of threads per block:

size_t global_size = 128;
size_t local_size = 64;
clEnqueueNDRangeKernel(
  command_queue, kernel, 1, NULL,
  &global_size, &local_size,
  0, NULL, NULL
);

Here, local_size is the number of “work items” (threads) in a “work group” (block), and global_size is the total number of threads we’ll have at our disposal. Just like in CUDA, there is a maximum number of allowable threads in a block. We can use clGetKernelWorkGroupInfo to figure out what this is. I used the OpenCL 2.2 Quick Reference Guide to figure out the name of this function.
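For example, querying the largest work group size we can launch a given kernel with looks like this (a quick sketch, reusing kernel, device_id, and the clErrchk macro from the listing above):

// Query the maximum work group size this kernel can be launched with.
size_t max_local_size = 0;
clErrchk(clGetKernelWorkGroupInfo(
  kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
  sizeof(max_local_size), &max_local_size, NULL
));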

What about more “advanced” features, like warp reduction? This requires shared memory, intra-block synchronization, and some means of getting data from adjacent threads. Note that a warp in OpenCL terminology is a “subgroup”. From what I can tell, OpenCL doesn’t have a __shfl_down_sync function like CUDA, but it does have sub_group_reduce_add, which is a much easier (though less explicit) way of adding up data from within a warp. As far as block/work group local data goes, CUDA’s __shared__ memory is __local memory in OpenCL. OpenCL can even size this memory at runtime: you pass clSetKernelArg a byte count and a NULL pointer for a __local kernel argument (see the snippet below). How about thread synchronization? CUDA has __syncthreads(), and OpenCL has barrier(CLK_LOCAL_MEM_FENCE). Neither CUDA nor OpenCL has a general means of synchronizing different thread blocks within a single kernel launch.
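Here’s what runtime-sized __local memory looks like on the host side. This is a sketch assuming a hypothetical kernel whose second argument is declared __local float* scratch (the reduction kernel below uses a fixed-size buffer instead):

// Kernel signature assumed (OpenCL C):
//   __kernel void sum (__global float* arr, __local float* scratch, const int size)

// Passing a size with a NULL pointer allocates __local memory per work group.
size_t scratch_elems = 256;
clErrchk(clSetKernelArg(kernel, 1, scratch_elems * sizeof(float), NULL));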

To show all of these features in action, here’s an OpenCL kernel that sums up an array, using warp (subgroup) reduction:

__kernel void sum (
  __global float* arr,
  const int size
)
{
  // One slot per subgroup; 32 slots is plenty for work groups of up
  // to 32 * sub_group_size work items.
  __local float a[32];

  // Assumes the work group size is a multiple of the subgroup size.
  const size_t sub_group_size = get_sub_group_size();

  const size_t local_id_0 = get_local_id(0);
  const size_t local_size_0 = get_local_size(0);

  const size_t warp_num = local_id_0 / sub_group_size;
  const size_t warp_lane = local_id_0 % sub_group_size;
  const size_t num_warps = local_size_0 / sub_group_size;

  // Each work item accumulates a strided slice of the array.
  float updates0 = 0.0;
  for (size_t idx=local_id_0; idx<size; idx+=local_size_0) {
    updates0 += arr[idx];
  }

  // Reduce within each subgroup; lane 0 stashes the partial sum.
  updates0 = sub_group_reduce_add (updates0);
  if (warp_lane == 0) {
    a[warp_num] = updates0;
  }
  barrier(CLK_LOCAL_MEM_FENCE);

  // The first subgroup reduces the per-subgroup partial sums. Lanes
  // beyond num_warps contribute zero, so we never read uninitialized
  // local memory.
  if (warp_num == 0) {
    updates0 = (warp_lane < num_warps) ? a[warp_lane] : 0.0f;
    updates0 = sub_group_reduce_add (updates0);
    if (warp_lane == 0) {
      arr[0] = updates0;
    }
  }
}
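Finally, here’s a sketch of the host-side launch for this kernel. It assumes the same boilerplate as the vector addition example, with arr_mem_obj holding the array on the device (that name is mine, not from the earlier listing). Since the kernel reduces within a single work group and its strided loop covers the whole array, we launch exactly one work group:

cl_kernel kernel = clCreateKernel(program, "sum", &ret);
clErrchk(ret);

clErrchk(clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&arr_mem_obj));
clErrchk(clSetKernelArg(kernel, 1, sizeof(const int), (void *)&size));

// One work group: global and local sizes are equal.
size_t global_size = 256;
size_t local_size = 256;
clErrchk(clEnqueueNDRangeKernel(
  command_queue, kernel, 1, NULL,
  &global_size, &local_size,
  0, NULL, NULL
));

// The kernel leaves the total in arr[0].
float total;
clErrchk(clEnqueueReadBuffer(
  command_queue, arr_mem_obj, CL_TRUE, 0,
  sizeof(float), &total, 0, NULL, NULL
));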