CUDA/C++ advice

53 Replies - 2438 Views - Last Post: 29 June 2017 - 09:40 AM

#1 doriandiaconu

CUDA/C++ advice

Posted 03 June 2017 - 06:27 AM

I'm searching for some CUDA advice regarding a certain matter.

I was trying to test the capabilities of my laptop (GeForce 840M and Intel i7) when I came across this code:

https://github.com/B...aussianBlurCuda

It seems that the Gaussian filtering only works for pictures up to around 1000x1000.

I just want to know whether this is a hardware limitation on my part or whether I'm doing something wrong.

The program runs without errors; it's just that the output file (the filtered image) can't be opened by Windows or other applications.

Thanks!


Replies To: CUDA/C++ advice

#2 Salem_c

Re: CUDA/C++ advice

Posted 03 June 2017 - 08:03 AM

What does "can't be opened" mean?

Have you tried a hex editor application?
Hex editors don't care what a file is supposed to be; they just display the raw bytes.

From your github, it seems that the output file is a BMP file.
Use your hex editor to verify that the saved header actually represents a valid BMP file.

Most programs have defence mechanisms in place to prevent nonsense happening when presented with broken files. Specially crafted 'broken' files have long been used as an attack vector for malware, so most 'viewer' programs now just refuse to deal with anything suspect.
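If it helps, here is a minimal sketch in plain C++ of that header check; it just dumps the fields a viewer cares about so you can compare them against the hex editor ("output.bmp" is a placeholder for whatever file name your program writes):

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    std::ifstream f("output.bmp", std::ios::binary);   // placeholder file name
    std::vector<unsigned char> h(54);                  // 14-byte file header + 40-byte BITMAPINFOHEADER
    if (!f.read(reinterpret_cast<char*>(h.data()), static_cast<std::streamsize>(h.size()))) {
        std::puts("file shorter than a BMP header");
        return 1;
    }
    // Little-endian field readers
    auto u32 = [&](int o) { return unsigned(h[o]) | unsigned(h[o+1]) << 8 |
                                   unsigned(h[o+2]) << 16 | unsigned(h[o+3]) << 24; };
    auto u16 = [&](int o) { return unsigned(h[o]) | unsigned(h[o+1]) << 8; };

    std::printf("magic       : %c%c (should be BM)\n", h[0], h[1]);
    std::printf("file size   : %u bytes (header field)\n", u32(2));
    std::printf("data offset : %u (usually 54 for a plain 24-bit BMP)\n", u32(10));
    std::printf("width       : %d\n", int(u32(18)));
    std::printf("height      : %d\n", int(u32(22)));
    std::printf("bits/pixel  : %u (24 for RGB)\n", u16(28));
}

If the magic bytes, the size field, or the width/height don't match the input image, the bug is in how your code writes the header rather than in the filtering itself.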

#3 doriandiaconu

Re: CUDA/C++ advice

Posted 03 June 2017 - 09:03 AM

Thank you for the fast reply!

By "can't be opened" I mean:

https://ibb.co/cwJEav -> trying to open with Windows Photos

https://ibb.co/mweLFv -> trying to open with Paint 2D

I used this HEX editor: https://hexed.it/

https://ibb.co/gRRZav -> default values

https://ibb.co/kHJQha -> values when loading the output picture

Picture I used as reference (resolution is 1419 x 1001): http://samples.filef...ires=1496505279

When I used the following image, which is also a BMP in RGB (512x512): https://software.int...351974/lena.bmp

https://ibb.co/hwOK8F -> this is the output (a blurred image, as it should be; I tried OpenCV as well and got the same amount of blur, just to be sure it's working properly).

https://ibb.co/iFKRoF -> these are its hex values.

I also ran the program on another big JPEG file (4000x4000), which I converted to BMP through http://www.zamzar.co...ert/jpg-to-bmp/ and got the same result as with the first picture.

#4 Salem_c

Re: CUDA/C++ advice

Posted 03 June 2017 - 10:46 AM

TBH, there are too many links to figure out what is supposed to be good or bad.

Also, I was rather hoping you would look at the hex dump and figure it out for yourself (so you would have an idea of how to fix the source code), rather than hoping someone else would point out the bitmap file errors and tell you what code needs to change.

#5 doriandiaconu

Re: CUDA/C++ advice

Posted 03 June 2017 - 11:04 AM

If I already knew what was wrong, I would've fixed it myself.

That's why I posted here... looking for help...

Thanks anyway!

#6 doriandiaconu

Re: CUDA/C++ advice

Posted 04 June 2017 - 04:12 AM

The problems weren't related to the program itself, but to the pictures.

I re-encoded the JPEG/PNG files to BMP with Photoshop and everything worked fine.
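For reference, two common reasons a hand-written BMP reader or writer accepts one file and rejects another are (a) the converter emitting a header variant or bit depth other than the plain 54-byte, 24-bit layout the code expects, and (b) row padding: every pixel row in a 24-bit BMP is padded to a multiple of 4 bytes, so 512-pixel rows (1536 bytes) need no padding while 1419-pixel rows (4257 bytes) need 3 extra bytes. Whether either of these was the culprit here is only a guess, but the stride calculation is worth having around if you touch the BMP code; a small illustrative C++ snippet:

#include <cstdio>

// Bytes per row of a 24-bit BMP: 3 bytes per pixel, rounded up to a multiple of 4.
int bmp_row_stride(int width_px) {
    return (width_px * 3 + 3) & ~3;
}

int main() {
    std::printf("512 px wide : %d bytes/row (no padding)\n", bmp_row_stride(512));       // 1536
    std::printf("1419 px wide: %d bytes/row (3 padding bytes)\n", bmp_row_stride(1419)); // 4260
}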

#7 jjl

Re: CUDA/C++ advice

Posted 05 June 2017 - 01:35 PM

If you're trying to maximize your hardware's performance, then ditch that code; it is full of warp divergence, thread queuing, and slow global memory accesses.

Find a 2D CUDA convolution that uses shared memory and block sizes that are a multiple of the SM's warp size, and you will see a dramatic increase in performance.

#8 doriandiaconu

Re: CUDA/C++ advice

Posted 11 June 2017 - 12:58 PM

jjl, on 05 June 2017 - 08:35 PM, said:

If you're trying to maximize your hardware's performance, then ditch that code; it is full of warp divergence, thread queuing, and slow global memory accesses.

Find a 2D CUDA convolution that uses shared memory and block sizes that are a multiple of the SM's warp size, and you will see a dramatic increase in performance.


Well, starting with this code and working my way through optimizing it was my mentor's idea.

Happily, I managed to make the code a bit faster with some changes here and there.

Now I'm trying to add shared memory to it, and maybe use uchar3 to store the RGB channels.

Regarding shared memory I tried to follow this: https://stackoverflo...d-memory-useful


And this: https://devblogs.nvi...memory-cuda-cc/


And my result was this: https://paste.ofcode...ni4UCupKe5CtFhA


But it seems I get no output and no speed-up. So I'm sure I'm doing something wrong.

I'd appreciate any help/advice.

#9 Skydiver

Re: CUDA/C++ advice

Posted 11 June 2017 - 01:56 PM

Please post your code here, not on an external link which may change or disappear.

#10 jjl

Re: CUDA/C++ advice

Posted 11 June 2017 - 07:36 PM

There is no performance increase because you are not actually using the shared memory. The idea is to have every thread load its global memory data into a shared memory block at the corresponding location, then perform all operations on the shared memory.

For example, all of these reads should come from shared memory, not global memory:
int B1 = B[idx - width - 1];
int B2 = B[idx - width];
int B3 = B[idx - width + 1];
int B4 = B[idx - 1];
int B5 = B[idx];
int B6 = B[idx + 1];
int B7 = B[idx + width - 1];
int B8 = B[idx + width];
int B9 = B[idx + width + 1];

int R1 = R[idx - width - 1];
int R2 = R[idx - width];
int R3 = R[idx - width + 1];
int R4 = R[idx - 1];
int R5 = R[idx];
int R6 = R[idx + 1];
int R7 = R[idx + width - 1];
int R8 = R[idx + width];
int R9 = R[idx + width + 1];

int G1 = G[idx - width - 1];
int G2 = G[idx - width];
int G3 = G[idx - width + 1];
int G4 = G[idx - 1];
int G5 = G[idx];
int G6 = G[idx + 1];
int G7 = G[idx + width - 1];
int G8 = G[idx + width];
int G9 = G[idx + width + 1];



A little pseudocode showing how things should look:

shared_memory[id] = global_memory[id];

sync_threads();

// Perform convolution using shared_memory



You could also load the filter kernel into a faster memory type, though it is most likely being cached anyway.
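For a small filter that usually means constant memory; a minimal sketch, assuming a 3x3 double-precision filter and a hypothetical host array h_kernel:

#include <cuda_runtime.h>

// Constant memory is cached and broadcast efficiently when all threads of a
// warp read the same element, which is exactly the access pattern of a filter.
__constant__ double d_kernel[9];   // 3x3 filter coefficients

void upload_filter(const double *h_kernel) {   // h_kernel: 9 host coefficients
    cudaMemcpyToSymbol(d_kernel, h_kernel, 9 * sizeof(double));
}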


Some optimization tips:

1) To increase performance you want to utilize all of the streaming multiprocessors (SMs) on the GPU (a streaming multiprocessor issues instructions to its set of CUDA cores). A block of threads is assigned exclusively to a single SM, therefore you want to create at least as many blocks of threads as there are SMs.

2) An SM schedules its threads in groups called warps (a set of 32 threads on NVIDIA hardware), so you should always make your thread block size a multiple of the warp size in order to maximize SM utilization (see the launch sketch after this list).

3) Finally, use fast memory (like you are attempting now with shared memory).
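To make 1) and 2) concrete, here is a minimal launch-configuration sketch (the kernel name blur and the buffer names d_R/d_G/d_B/d_out are placeholders, not the thread's actual code; current NVIDIA GPUs use a warp size of 32):

__global__ void blur(unsigned char *R, unsigned char *G, unsigned char *B,
                     unsigned char *out, int rows, int cols);   // hypothetical signature

void launch_blur(unsigned char *d_R, unsigned char *d_G, unsigned char *d_B,
                 unsigned char *d_out, int rows, int cols) {
    // 16x16 = 256 threads per block: a multiple of the 32-thread warp size.
    dim3 block(16, 16);

    // Enough blocks to cover the whole image, so every SM gets work.
    dim3 grid((cols + block.x - 1) / block.x,
              (rows + block.y - 1) / block.y);

    blur<<<grid, block>>>(d_R, d_G, d_B, d_out, rows, cols);
}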

This post has been edited by jjl: 11 June 2017 - 07:44 PM


#11 doriandiaconu

Re: CUDA/C++ advice

Posted 11 June 2017 - 11:27 PM

So this means that I will have to do something like:

shared_memory[id] = B[id];
shared_memory[id+1] = B[id+1];
shared_memory[id-1] = B[id-1];


And so on?

I thought that if I loaded B[idx], R[idx], G[idx], the other index variations (idx-1, idx+width, etc.) would be covered as well.

#12 jjl

Re: CUDA/C++ advice

Posted 12 June 2017 - 02:08 PM

I would probably create three shared memory arrays, one for each color channel. Each thread should only load its respective element into shared memory.

shared_r[id] = R[id];
shared_g[id] = G[id];
shared_b[id] = B[id];

sync_threads();

// Perform 2D convolution



sync_threads() (the real CUDA call is __syncthreads()) acts as a barrier: it waits until every thread in the block has reached it.

I suggest implementing an optimized convolution kernel for a single channel, then invoke that kernel for each image channel.
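In real CUDA syntax that per-channel load might look like the following minimal sketch; the 1D block size and the names R/G/B/n are assumptions for illustration, not taken from the thread's code:

#define BLOCK_SIZE 256   // a multiple of the 32-thread warp size

__global__ void blur_channels(const unsigned char *R, const unsigned char *G,
                              const unsigned char *B, int n) {
    __shared__ unsigned char shared_r[BLOCK_SIZE];
    __shared__ unsigned char shared_g[BLOCK_SIZE];
    __shared__ unsigned char shared_b[BLOCK_SIZE];

    int id = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
    if (id < n) {
        // Each thread copies only its own element; note that shared memory is
        // indexed with threadIdx.x, while global memory uses the global id.
        shared_r[threadIdx.x] = R[id];
        shared_g[threadIdx.x] = G[id];
        shared_b[threadIdx.x] = B[id];
    }
    __syncthreads();   // the real CUDA call behind the sync_threads() pseudocode

    // ... perform the convolution reading from shared_r / shared_g / shared_b ...
}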

This post has been edited by jjl: 12 June 2017 - 02:15 PM


#13 doriandiaconu

Re: CUDA/C++ advice

Posted 13 June 2017 - 05:08 AM

I kept switching between using only one channel and using all 3, as I couldn't see a big improvement in speed.

So for 1 channel the code should look like this:
#define block_size 16

...
__global__ void blur(...) {

    int id = threadIdx.x;
    ...
    __shared__ int shared_r[block_size];

    shared_r[id] = R[id];

    sync_threads();

    int R1 = shared_r[id - width - 1]; // or should it be R[id - width - 1]?
    ...

    *rest of the convolution*

#14 jjl

Re: CUDA/C++ advice

Posted 13 June 2017 - 04:32 PM

You are only using a single block of threads, and thus only a single streaming multiprocessor. CUDA limits shared memory to 48 KB per block (not enough to fit an entire image), so you need to create multiple blocks of threads. Each block of threads will run on its own streaming multiprocessor and have its own section of shared memory to work with.

You can think of it as each block of threads loading its chunk of the image into the shared memory of its streaming multiprocessor:


template<typename T>
__global__ void convolve_kernel(T *image, double *result, int rows,
                                int cols, double *kernal,
                                int kernal_dim) {
    // Y coordinate of the pixel being processed
    int ty = blockIdx.y * blockDim.y + threadIdx.y;

    // X coordinate of the pixel being processed
    int tx = blockIdx.x * blockDim.x + threadIdx.x;

    // BLOCK_SIZE is defined elsewhere and equals blockDim.y * blockDim.x
    __shared__ T shared_mem[BLOCK_SIZE];

    // Each thread copies its own pixel into the block's tile; shared memory is
    // indexed with threadIdx, the image with the global coordinates
    shared_mem[threadIdx.y * blockDim.x + threadIdx.x] = image[ty * cols + tx];

    __syncthreads(); // wait for all threads within this block to finish loading their pixel data to shared memory

    // Now perform the convolution reading from shared_mem
}



Still, I suggest you tell your mentor to ditch the code you have ... it's filled with inefficiencies and warp divergence. It's a bad base to optimize on, and to make it worse, you don't understand it. It would be better to start from scratch with a basic convolution kernel.

I'll throw you a bone: I'd start with this basic 2D convolution kernel (no shared memory). Understand it inside and out, then implement shared memory (see the tiled sketch after the kernel).

/* Function Name: convolve_kernel
 * Author: jjl
 * Date: February 5th, 2015
 * Brief: CUDA kernel to perform image convolution with a filter
 * Param [in]: image - The input image
 * Param [out]: result - The result image
 * Param [in]: rows - The number of rows in the input image
 * Param [in]: cols - The number of columns in the input image
 * Param [in]: kernal - The input filter kernel
 * Param [in]: kernal_dim - The size of the input filter kernel
 */
template<typename T>
__global__ void convolve_kernel(T *image, double *result, int rows,
                                int cols, double *kernal,
                                int kernal_dim) {
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int kernel_offset = kernal_dim / 2;
    int image_row = ty;
    int image_col = tx;

    // Ignore border pixels
    if (image_row >= kernel_offset && image_row < rows - kernel_offset
        && image_col >= kernel_offset && image_col < cols - kernel_offset) {

        double value = 0.0;
        for (int i = 0; i < kernal_dim; ++i) {
            int row = (image_row - kernel_offset) + i;

            for (int j = 0; j < kernal_dim; ++j) {
                int col = (image_col - kernel_offset) + j;
                value += kernal[i * kernal_dim + j] *
                         (double)image[row * cols + col];
            }
        }
        result[image_row * cols + image_col] = value;
    }
}
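For the shared-memory version that comes next, the part a plain per-pixel copy misses is the halo: each block also needs the one-pixel border just outside its tile, otherwise the neighbour reads at the tile edges still go to global memory (or out of bounds). A minimal sketch of a tiled 3x3 blur with a halo; the single 8-bit channel, the 16x16 tile and all names here are illustrative, not the thread's actual code:

#define TILE_DIM 16   // threads per block edge
#define HALO      1   // filter radius of a 3x3 kernel

__global__ void blur3x3_shared(const unsigned char *in, unsigned char *out,
                               int rows, int cols, const float *filt /* 3x3, row-major */) {
    // The shared tile holds the block's pixels plus a one-pixel halo on each side.
    __shared__ unsigned char tile[TILE_DIM + 2 * HALO][TILE_DIM + 2 * HALO];

    int tx = blockIdx.x * TILE_DIM + threadIdx.x;   // global column of this thread's pixel
    int ty = blockIdx.y * TILE_DIM + threadIdx.y;   // global row of this thread's pixel

    // Cooperatively fill the padded tile: threads stride over it in block-sized
    // steps so the halo cells get loaded too.
    for (int dy = threadIdx.y; dy < TILE_DIM + 2 * HALO; dy += blockDim.y) {
        for (int dx = threadIdx.x; dx < TILE_DIM + 2 * HALO; dx += blockDim.x) {
            int gy = blockIdx.y * TILE_DIM + dy - HALO;   // global row of this tile cell
            int gx = blockIdx.x * TILE_DIM + dx - HALO;   // global column of this tile cell
            gy = min(max(gy, 0), rows - 1);               // clamp at the image border
            gx = min(max(gx, 0), cols - 1);
            tile[dy][dx] = in[gy * cols + gx];
        }
    }
    __syncthreads();   // the tile, including its halo, is now complete

    if (ty < rows && tx < cols) {
        float value = 0.0f;
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                value += filt[i * 3 + j] * tile[threadIdx.y + i][threadIdx.x + j];
        out[ty * cols + tx] = (unsigned char)value;
    }
}

Launched with dim3 block(TILE_DIM, TILE_DIM) and a grid that covers the image, every neighbour read in the convolution loop then hits shared memory, which is the whole point of the exercise.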



#15 doriandiaconu

Re: CUDA/C++ advice

Posted 17 June 2017 - 02:30 AM

Well, I'm not doing much in terms of the code.

I already have the basic 2D convolution code; I tried changing some things up, but in the end switched to the code I have now.
