[QUOTE=giordi;23679]Thanks a lot guys!
This was just a simple test; I will try to make more meaningful tests.
It might be that I will do that in C++ though, since I would like to end up doing something in Maya and CUDA.[/QUOTE]
Hey, nice to have someone else tinkering with parallel computing! I actually spent a large part of last semester playing with CUDA.
I mean, you don't exactly need to be a genius to make the connection and see the prospect (…even I could):
Large models with lots of vertices -> large sets of simultaneous operations -> deformer with parallel computation -> speed boost beyond compare.
I was planning on doing a blog post about my experiences, but since I don't have a blog, I'll probably just use this thread…
DISCLAIMER: I am by no means an expert on GPU parallelization or C++, I'm just writing down my experiences here.
Conclusion (I'll start with this for people who, like me, are lazy readers):
While it was seriously fun to toy around with it, it is really hard to get a serious speed benefit out of parallel computing on the GPU unless the problem you are trying to solve is really "custom built" for it and the plugin is custom built for the GPU. I basically found that OpenMP is often the far better and "stupidly" easy-to-set-up solution for the typical "I have something and now want to make it parallel" case. (Of course it doesn't compare to GPU parallelization in raw potential, but it is good at parallelizing average tasks "on the fly".)
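To illustrate the "stupidly easy" part, here is roughly all it takes to parallelize an existing loop with OpenMP. This is just a minimal sketch with made-up names (a sine deform on a flat array, nothing Maya-specific), not code from my plugin:

[CODE]
// Minimal OpenMP sketch: parallelizing an existing per-vertex loop
// "on the fly". Compile with e.g. g++ -fopenmp (or /openmp in MSVC).
#include <vector>
#include <cmath>

// Hypothetical deformer loop: push every vertex along a sine wave.
void sineDeform(std::vector<float>& x, std::vector<float>& y,
                float amplitude, float frequency)
{
    const int n = static_cast<int>(x.size());

    // The only change compared to the serial version is this pragma:
    // every iteration is independent, so OpenMP can split them across
    // the CPU cores without any further setup.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        y[i] += amplitude * std::sin(frequency * x[i]);
    }
}
[/CODE]

That single pragma is basically the whole setup: no memory copies, no kernels, which is why it pays off so quickly for "average" tasks.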
I will basically only look at GPU parallelization again when
A: I have some massive data to crunch which doesn't need any user interaction while running, or
B: shared memory architectures arrive (they probably have the power to make GPU parallelization really shine…).
And that doesn't seem to be only my experience. The developers of Fabric Engine do the same: we had a workshop with one of the developers, and he said they are basically only wrapping CPU parallelization until shared memory architectures arrive, because the memory-copy bottleneck makes the GPU path slower than just staying on the CPU.
Apart from any parallelization, I think when you see how performant compiled KL tools run, it's a miracle what code optimization à la LLVM can do.
(LLVM or not, writing high-performance C++ code is probably the one thing that sets TDs apart from "real" programmers…)
My experience:
I wrote a verlet cloth solver as a Maya C++ plugin that lets you choose whether to run serially, OpenMP-parallelized, or on the GPU using CUDA.
The fastest of these is by far the OpenMP version. The CUDA version doesn't make any sense at all performance-wise.
The reasons for that are the following:
1. The Verlet algorithm isn't exactly a genius pick for GPU computation. Not all of the computation spreads across the vertices: as soon as you integrate constraints, you have to iterate over the edges, and you need to iterate them serially (even on the GPU, inside a kernel; see the sketch after this list). Experience teaches you…
2. I tried to optimize the memcopying as much as possible by splitting the data into static and dynamic parts, with the static solver data copied to the GPU only once on the initial frame, but the copying nevertheless proved to be a serious speed killer. You will almost never end up in a situation with no dynamic data that needs to be uploaded on every step of the computation; just think about keyframe animation of parameters… A buddy here from the academy did a sand solver for Max and gets great speed benefits, but as far as I'm informed he copies the data once on the initial frame and doesn't allow any animation at all.
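To make points 1 and 2 a bit more concrete, here is a stripped-down sketch of what I mean. It is not the actual plugin code, just the shape of the problem with simplified names: the position update parallelizes nicely over the vertices, the naive constraint relaxation walks the edges serially even inside a kernel, and the animated data still has to travel to and from the GPU on every step.

[CODE]
// Simplified CUDA sketch of the two bottlenecks (not the actual plugin code).
#include <cuda_runtime.h>

// 1) Verlet position update: one thread per vertex, parallelizes nicely.
__global__ void verletIntegrate(float3* pos, float3* prevPos,
                                const float3* force,
                                int numVerts, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numVerts) return;

    float3 p = pos[i];
    float3 vel = make_float3(p.x - prevPos[i].x,
                             p.y - prevPos[i].y,
                             p.z - prevPos[i].z);
    prevPos[i] = p;
    pos[i] = make_float3(p.x + vel.x + force[i].x * dt * dt,
                         p.y + vel.y + force[i].y * dt * dt,
                         p.z + vel.z + force[i].z * dt * dt);
}

// 2) Constraint relaxation: edges share vertices, so this naive version
//    runs in a single thread and walks the edges serially inside the kernel.
__global__ void satisfyConstraints(float3* pos, const int2* edges,
                                   const float* restLength, int numEdges)
{
    for (int e = 0; e < numEdges; ++e)   // serial loop, no parallel gain here
    {
        int a = edges[e].x, b = edges[e].y;
        float3 d = make_float3(pos[b].x - pos[a].x,
                               pos[b].y - pos[a].y,
                               pos[b].z - pos[a].z);
        float len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
        if (len < 1e-6f) continue;
        float corr = 0.5f * (len - restLength[e]) / len;
        pos[a].x += d.x * corr; pos[a].y += d.y * corr; pos[a].z += d.z * corr;
        pos[b].x -= d.x * corr; pos[b].y -= d.y * corr; pos[b].z -= d.z * corr;
    }
}

// Host side, called once per simulation step. The static data (edges,
// rest lengths) was copied to the GPU once on the initial frame; the
// dynamic data still has to travel in both directions every step.
void solveStep(float3* d_pos, float3* d_prevPos, float3* d_force,
               const int2* d_edges, const float* d_restLength,
               const float3* h_force, float3* h_pos,
               int numVerts, int numEdges, float dt)
{
    // per-step upload of animated forces/parameters: the speed killer
    cudaMemcpy(d_force, h_force, numVerts * sizeof(float3),
               cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (numVerts + threads - 1) / threads;
    verletIntegrate<<<blocks, threads>>>(d_pos, d_prevPos, d_force,
                                         numVerts, dt);
    satisfyConstraints<<<1, 1>>>(d_pos, d_edges, d_restLength, numEdges);

    // and the result has to come back to the host (Maya) every step as well
    cudaMemcpy(h_pos, d_pos, numVerts * sizeof(float3),
               cudaMemcpyDeviceToHost);
}
[/CODE]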
I also did the same test you did with the image operation, only in C++, and CUDA was actually on par… if anything slightly slower (only tested on a 2k image, though…).
It would be interesting if you ran your tests with the operation also running serially within a kernel, just to make sure the difference is really a result of the parallelization (and you are not, for example, comparing a solution programmed in Python against wrapped algorithms).
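What I mean by "running serially within a kernel" is simply a baseline like the following. This is a rough sketch with a made-up brightness operation standing in for whatever your actual image operation is:

[CODE]
// Sketch of the suggested baseline: the same per-pixel operation once in a
// normal parallel kernel and once running serially inside a single thread.

__device__ float brighten(float v) { return v * 1.2f; }  // placeholder op

// Parallel version: one thread per pixel.
__global__ void imageOpParallel(float* img, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        img[i] = brighten(img[i]);
}

// Serial-on-GPU baseline: launched with <<<1, 1>>> so a single thread
// walks all pixels. Comparing this against the parallel launch isolates
// the effect of the parallelization itself from everything else
// (language, memcopies, launch overhead).
__global__ void imageOpSerial(float* img, int numPixels)
{
    for (int i = 0; i < numPixels; ++i)
        img[i] = brighten(img[i]);
}

// usage:
//   imageOpParallel<<<(numPixels + 255) / 256, 256>>>(d_img, numPixels);
//   imageOpSerial<<<1, 1>>>(d_img, numPixels);
[/CODE]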
The second thing I did last semester was playing with the Python C API, understanding how to wrap my C++ in Python (for the speed-critical parts) and how Python works under the hood.
While I was at that, I also wrapped some kernels as a test, so that you could add up lists on the GPU or do the Houdini sine deform example on the GPU.
I mean, I do these things like a TD, not like a real programmer, but what I discovered (…I always discover things which seem so logical afterwards…):
Wrapping CUDA for Python not only gives you the memcopy issue, which costs time for infrastructural work, but also the work of extracting the data out of Python's own structures.
For example, adding two lists on the GPU (a rough sketch follows below):
1. You need to create two arrays out of the given PyObject*. Whenever I tried to parallelize that (with OpenMP), it failed. I think the Python API is not thread-safe for this task.
2. So that step is basically serial, and its cost grows linearly with the amount of data… but on the other hand, there's no point in parallelizing on the GPU if the data amount is not massive.
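Roughly, the wrapped "add two lists" looked like this. It is a simplified sketch of the idea, not my exact code; error handling and the module boilerplate are left out:

[CODE]
// Simplified sketch of wrapping a CUDA "add two lists" for Python
// (error handling and module boilerplate omitted; not the exact code).
#include <Python.h>
#include <cuda_runtime.h>

__global__ void addArrays(const double* a, const double* b,
                          double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

static PyObject* gpu_add(PyObject* self, PyObject* args)
{
    PyObject *listA, *listB;
    if (!PyArg_ParseTuple(args, "OO", &listA, &listB))
        return NULL;

    Py_ssize_t n = PyList_Size(listA);

    // 1) Extracting the data out of Python's own structures.
    //    This loop stays serial: CPython API calls generally need the GIL,
    //    so spreading them across OpenMP threads is not safe.
    double* hA = (double*)malloc(n * sizeof(double));
    double* hB = (double*)malloc(n * sizeof(double));
    double* hOut = (double*)malloc(n * sizeof(double));
    for (Py_ssize_t i = 0; i < n; ++i) {
        hA[i] = PyFloat_AsDouble(PyList_GetItem(listA, i));
        hB[i] = PyFloat_AsDouble(PyList_GetItem(listB, i));
    }

    // 2) The usual memcopy overhead on top of that.
    double *dA, *dB, *dOut;
    cudaMalloc(&dA, n * sizeof(double));
    cudaMalloc(&dB, n * sizeof(double));
    cudaMalloc(&dOut, n * sizeof(double));
    cudaMemcpy(dA, hA, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, n * sizeof(double), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (int)((n + threads - 1) / threads);
    addArrays<<<blocks, threads>>>(dA, dB, dOut, (int)n);
    cudaMemcpy(hOut, dOut, n * sizeof(double), cudaMemcpyDeviceToHost);

    // ...and the serial conversion back into a Python list.
    PyObject* result = PyList_New(n);
    for (Py_ssize_t i = 0; i < n; ++i)
        PyList_SetItem(result, i, PyFloat_FromDouble(hOut[i]));

    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    free(hA); free(hB); free(hOut);
    return result;
}
[/CODE]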
Pheww, long post…
But anyhow, if you are interested in programming, I think GPU parallelization is a particularly fascinating area.
And if you do it right and pick the right problem to solve, you might be able to get a great performance gain.
But to use a quote from the KL workshop: "Parallelization is almost never efficient when added on top of a program later on"… and that is probably doubly true for parallelization on the GPU.