The first part I parallelized was layer generation, which needed only a single parallel_for.
Next I parallelized the triangle generation part (the so-called marching cubes). This is again a simple parallel_for, in which each thread appends the triangles it computes to a shared concurrent_vector. Very simple and impressively effective.
This scales quite well: I achieved an overall speedup of 6x on an 8-core machine compared with a single core.
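To give an idea of the shape of the triangle generation loop, here is a minimal sketch of the pattern, not the actual code. It uses only the standard library so that it compiles anywhere: simple_parallel_for is a hand-rolled stand-in for parallel_for, and a mutex-protected std::vector plays the role of the shared concurrent_vector. The names Triangle, polygonize_cell and extract_triangles are hypothetical, and the per-cell work is a dummy that emits one triangle for every even-numbered cell.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical triangle type: three vertices with x, y, z each.
struct Triangle {
    std::array<float, 9> v;
};

// Hypothetical stand-in for the per-cell marching cubes step.
// Real code would consult the case table for the cell; this dummy emits
// one triangle for every even-numbered cell so the result is easy to check.
std::vector<Triangle> polygonize_cell(std::size_t cell) {
    std::vector<Triangle> out;
    if (cell % 2 == 0) {
        Triangle t{};
        t.v[0] = static_cast<float>(cell);
        out.push_back(t);
    }
    return out;
}

// Stand-in for parallel_for (an assumption, not a library call):
// splits [0, n) into contiguous chunks and runs one thread per chunk.
template <typename Body>
void simple_parallel_for(std::size_t n, unsigned num_threads, Body body) {
    num_threads = std::max(1u, num_threads);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break;  // no work left for further threads
        workers.emplace_back([&body, begin, end] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();
}

// Each thread computes the triangles of its cells and appends them to one
// shared container. The mutex mimics what a concurrent_vector would offer.
std::vector<Triangle> extract_triangles(std::size_t num_cells,
                                        unsigned num_threads) {
    std::vector<Triangle> triangles;  // shared output
    std::mutex triangles_mutex;
    simple_parallel_for(num_cells, num_threads, [&](std::size_t cell) {
        std::vector<Triangle> local = polygonize_cell(cell);
        if (local.empty()) return;
        std::lock_guard<std::mutex> lock(triangles_mutex);
        triangles.insert(triangles.end(), local.begin(), local.end());
    });
    return triangles;
}
```

Calling extract_triangles(1000, 8) would process 1000 cells on up to 8 threads. Note that the order of the triangles in the result depends on thread scheduling, which is fine when the mesh is just a bag of triangles but matters if anything downstream relies on a stable order.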