I'm playing around with compute shaders and I'd like to make the application hardware-aware.
For example, if the computer has a GPU with 2560 compute cores, I'd like the program to distribute the workload over at most 2560 threads so that they all finish in a single "step". I don't want to dispatch 2561 threads, because that would take two "steps": the second step would take just as long as the first while running only one thread, leaving the GPU almost idle.
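To make the arithmetic concrete, here is a minimal sketch of the model I have in mind. It assumes each core runs exactly one thread per "step", which is my own simplification and may not match how a real GPU schedules work. The function names `steps_needed` and `last_step_utilization` are hypothetical, made up for this illustration.

```python
def steps_needed(num_threads: int, num_cores: int) -> int:
    """Number of "steps" under the assumed one-thread-per-core model."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    if num_threads < 0:
        raise ValueError("num_threads must be non-negative")
    # Integer ceiling division: round up to a whole number of steps.
    return (num_threads + num_cores - 1) // num_cores


def last_step_utilization(num_threads: int, num_cores: int) -> float:
    """Fraction of cores that are busy during the final step."""
    if steps_needed(num_threads, num_cores) == 0:
        return 0.0
    # Threads left over for the final step; a remainder of 0 means it is full.
    remainder = num_threads % num_cores
    threads_in_last_step = remainder if remainder != 0 else num_cores
    return threads_in_last_step / num_cores


if __name__ == "__main__":
    cores = 2560  # the core count from my example, not queried from hardware
    for threads in (2560, 2561):
        print(threads, steps_needed(threads, cores),
              last_step_utilization(threads, cores))
```

Under this model, 2560 threads fit in one full step, while 2561 threads need a second step in which only one of the 2560 cores is busy.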