I just completed my first CUDA-accelerated video filter using the Premiere Pro SDK for CS6 and CC7. If suitable CUDA hardware is available, it uses it; otherwise it falls back to a multithreaded software implementation. Both work extremely well. However, the CUDA implementation could be a lot faster if the source and destination memory buffers were "pinned" memory. Since they are not, I must copy the source and destination frames to a pinned staging buffer, and then asynchronously copy that to CUDA device memory and back. The overhead of copying between the source/destination buffers and pinned memory is significant. Without that extra copy, the CUDA path on my laptop is fast enough to process 130 fps for 1920 by 1080 HD video; with the copy to pinned memory I only get about 45 fps.
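For reference, here is a minimal sketch of the staging pattern I am describing; the buffer names, sizes, and the kernel call are just for illustration, not the actual filter code:

```cpp
// Sketch of the staging-copy pattern described above (names are illustrative).
// Because the host frame buffers the app hands the filter are pageable,
// cudaMemcpyAsync cannot transfer from them efficiently, so each frame is
// first copied into pinned (page-locked) host memory.

#include <cuda_runtime.h>
#include <cstring>

void processFrame(const unsigned char* srcFrame,   // pageable source buffer from the host app
                  unsigned char*       dstFrame,   // pageable destination buffer from the host app
                  size_t               frameBytes,
                  unsigned char*       pinnedSrc,  // allocated once with cudaHostAlloc
                  unsigned char*       pinnedDst,  // allocated once with cudaHostAlloc
                  unsigned char*       devSrc,     // device buffer from cudaMalloc
                  unsigned char*       devDst,     // device buffer from cudaMalloc
                  cudaStream_t         stream)
{
    // Extra copy #1: pageable source -> pinned staging buffer (this is the overhead).
    memcpy(pinnedSrc, srcFrame, frameBytes);

    // Pinned -> device transfer can now be asynchronous on the stream.
    cudaMemcpyAsync(devSrc, pinnedSrc, frameBytes, cudaMemcpyHostToDevice, stream);

    // launchFilterKernel(devSrc, devDst, frameBytes, stream);  // hypothetical kernel launch

    // Device -> pinned staging buffer, then wait for the stream to finish.
    cudaMemcpyAsync(pinnedDst, devDst, frameBytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    // Extra copy #2: pinned staging buffer -> pageable destination.
    memcpy(dstFrame, pinnedDst, frameBytes);
}
```

If the source and destination buffers handed to the filter were already pinned, both memcpy calls would disappear and the asynchronous transfers could use those buffers directly, which is where the speedup would come from.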
If I do the exact same filter in DirectShow, the source and destination buffer pools are always pinned and the filter runs much faster than it does in Premiere Pro. I noticed that the new GPU filter example uses an AE interface, but it does allow access to pinned memory. However, I have not mastered the AE interface, and I am reluctant to give up 13 years of learning curve on the Premiere SDK.
Is there any good reason why the source and destination buffer pools in the Premiere Pro SDK are not pinned memory?
Gene