We have actually been using NVIDIA Tensor Cores (found in RTX GPUs) since PL5; that’s how it became faster on Windows. This is done by using fp16, which also speeds up processing on the latest AMD GPUs.
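For anyone wondering what fp16 changes in practice, here is a minimal sketch using only the Python standard library. It is not our actual code: the function names and the sample "pixel" values are made up for illustration. It only shows the general trade-off, namely that half precision stores each value in 2 bytes instead of 4, at the cost of a small rounding error.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a float to the nearest IEEE 754 half-precision (fp16) value."""
    # The "e" format code packs a value as a 2-byte half-precision float.
    return struct.unpack("<e", struct.pack("<e", value))[0]


def fp16_report(values):
    """Return (bytes as fp32, bytes as fp16, worst rounding error in fp16)."""
    count = len(values)
    bytes_fp32 = len(struct.pack(f"<{count}f", *values))
    bytes_fp16 = len(struct.pack(f"<{count}e", *values))
    max_error = max(abs(v - to_fp16(v)) for v in values)
    return bytes_fp32, bytes_fp16, max_error


# Hypothetical data: 1024 normalized values in [0, 1].
pixels = [i / 1023 for i in range(1024)]
bytes_fp32, bytes_fp16, max_error = fp16_report(pixels)

print(f"fp32: {bytes_fp32} bytes, fp16: {bytes_fp16} bytes")
print(f"worst fp16 rounding error: {max_error:.6f}")
```

Halving the storage per value means half the memory traffic for the same data, and it is the format that hardware such as Tensor Cores is built to process quickly. The rounding error stays small for values in a normalized range like this one.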
We adapted our algorithm so that it can run on the Apple Neural Engine, rather than only on the M1’s GPU.