Part of the performance surprise is that I'm prototyping the algorithms on my laptop, which, granted, is a few years old but was pretty high end when I got it. The math library in use sees what I'm doing and says, okay, I'll just offload all that to the GPU, and I get my result instantly. Then I try it on the hardware it needs to run on, and all of that gets pushed back to a much slower CPU.
That might be long-term motivation to port the whole codebase over to something that can run on my laptop.