A side effect of this flag was that we had to always use notify_all
when starting a job, instead of notify_one -- in practice, we found
experimentally that more calls to notify_one is cheaper than fewer
calls to notify_all.
- parallel_for doesn't use recursion anymore to create the "leaf"
jobs, this is now done linearly on N thread (one thread per CPU).
This uses less stack space, and reduces miss-predicted branches.
- remove almost all SYSTRACE calls because they have a huge impact
on things like parallel_for() and are misleading. They can be
enabled again by setting HEAVY_SYSTRACE to true.
- only signal waitAndRelease() when the corresponding job finishes and
only if there is waitAndRelease() active -- instead of signaling
every time a job ends.
- don't surrender time slice when attempting to steal a job and it fails
as long as some queue has jobs.
- check that we have to wait, because taking the lock
- add a benchmark
This change more than doubles the amount of jobs we can handle per
second (~965,000 jobs/s on Pixel3)