swr/rast: reduce simd{16}vertex stack for VS output
Frontend - reduce simdvertex/simd16vertex stack usage for VS output in
ProcessDraw, fixes stack overflow in some of the deeper call stacks under
SIMD16.
1. Move the vertex store out of PA_FACTORY, and off the stack
2. Allocate the vertex store out of the aligned heap (pointer is
temporarily stored in TLS, but will be migrated to thread pool
along with other frontend temporary buffers).
3. Grow the vertex store as necessary for the number of verts per
primitive, in chunks of 8/4 simdvertex/simd16vertex
Reviewed-by: Bruce Cherniak <bruce.cherniak@intel.com>