Applying the swizzles lets the remapped vec4 variables a, b and r perform four straight linear 32-bit XOR operations where a scalar processor would need 16 individual byte-level operations. Given wide enough SIMD backends in hardware, these 32-bit XORs may be done as single-cycle operations across the entire 128-bit Rijndael Matrix.
The alternative is to perform the 4x4 GF(256) Matrix Multiply directly, using the MDS Matrix.
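
As a rough illustration of that second option, the sketch below multiplies one 4-byte column by the AES MixColumns MDS matrix in GF(256), reducing by the AES polynomial 0x11B. The names `gf_mul`, `mix_column` and `mix_columns` are made up for this example, and the column-major layout of the state is an assumption. It is a plain software reference, not a proposal for how the hardware or ISA should do it.

```python
# AES MixColumns MDS matrix (FIPS 197); entries are GF(2^8) elements.
MDS = (
    (2, 3, 1, 1),
    (1, 2, 3, 1),
    (1, 1, 2, 3),
    (3, 1, 1, 2),
)


def gf_mul(x, y):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    for _ in range(8):
        if y & 1:
            product ^= x      # addition in GF(2^8) is XOR
        y >>= 1
        x <<= 1
        if x & 0x100:
            x ^= 0x11B        # reduce back into 8 bits
    return product


def mix_column(col):
    """Multiply one 4-byte column by the MDS matrix."""
    out = []
    for row in MDS:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


def mix_columns(state):
    """Apply mix_column to each column of the state.

    Assumption for this sketch: state is a list of four columns,
    each a list of four bytes.
    """
    return [mix_column(col) for col in state]
```

For the commonly quoted test column `[0xdb, 0x13, 0x53, 0x45]`, `mix_column` returns `[0x8e, 0x4d, 0xa1, 0xbc]`. Each output byte costs four GF(256) multiplies and three XORs here, which is the work the swizzled XOR form avoids doing byte by byte.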
+
+# TODO
+
+investigate https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6879380/#!po=19.6429