8283435: AArch64: [vectorapi] Optimize SVE lane/withLane operations for 64/128-bit vector sizes
This patch optimizes the SVE backend implementations of Vector.lane and Vector.withLane for 64/128-bit vector sizes. The basic idea is to use lower-cost NEON instructions when the vector size is 64 or 128 bits.

1. Vector.lane(int i) (Gets the lane element at lane index i)

As SVE doesn't have direct instruction support for extraction like "pextr"[1] in x86, the generated code was as follows:

```
Byte512Vector.lane(7)

orr     x8, xzr, #0x7
whilele p0.b, xzr, x8
lastb   w10, p0, z16.b
sxtb    w10, w10
```

This patch uses a NEON instruction instead if the target lane is located in the NEON 128b range. For the same example above, the generated code is now much simpler:

```
smov    x11, v16.b[7]
```

For cases where the target lane is located outside the NEON 128b range, this patch uses EXT to shift the target lane down to the lowest position. The generated code is as follows:

```
Byte512Vector.lane(63)

mov     z17.d, z16.d
ext     z17.b, z17.b, z17.b, #63
smov    x10, v17.b[0]
```

2. Vector.withLane(int i, E e) (Replaces the lane element of this vector at lane index i with value e)

For 64/128-bit vectors, the insert operation can be implemented with NEON instructions to get better performance. E.g., for IntVector.SPECIES_128, "IntVector.withLane(0, (int)4)" generates the following code:

```
Before:
orr     w10, wzr, #0x4
index   z17.s, #-16, #1
cmpeq   p0.s, p7/z, z17.s, #-16
mov     z17.d, z16.d
mov     z17.s, p0/m, w10

After:
orr     w10, wzr, #0x4
mov     v16.s[0], w10
```

This patch also makes a small improvement for vectors whose sizes are greater than 128 bits. It can save 1 "DUP" if the target index is smaller than 32. E.g., for ByteVector.SPECIES_512, "ByteVector.withLane(0, (byte)4)" generates the following code:

```
Before:
index   z18.b, #0, #1
mov     z17.b, #0
cmpeq   p0.b, p7/z, z18.b, z17.b
mov     z17.d, z16.d
mov     z17.b, p0/m, w16

After:
index   z17.b, #-16, #1
cmpeq   p0.b, p7/z, z17.b, #-16
mov     z17.d, z16.d
mov     z17.b, p0/m, w16
```

With this patch, we can see up to 200% performance gain for specific vector micro benchmarks on my SVE test system.
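For readers less familiar with SVE, the EXT-based extraction and the INDEX/CMPEQ-based insertion described above can be modelled in scalar Java. This is only an illustrative sketch, not the HotSpot implementation: the class and method names (`SveLaneModel`, `extRotate`, `lane`, `withLane`) are hypothetical, a byte array stands in for a vector register, and the mapping of lane index `i` to the compare immediate `i - 16` is an assumption inferred from the "smaller than 32" condition (it matches a signed compare immediate in the range -16..15).

```java
import java.util.Arrays;

/** Scalar model of the sequences in the commit message. Hypothetical names; not HotSpot code. */
final class SveLaneModel {

    /** Models "ext z.b, z.b, z.b, #idx" with both sources equal: byte idx moves to byte 0. */
    static byte[] extRotate(byte[] z, int idx) {
        byte[] out = new byte[z.length];
        for (int i = 0; i < z.length; i++) {
            out[i] = z[(i + idx) % z.length];
        }
        return out;
    }

    /** Models byte lane extraction: SMOV inside the low 128 bits, EXT then SMOV otherwise. */
    static int lane(byte[] z, int idx) {
        if (idx < 0 || idx >= z.length) {
            throw new IllegalArgumentException("lane index out of range");
        }
        if (idx < 16) {
            return z[idx];               // smov x, v.b[idx] (sign-extending read)
        }
        return extRotate(z, idx)[0];     // ext ..., #idx ; smov x, v.b[0]
    }

    /** Models the "After" insertion sequence for vectors wider than 128 bits (idx < 32 only). */
    static byte[] withLane(byte[] z, int idx, byte value) {
        if (idx < 0 || idx >= z.length || idx >= 32) {
            throw new IllegalArgumentException("model covers lane indexes 0..31 only");
        }
        int imm = idx - 16;              // assumed compare immediate, in -16..15
        byte[] out = z.clone();          // mov z17.d, z16.d
        for (int i = 0; i < z.length; i++) {
            byte indexLane = (byte) (i - 16);   // index z.b, #-16, #1
            if (indexLane == imm) {             // cmpeq p.b, p7/z, z.b, #imm
                out[i] = value;                 // mov z.b, p/m, w
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] z = new byte[64];         // stand-in for a 512-bit register of bytes
        for (int i = 0; i < z.length; i++) {
            z[i] = (byte) (i - 100);
        }
        System.out.println(lane(z, 7));
        System.out.println(lane(z, 63));
        System.out.println(Arrays.toString(Arrays.copyOf(withLane(z, 0, (byte) 4), 4)));
    }
}
```

Starting the index vector at -16 rather than 0 is what lets the target lane be named by an immediate for indexes 0 through 31, so no second register has to be filled with the index value.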
[TEST]
test/jdk/jdk/incubator/vector and test/hotspot/jtreg/compiler/vectorapi passed without failure.

[1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq

Change-Id: Ic2a48f852011978d0f252db040371431a339d73c