Commit cae3b16b authored by Pengfei Li

8183390: Fix and re-enable post loop vectorization

** Background

Post loop vectorization is a C2 compiler optimization guarded by the
experimental VM feature PostLoopMultiversioning. It transforms the
range-check-eliminated post loop into a single-iteration vectorized loop
controlled by a vector mask. This optimization was contributed by Intel
in 2016 to support x86 AVX512 masked vector instructions, but it was
disabled soon after an issue was found. Due to insufficient maintenance
over the years, multiple bugs have accumulated in it. We (Arm) still
consider it a useful framework for vector mask support in C2
auto-vectorized loops, for both x86 AVX512 and AArch64 SVE. Hence, we
propose this patch to fix and re-enable post loop vectorization.

** Changes in this patch

This patch reworks post loop vectorization. The most significant change
is removing vector mask support from the C2 x86 backend and
re-implementing it in the mid-end. With this, post loop vectorization
can be re-enabled on platforms other than x86.

The previous implementation hard-codes the x86 k1 register as a reserved
AVX512 opmask register and defines two routines (setvectmask and
restorevectmask) to set and restore the value of k1. But after
JDK-8211251, which encodes AVX512 instructions as unmasked by default,
the generated vector masks are no longer used by AVX512 vector
instructions. To fix the incorrect codegen and add vector mask support
for more platforms, we instead add a vector mask input to C2 mid-end
IRs. Specifically, we use a VectorMaskGenNode to generate a mask and
replace all Load/Store nodes in the post loop with masked nodes
(LoadVectorMasked/StoreVectorMasked) taking that mask as an input. This
IR shape is exactly the same as the one used for VectorAPI mask support.
For now, we only add mask inputs for Load/Store nodes because reduction
operations are not yet supported in post loop vectorization. After this
change, the x86 k1 register is no longer reserved and can be allocated
when PostLoopMultiversioning is enabled.
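
For illustration only, below is a minimal Java-level sketch of what the
single masked pass computes, written with the jdk.incubator.vector API
rather than this patch's C2 IR. The class and method names here are made
up and the snippet is not part of this patch.

  import jdk.incubator.vector.IntVector;
  import jdk.incubator.vector.VectorMask;
  import jdk.incubator.vector.VectorSpecies;

  class MaskedTailSketch {
      static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

      // Finish the remaining iterations [i, limit) of "res[i] = a[i] * b[i]"
      // in one masked pass. Lanes beyond the remaining trip count are
      // switched off by the mask, which is what VectorMaskGen together with
      // LoadVectorMasked/StoreVectorMasked expresses at the IR level. The
      // post loop has fewer than a full vector of iterations left, i.e.
      // limit - i <= SPECIES.length().
      static void maskedTail(int[] a, int[] b, int[] res, int i, int limit) {
          VectorMask<Integer> m = SPECIES.indexInRange(i, limit);
          IntVector va = IntVector.fromArray(SPECIES, a, i, m);
          IntVector vb = IntVector.fromArray(SPECIES, b, i, m);
          va.mul(vb).intoArray(res, i, m);
      }
  }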

Besides this change, we have fixed a compiler crash and five incorrect
result issues with post loop vectorization.

- 1) C2 crashes with segmentation fault in strip-mined loops

The previous implementation was done before C2 loop strip mining was
merged into the JDK mainline, so it did not take strip-mined loops into
consideration. With strip mining, the post loop is not a sibling of the
main loop in the ideal loop tree; instead, it is a sibling of the main
loop's parent. This patch fixes a SIGSEGV caused by a NULL pointer
dereference when locating the post loop from the strip-mined main loop
(see the sketch below).
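
As context, a rough Java-level picture of a strip-mined counted loop and
its post loop is sketched below. This is an illustration only, not code
from this patch; STRIDE stands for the strip length chosen by C2, and
mainLimit/limit are made-up names for the main loop and overall limits.

  int i = 0;
  while (i < mainLimit) {                  // outer strip-mined loop
      int stripLimit = Math.min(i + STRIDE, mainLimit);
      for (; i < stripLimit; i++) {        // inner (main) counted loop
          res[i] = a[i] * b[i];
      }
      // safepoint poll between strips
  }
  for (; i < limit; i++) {                 // post loop: in the ideal loop
      res[i] = a[i] * b[i];                // tree it sits next to the outer
  }                                        // loop, not the inner main loop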

- 2) Incorrect result issues with post loop vectorization

We have also fixed five incorrect vectorization issues. Some of them are
deeply hidden and can only be reproduced with corner cases. These issues
share a common cause: the old code assumes the post loop can be
vectorized whenever vectorization of the corresponding main loop is
successful, but in many cases this assumption is wrong. Details are
below.

[Issue-1] Incorrect vectorization for partially vectorizable loops

This issue can be reproduced by the loop below, where only some of the
operations in the loop body are vectorizable.

  for (int i = 0; i < 10000; i++) {
    res[i] = a[i] * b[i];
    k = 3 * k + 1;
  }

In the main loop, superword works well even if some operations in the
loop body are not vectorizable, since those parts can simply stay
unrolled. But for post loops, we do not create vectors by combining the
scalar IR nodes generated from loop unrolling. Instead, we replace
scalars with vectors for all operations in the loop body, so all
operations must be either vectorized together or not vectorized at all.
To handle such cases, we add an extra field "_slp_vector_pack_count" in
CountedLoopNode to record the final count of vector packs in the main
loop. This value is passed to the post loop and compared with the post
loop's pack count. Post loop vectorization bails out if it would create
more vector packs than the main loop did.

[Issue-2] Incorrect result in loops with growing-down vectors

This issue appears with growing-down vectors, that is, vectors that grow
toward smaller memory addresses as the loop iterates. It can be
reproduced by the counting-up loop below with a negative scale value in
the array index.

  for (int i = 0; i < 10000; i++) {
    a[MAX - i] = b[MAX - i];
  }

The cause of this issue is that, for a growing-down vector, the
generated vector mask value has a reversed vector-lane order, so it
masks the wrong vector lanes. Note that a negative scale value in a
counting-down loop produces a growing-up vector. Based on this rule, we
fix the issue by allowing only positive array index scales in
counting-up loops and only negative array index scales in counting-down
loops. This check is done with the help of SWPointer, by comparing the
scale value of each memory access in the loop with the loop stride
value.
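
As an illustration of that rule, the counting-down variant below uses
the same negative index scale, but its vectors grow up as the loop
iterates, so it remains vectorizable.

  for (int i = 9999; i >= 0; i--) {
    a[MAX - i] = b[MAX - i];
  }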

[Issue-3] Incorrect result in manually unrolled loops

This issue can be reproduced by the manually unrolled loop below.

  for (int i = 0; i < 10000; i += 2) {
    c[i] = a[i] + b[i];
    c[i + 1] = a[i + 1] * b[i + 1];
  }

In this loop, operations in the second statement duplicate those in the
first statement at a small memory address offset. Main loop
vectorization works well in this case because C2 does further unrolling
and pack combination. But we cannot vectorize the post loop by replacing
scalars with vectors, because that would create duplicated vector
operations. To fix this, we restrict post loop vectorization to loops
with a stride value of 1 or -1.

[Issue-4] Incorrect result in loops with mixed vector element sizes

This issue was found after we enabled post loop vectorization for
AArch64. It is reproducible with multiple array operations of different
element sizes inside a loop (see the example below). On x86 there is no
issue, because the values of AVX512 opmasks depend only on which vector
lanes are active. AArch64 is different: the values of SVE predicates
also depend on the lane size of the vector. Hence, on AArch64 SVE, a
loop with mixed vector element sizes would require different vector
masks. For now, we only support loops with a single vector element size,
i.e., "int + float" vectors in a single loop are fine but "int + double"
vectors in a single loop are not vectorizable. This fix also enables
subword vector support so that array operations on all primitive types
can be vectorized.
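
For illustration, a loop of the shape below (the array names are made
up) mixes 32-bit int lanes with 64-bit double lanes and is therefore
rejected by the post loop vectorizer, while replacing the double array
with a float one would keep it vectorizable.

  for (int i = 0; i < 10000; i++) {
    ia[i] = ia[i] + 1;      // int, 32-bit lanes
    da[i] = da[i] + 1.0;    // double, 64-bit lanes
  }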

[Issue-5] Incorrect result in loops with potential data dependence

This issue can be reproduced by the corner case below, on AArch64 only.

  for (int i = 0; i < 10000; i++) {
    a[i] = x;
    a[i + OFFSET] = y;
  }

In this case, the two stores in the loop have a data dependence if the
OFFSET value is smaller than the vector length, so we cannot vectorize
by replacing scalars with vectors. Yet main loop vectorization succeeds
on AArch64, because AArch64 has partial vector load/store support and
splits a vector fill with different values in its lanes into several
smaller-sized fills. In this patch, we add an additional data dependence
check for this kind of case. The check is also done with the help of the
SWPointer class: we require that any two memory accesses (with at least
one store) of the same element type (or subword size) in the loop have
the same array index expression.
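
For contrast, a loop like the one below still passes this check: the
store to a[i] and the load from b[i] are of the same element type and
share the identical index expression "i".

  for (int i = 0; i < 10000; i++) {
    a[i] = b[i] + x;
  }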

** Tests

So far we have run the full jtreg suite on both x86 AVX512 and AArch64
SVE with the experimental VM option "PostLoopMultiversioning" turned on,
and found no issues. We noticed that the existing test cases are not
sufficient, because some of the issues above are not spotted by them. We
would like to add new cases, but the existing vectorization tests are a
bit cumbersome: golden results must be pre-calculated and hard-coded in
the test code for correctness verification. Thus, in this patch, we
propose a new vectorization testing framework.

Our new framework provides a simpler way to add new cases. For a new
test case, we only need to create a new method annotated with "@test".
The test runner automatically invokes each annotated method twice: the
first time it runs in the interpreter, and the second time it is forced
to be compiled by C2. The two returned results are then compared, so
each test method in this framework should return a primitive value or an
array of primitives. In this way, no extra verification code for
vectorization correctness is required. The test runner is still
jtreg-based and takes advantage of the jtreg WhiteBox API, which allows
test methods to run at specific compilation levels. Each test class is
also jtreg-based; it just needs to inherit from the test runner class
and run with the two additional options "-Xbootclasspath/a:." and
"-XX:+WhiteBoxAPI".

** Summary & Future work

In this patch, we reworked post loop vectorization. We made it platform
independent and fixed several issues in it. We also implemented a new
vectorization testing framework with many test cases, and did some code
cleanups along the way.

This patch only touches C2 code guarded by PostLoopMultiversioning,
except for a few data structure changes. So there is no behavior change
when the experimental VM option PostLoopMultiversioning is off. Also, to
reduce risk, we propose keeping post loop vectorization experimental for
now. If it receives positive feedback, we would like to make it
non-experimental in the future.