Vectorization related tasks
This page is a TODO list for tasks related to the GCC vectorizer.
- Replace greedy loop SLP discovery with one based on merging nodes starting from single-lane SLP graph matching the SSA graph
- Do single-lane SLP build when analyzing stmts to be vectorized
- Delay vector type assignment to SLP node analysis (vectorizable_*), compute set of vector types and decide on the vector size by evaluating different sets of working combinations
- Complete load/store permutation lowering for loop vectorization
- Make the vectorization factor support fractional poly-ints to implement re-rolling of loops
- Remove if-conversion, replacing it with masking or on-the-fly if-conversion
- Generate code directly from SLP instead of copying the scalar loop and replacing stmts
- Move pattern detection from stmts to SLP
- Make patterns cancelable
- Make x86 gather and scatter use the internal function instead of the builtins representation
- Split more vectorizable_* into analysis and code generation, store analysis data instead of recomputing it
- Code generate unvectorizable (single-lane) SLP instances by duplicating the scalar code, which implements partial loop vectorization; when nothing is vectorized this amounts to unrolling + interleaving (plus costing)
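As a rough illustration of the if-conversion item above, the following sketch shows a loop with a conditional store next to a hand-written if-converted form. The function names are hypothetical and the second function is written by hand for illustration; it is not output of GCC's if-conversion pass.

```c
#include <stddef.h>

/* Original form: the store to out[i] is control dependent on a[i] > 0. */
void copy_positive_branchy(int *out, const int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] > 0)
            out[i] = a[i];
    }
}

/* Hand-written if-converted form (illustrative only): the branch becomes
   a select, and the store is executed on every iteration, writing back
   the old value of out[i] when the condition is false. */
void copy_positive_if_converted(int *out, const int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int old = out[i];                 /* extra load added by the rewrite */
        out[i] = (a[i] > 0) ? a[i] : old; /* unconditional store */
    }
}
```

Both functions leave `out` with the same contents, but the second one reads and writes every element of `out`. A masked store would instead leave the lanes where the condition is false untouched, which is one reason masking is attractive as a replacement.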
Specific PRs
Old content below
Here is the summary of the Loop-Optimizations BOF that took place at the 2007 GCC summit.
Todo:
- SLP group size relaxation: vectorize only a subset of interleaved stores, or split large groups into subgroups if necessary (PR 49955).
- Enabling the cost model by default (currently enabled only on x86).
- Interleaved stores with gaps: support interleaved stores to non-contiguous memory locations (i.e. with gaps). Related PRs: PR18438, PR19049.
- Interleaving improvements: extend interleaving support to more forms of strided accesses (e.g. non-power-of-2 strides).
- Support certain operations on data types that the target does not support directly but for which vectorization is still possible. For example, support data movements and bitwise operations on 64-bit data types for AltiVec. (TODO: check if this is still needed).
- Vectorize instructions that operate on a sequence of bytes in memory, i.e. instructions whose semantics correspond to C code containing a loop (such as those available on S390).
- Improve debug information (mostly line-number information) for code created by the vectorizer (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00197.html). (TODO: check if this is still needed).
- Reuse generic loop peeling utilities in the vectorizer where possible (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00165.html).
- Data Dependence enhancements:
- Loop-number-of-iterations enhancements:
- Make the gimplifier create COND_EXPR (Zdenek has an initial patch).
- Look into vectorizing Fortran COMMON block arrays better.
- Look into AltiVec-specific problems (PR32107).
- Loop-aware SLP:
- Non-isomorphic computations: the current implementation does not address the case in which the group size (GS) is greater than the vector size (VS) and not all elements of the group are defined by isomorphic computations, but there exists a subgroup of VS elements that are. Currently we attempt to construct the SLP tree from the entire group, and therefore fail and terminate. The analysis could continue if the implementation were extended to explore subgroups of size VS within the SLP group under consideration.
- Allow shifts with different scalar arguments, when the statements that are grouped into the same vector statement have the same argument.
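Two of the loop shapes mentioned in the list above can be sketched in C as follows. These are illustrative inputs only: the function names, the group size of four, and the assumption of two lanes per vector in the second example are hypothetical, and nothing here claims how the vectorizer handles them today.

```c
#include <stddef.h>
#include <stdint.h>

/* Interleaved stores with a gap: each iteration writes elements 0, 1 and 2
   of a group of four and leaves element 3 untouched. */
void store_with_gap(int32_t *out, const int32_t *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[4 * i + 0] = in[i];
        out[4 * i + 1] = in[i] + 1;
        out[4 * i + 2] = in[i] + 2;
        /* out[4 * i + 3] is not written: this is the gap. */
    }
}

/* Shifts with different scalar arguments: the four stores form one group.
   The shift amounts s0 and s1 may differ, but assuming two lanes per
   vector, the statements that would share a vector statement use the
   same amount. Both s0 and s1 must be less than 32. */
void shift_groups(uint32_t *out, const uint32_t *in, size_t n,
                  unsigned s0, unsigned s1)
{
    for (size_t i = 0; i < n; i++) {
        out[4 * i + 0] = in[4 * i + 0] << s0;
        out[4 * i + 1] = in[4 * i + 1] << s0;
        out[4 * i + 2] = in[4 * i + 2] << s1;
        out[4 * i + 3] = in[4 * i + 3] << s1;
    }
}
```

Loops of this shape can serve as small test inputs when checking whether a given compiler version vectorizes them, for instance by inspecting the vectorizer's optimization report.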