`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers

With contiguous memory range and the default predicate (a flavor of `less`) or no predicate it can be vectorized.
SSE4.1 `max_element` algorithm. 

Vertical part:
 * `_mm_cmpgt_epi` to compute mask of greater value of current against previous
 * `_mm_add_epi` to count indices of the current value (add precomputed vector `_mm_set1_epi` with 1)
 * `_mm_blendv_epi8` to select indices of greater value into `_Cur_idx_max`
 * `_mm_max_epi` to update previous maximum, or alternatively can use `_mm_blendv_epi8` again (for 64-bit where there's no max)
 * loop until the end or until some number to make sure accumulated `_Cur_idx_max` does not overflow

Horizontal part:
 * out of the vector, select greatest value by shuffling and comparing with itself with `_mm_max_epi`
 * find mask of greatest value by `_mm_cmpeq_epi` with maximum
 * use mask of greatest value with `_mm_blendv_epi8` on `_Cur_idx_max` to get vector of indices of greater value or all FFs for not the greatest indices
 * out of indices vector find smallest indices by shuffling and comparing with itself with  `_mm_min_epu`
 * find smallest indices mask. Bitwise-and this mask with the previous indices mask to handle the case if all FFs is actually the index.
 * Extract the mask to GPR and `_BitScanForward` into `_H_pos`
 * knowing index of greater element in vector, extract its index `_Cur_idx_max` into `_V_pos`
 * the answer is `_V_pos*16 + _Result_index` relative to the starting pointer in bytes
 * if loop stopped to prevent `_Result_positions` overflow, start again with previous maximum already initialized

(For 64-bit vector it takes `_mm_cmpgt_epi64` from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)

It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

<algorithm>: vectorize min_element / max_element using SSE4.1/AVX2 for integers #2438

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`<algorithm>`: vectorize `min_element` / `max_element` using SSE4.1/AVX2 for integers #2438