With contiguous memory range and the default predicate (a flavor of less) or no predicate it can be vectorized.
SSE4.1 max_element algorithm.
Vertical part:
_mm_cmpgt_epi to compute mask of greater value of current against previous
_mm_add_epi to count indices of the current value (add precomputed vector _mm_set1_epi with 1)
_mm_blendv_epi8 to select indices of greater value into _Cur_idx_max
_mm_max_epi to update previous maximum, or alternatively can use _mm_blendv_epi8 again (for 64-bit where there's no max)
- loop until the end or until some number to make sure accumulated
_Cur_idx_max does not overflow
Horizontal part:
- out of the vector, select greatest value by shuffling and comparing with itself with
_mm_max_epi
- find mask of greatest value by
_mm_cmpeq_epi with maximum
- use mask of greatest value with
_mm_blendv_epi8 on _Cur_idx_max to get vector of indices of greater value or all FFs for not the greatest indices
- out of indices vector find smallest indices by shuffling and comparing with itself with
_mm_min_epu
- find smallest indices mask. Bitwise-and this mask with the previous indices mask to handle the case if all FFs is actually the index.
- Extract the mask to GPR and
_BitScanForward into _H_pos
- knowing index of greater element in vector, extract its index
_Cur_idx_max into _V_pos
- the answer is
_V_pos*16 + _Result_index relative to the starting pointer in bytes
- if loop stopped to prevent
_Result_positions overflow, start again with previous maximum already initialized
(For 64-bit vector it takes _mm_cmpgt_epi64 from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)
It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.
With contiguous memory range and the default predicate (a flavor of
less) or no predicate it can be vectorized.SSE4.1
max_elementalgorithm.Vertical part:
_mm_cmpgt_epito compute mask of greater value of current against previous_mm_add_epito count indices of the current value (add precomputed vector_mm_set1_epiwith 1)_mm_blendv_epi8to select indices of greater value into_Cur_idx_max_mm_max_epito update previous maximum, or alternatively can use_mm_blendv_epi8again (for 64-bit where there's no max)_Cur_idx_maxdoes not overflowHorizontal part:
_mm_max_epi_mm_cmpeq_epiwith maximum_mm_blendv_epi8on_Cur_idx_maxto get vector of indices of greater value or all FFs for not the greatest indices_mm_min_epu_BitScanForwardinto_H_pos_Cur_idx_maxinto_V_pos_V_pos*16 + _Result_indexrelative to the starting pointer in bytes_Result_positionsoverflow, start again with previous maximum already initialized(For 64-bit vector it takes
_mm_cmpgt_epi64from SSE4.2 actually, but we don't distinguish SSE4.1 and SSE4.2 anyway)It all complex due to that algorithms return iterator, not value. Still the vectorization should provide perf improvement.