The implementation should be copied from zsl.cpp into a single function, similar to clip_compare_text_and_image, that implements the logic end-to-end.
Bonus: it could support a multi-label scheme like HuggingFace's ZSL pipeline, i.e., without squashing all the scores through a single softmax.
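
To make the single-label vs. multi-label distinction concrete, here is a rough sketch of the scoring step only. It is one possible reading of the bonus, not the code in zsl.cpp: the name zsl_scores_sketch, the logit_scale parameter, and the choice to return raw similarities in multi-label mode are all assumptions for illustration. It expects the per-label image-text similarities to have been computed already.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of clip.cpp: converts per-label image-text
// similarities into scores.
//   multi_label == false: softmax over all labels (scores sum to 1).
//   multi_label == true:  each label keeps its own similarity, so scores
//                         are independent and do not sum to 1 (assumption).
std::vector<float> zsl_scores_sketch(const std::vector<float>& similarities,
                                     float logit_scale,
                                     bool multi_label) {
    assert(!similarities.empty());

    if (multi_label) {
        // No cross-label normalization: return the similarities unchanged.
        return similarities;
    }

    // Numerically stable softmax: subtract the maximum before exponentiating.
    const float max_sim =
        *std::max_element(similarities.begin(), similarities.end());
    std::vector<float> scores(similarities.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < similarities.size(); ++i) {
        scores[i] = std::exp(logit_scale * (similarities[i] - max_sim));
        sum += scores[i];
    }
    for (float& s : scores) {
        s /= sum;
    }
    return scores;
}
```

With multi_label set to false the labels compete for a total probability of 1; with it set to true, several labels can score highly for the same image, and the caller would pick a threshold instead of taking the argmax.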