A type conversion seems to be missing in the forward pass of VisionTransformer, in clip/model.py. Running the forward pass directly, without converting the input beforehand (Line 342), raises a type-mismatch error.
For reference, ModifiedResNet does perform an explicit type conversion at Line 146.
Would it be possible to add the conversion x = x.type(self.conv1.weight.dtype) between Lines 223 and 224?
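
To illustrate the mismatch and the suggested cast, here is a minimal sketch. It is not the actual CLIP code: PatchEmbed and cast_input are hypothetical names, and the weights are converted to float64 rather than half precision so that the example also runs on CPU. The idea is the same: the input dtype differs from the dtype of conv1's weights until it is cast.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Hypothetical stand-in for the first layer of VisionTransformer."""

    def __init__(self, cast_input: bool):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=4, stride=4, bias=False)
        self.cast_input = cast_input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.cast_input:
            # The suggested fix: match the input dtype to the conv weights.
            x = x.type(self.conv1.weight.dtype)
        return self.conv1(x)


image = torch.randn(1, 3, 8, 8)  # float32 input, as a caller might pass it

# Weights converted to float64 (assumption: stands in for the fp16 weights).
without_cast = PatchEmbed(cast_input=False).double()
try:
    without_cast(image)
    mismatch_raised = False
except RuntimeError:
    # float32 input against float64 weights is rejected by the convolution.
    mismatch_raised = True

with_cast = PatchEmbed(cast_input=True).double()
out = with_cast(image)  # input is cast inside forward, so this succeeds
```

With the cast in place, the caller no longer has to convert the input to the model's dtype before calling forward.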