CodeBERT support for embeddings #488
Conversation
Thanks for the PR, CodeBERT looks really interesting to us, especially the code search functionality. But does the code run for all CodeBERT use cases? The code search functionality looks like text pair classification, which is already possible with the current FARM, so for that use case we should be able to load CodeBERT. The MLM objective for "code docstring generation" is something we would need to implement in FARM. What are your thoughts on that?
I only checked the model for classification, because I was interested in using it for retrieval (actually I used Haystack). Docstring generation would be pretty limited given that there is only an MLM model (I don't think it can handle variable-length output), and Microsoft doesn't plan to release a model that can actually be used for text generation (see this issue). If you have spare compute, someone (for example me) could actually finetune CodeBERT for docstring generation, but for now at least it's pretty hard, given that it was trained on 4 P40 GPUs and the data preprocessing step doesn't fit in 32 GB of RAM.
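The retrieval use case mentioned above boils down to ranking candidate code snippets by the similarity of their embeddings to a query embedding. A minimal, framework-free sketch of that ranking step (the toy vectors below are made-up stand-ins for real CodeBERT embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_snippets(query_emb, snippet_embs):
    """Return snippet indices sorted by similarity to the query, best first."""
    scores = [cosine_similarity(query_emb, emb) for emb in snippet_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy 3-dimensional embeddings (real CodeBERT embeddings are 768-dimensional).
query = [1.0, 0.0, 1.0]
snippets = [[0.9, 0.1, 1.1], [0.0, 1.0, 0.0], [1.0, 0.0, 0.9]]
ranking = rank_snippets(query, snippets)  # best matches first
```

In a real setup the embeddings would come from the model's output, and a vector store would replace the brute-force loop, but the scoring idea is the same.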
Nice, I am fine with just going for the text pair classification case. Will it create problems if we load the
I've checked the
Hey, thanks a lot for checking. I would still prefer to exclude the MLM model from loading, because loading an MLM model and then doing classification with it would confuse people.
Changed. |
Timoeller
left a comment
Thanks for updating the PR.
I made a comment on the additional changes you made.
farm/modeling/language_model.py
Outdated
        language_model = cls.subclasses[language_model_class].load(pretrained_model_name_or_path, **kwargs)
    else:
        language_model = None
    try:
Hey, this seems like a solid idea and we should add this AutoModel functionality soon, but this implementation will create errors downstream that might be tough for a user to understand.
Each model has to be implemented (it's often not much work) and checked for interoperability with FARM's LanguageModel class.
But I agree, we currently load models based on the model string, which is suboptimal.
We could load models based on the config and only select those that are already implemented in FARM.
If you want to work on this functionality, please create a separate PR for it and flag it as work in progress, so we can give early feedback.
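The config-based idea above can be sketched roughly as follows. This is a hypothetical illustration, not FARM's actual API: the class names and the `select_language_model_class` helper are made up, but the principle is to dispatch on the config's `model_type` field and fail early with a clear error for anything not yet implemented.

```python
# Map of model_type values (as found in a Hugging Face config.json)
# to implemented language-model classes. Names here are illustrative.
IMPLEMENTED = {
    "bert": "Bert",
    "roberta": "Roberta",
    "xlm-roberta": "XLMRoberta",
}

def select_language_model_class(config: dict) -> str:
    """Pick an implemented model class from a config dict, or fail loudly."""
    model_type = config.get("model_type")
    if model_type not in IMPLEMENTED:
        raise ValueError(
            f"Model type '{model_type}' is not implemented; "
            f"supported types: {sorted(IMPLEMENTED)}"
        )
    return IMPLEMENTED[model_type]

chosen = select_language_model_class({"model_type": "roberta"})
```

Failing at load time with a list of supported types is friendlier than the downstream errors a blanket AutoModel fallback would produce.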
I guess something went wrong with the mail reply? In my experience it is nearly always best to use the GitHub UI for answering...
I removed the AutoModel functionality.
I've added support for CodeBERT models. These models essentially use the RoBERTa class and tokenizer, so it was only necessary to load the Hugging Face model.