ner analyzer changes, tokenizer initialization updated #152
lalitpagaria merged 5 commits into obsei:master
Conversation
fix for #151
obsei/analyzer/ner_analyzer.py (Outdated)

```python
tokenizer = None
tokenizer = tokenizer = AutoTokenizer.from_pretrained(
    self.model_name_or_path, use_fast=True
)
```
Let's simplify the whole code section as follows:

```python
tokenizer = AutoTokenizer.from_pretrained(
    self.tokenizer_name if self.tokenizer_name else self.model_name_or_path, use_fast=True
)
```
Yes, this is much cleaner
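The fallback suggested above can be exercised without pulling in transformers. This is a minimal sketch of the same `x if x else y` pattern; `DummyAnalyzer` and `resolve_tokenizer_name` are illustrative names, not actual Obsei code:

```python
from typing import Optional


class DummyAnalyzer:
    """Illustrative stand-in for the analyzer's tokenizer-name fallback."""

    def __init__(self, model_name_or_path: str, tokenizer_name: Optional[str] = None):
        self.model_name_or_path = model_name_or_path
        self.tokenizer_name = tokenizer_name

    def resolve_tokenizer_name(self) -> str:
        # Fall back to the model name when no explicit tokenizer name is set,
        # mirroring the single AutoTokenizer.from_pretrained call suggested above
        return self.tokenizer_name if self.tokenizer_name else self.model_name_or_path


print(DummyAnalyzer("some-model").resolve_tokenizer_name())
print(DummyAnalyzer("some-model", tokenizer_name="some-tokenizer").resolve_tokenizer_name())
```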
obsei/analyzer/ner_analyzer.py (Outdated)

```python
raise ValueError("analyzer_config can't be None")
# if analyzer_config is None:
#     raise ValueError("analyzer_config can't be None")
ner_batch_size = 1
```
Interesting. Thanks for pointing it out.
Ideally batch_size should be part of the analyzer, hence BaseAnalyzer is the correct place for it.
Let's move batch_size from analyzer_config to BaseAnalyzer and correct this anomaly.
Isn't BaseAnalyzerConfig inherited by the other AnalyzerConfig classes, which are meant for configurations like batch size or labels?
Moving batch_size to BaseAnalyzer would be less clean.
I suggest that we add a BaseAnalyzerConfig as a class variable with batch size 1 in NERAnalyzer; that would be a cleaner way:
```python
class NERAnalyzer(BaseAnalyzer):
    _pipeline: Pipeline = PrivateAttr()
    _max_length: int = PrivateAttr()
    TYPE: str = "NER"
    model_name_or_path: str
    tokenizer_name: Optional[str] = None
    grouped_entities: Optional[bool] = True
    ner_config: BaseAnalyzerConfig = BaseAnalyzerConfig(batch_size=1)  # This line
```
and use this ner_config in place of analyzer_config in analyze_input.
batch_size is more of an infra property than a configuration (like labels), hence it is better suited near model name and other parameters. For example, if a transformers model inherently provides batching, then Obsei might skip batching altogether. Also, the value of batch size will depend on where we are running the model via the analyzer, i.e. a system with high or low memory.
Let's assume a ClassificationAnalyzer running on CPU and GPU: it would use different batch_size values but the same analyzer_config.
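The distinction argued here (batch size as an infra property, labels as task configuration) can be sketched with plain dataclasses. The class and field names below are simplified assumptions, not the actual Obsei definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaseAnalyzerConfig:
    # Task-level configuration: what the analyzer should predict
    labels: List[str] = field(default_factory=list)


@dataclass
class BaseAnalyzer:
    # Infra-level property: tuned per deployment (CPU vs GPU, memory size)
    batch_size: int = 1


# The same analyzer_config is reused across environments,
# while batch_size differs between a CPU and a GPU deployment
config = BaseAnalyzerConfig(labels=["positive", "negative"])
cpu_analyzer = BaseAnalyzer(batch_size=4)
gpu_analyzer = BaseAnalyzer(batch_size=64)
print(cpu_analyzer.batch_size, gpu_analyzer.batch_size, config.labels)
```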
lalitpagaria left a comment
@akar5h Thank you for working on it.
I have added a few comments; can you please address them?
obsei/analyzer/ner_analyzer.py (Outdated)

```python
)
else:
    tokenizer = None
tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name if self.tokenizer_name else self.model_name_or_path
```
Black refactor of NERAnalyzer (and other changes). Also removed batch_size from BaseAnalyzerConfig and moved it to BaseAnalyzer; changed the uses of batch_size wherever required.
@lalitpagaria I saw your comments and worked on them: batch size was moved to BaseAnalyzer, and I ran the Black refactoring. Any more suggestions? Also, I'll start looking into the workflow build error if it persists later tonight. This is the first time I'm working on open source, so it will take a bit of time to get the hang of the workflow and processes involved.
@akar5h Thank you very much for choosing Obsei for your first contribution. Don't worry about workflows; you can learn them slowly. They make PR reviewers' and maintainers' lives easy :) BTW I have reviewed your PR again and it is in very good shape. I left these two comments -
Please let me know once you resolve these; I will merge your PR.
@lalitpagaria I have made these changes. Thanks for pointing me in the right direction.
lalitpagaria left a comment
@akar5h Changes look good to me.
Thank you for working on it and choosing Obsei for your first open source contribution.
I will merge after CI jobs completion.
@akar5h It seems that with the batch_size removal from BaseAnalyzerConfig, all tests are failing. Is it possible for you to fix them?
Hi @lalitpagaria, I can try and fix them. Is it also possible for me to run the CI tests? It would be easier for me to debug that way, and it would not be such a back-and-forth process; right now I just test them on some simple examples locally.
I'll refer to the logs and fix them.
@akar5h It seems the contribution guideline is missing this information; I will add it when I find time. Meanwhile you can run the tests locally with pytest.
@lalitpagaria Thanks for the pytest tip; the whole testing part became much clearer. The same batch_size changes were required in the tests as well. With these changes all the tests are passing.
lalitpagaria left a comment
Great! LGTM!
Thanks for finding the issue, fixing it, and improving the current tests.
Apart from the tokenizer fix, I also observed one more thing: the documentation for configuring the NER analyzer says "# NER analyzer does not need configuration settings".
But while predicting NER tags using NERAnalyzer's analyze_input method, analyzer_config is required (below), hence the NER tagging fails.
So two things can be done: either we mention config settings for the analyzer, or fix the batch size to 1 in code.
I have done a quick fix here, but I need your comment so I can make a cleaner fix: either make the NER's analyze_input batch-free, or define a config and a batch size for NER.
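A rough sketch of the "batch free" option discussed above: let analyze_input default to batch size 1 when no config is given, instead of raising. The signature and the tagging placeholder below are hypothetical; the real method runs a transformers NER pipeline:

```python
from typing import List, Optional


class AnalyzerConfig:
    """Simplified stand-in for an analyzer config carrying a batch size."""

    def __init__(self, batch_size: int = 1):
        self.batch_size = batch_size


def analyze_input(texts: List[str], analyzer_config: Optional[AnalyzerConfig] = None) -> List[str]:
    # Default to batch size 1 instead of raising when analyzer_config is None,
    # so NER tagging no longer fails without configuration settings
    batch_size = analyzer_config.batch_size if analyzer_config else 1
    tagged: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Placeholder for the NER pipeline call on the whole batch
        tagged.extend(f"NER({text})" for text in batch)
    return tagged


print(analyze_input(["a", "b", "c"]))                     # works with no config
print(analyze_input(["a", "b", "c"], AnalyzerConfig(2)))  # works with an explicit batch size
```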