ner analyzer changes, tokenizer initialization updated #152
lalitpagaria merged 5 commits into obsei:master
Conversation
fix for #151
obsei/analyzer/ner_analyzer.py (Outdated)

```python
tokenizer = None
tokenizer = tokenizer = AutoTokenizer.from_pretrained(
    self.model_name_or_path, use_fast=True
)
```
Let's simplify the whole code section as follows:

```python
tokenizer = AutoTokenizer.from_pretrained(
    self.tokenizer_name if self.tokenizer_name else self.model_name_or_path, use_fast=True
)
```
Yes, this is much cleaner
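The fallback suggested above can be exercised without pulling in transformers. This is a minimal sketch of the same `x if x else y` pattern; `DummyAnalyzer` and `resolve_tokenizer_name` are illustrative names, not actual Obsei code:

```python
from typing import Optional


class DummyAnalyzer:
    """Illustrative stand-in for the analyzer's tokenizer-name fallback."""

    def __init__(self, model_name_or_path: str, tokenizer_name: Optional[str] = None):
        self.model_name_or_path = model_name_or_path
        self.tokenizer_name = tokenizer_name

    def resolve_tokenizer_name(self) -> str:
        # Fall back to the model name when no explicit tokenizer name is set,
        # mirroring the single AutoTokenizer.from_pretrained call suggested above
        return self.tokenizer_name if self.tokenizer_name else self.model_name_or_path


print(DummyAnalyzer("some-model").resolve_tokenizer_name())
print(DummyAnalyzer("some-model", tokenizer_name="some-tokenizer").resolve_tokenizer_name())
```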
obsei/analyzer/ner_analyzer.py (Outdated)

```python
raise ValueError("analyzer_config can't be None")
# if analyzer_config is None:
#     raise ValueError("analyzer_config can't be None")
ner_batch_size = 1
```
Interesting. Thanks for pointing it out.
Ideally batch_size should be part of the analyzer, hence BaseAnalyzer is the correct place for it.
Let's move batch_size from analyzer_config to BaseAnalyzer and correct this anomaly.
Isn't BaseAnalyzerConfig inherited by the other AnalyzerConfig classes, which are meant for configurations like batch size or labels?
Moving batch_size to BaseAnalyzer would be less clean.
I suggest that we add a BaseAnalyzerConfig as a class variable with batch size 1 in NERAnalyzer; that would be a cleaner way:
```python
class NERAnalyzer(BaseAnalyzer):
    _pipeline: Pipeline = PrivateAttr()
    _max_length: int = PrivateAttr()
    TYPE: str = "NER"
    model_name_or_path: str
    tokenizer_name: Optional[str] = None
    grouped_entities: Optional[bool] = True
    ner_config: BaseAnalyzerConfig = BaseAnalyzerConfig(batch_size=1)  # This line
```
and use this ner_config in place of analyzer_config in analyze_input.
batch_size is more of an infra property than a configuration (like labels), hence it is better suited near model name and other parameters. For example, if a transformers model inherently provides batching, then Obsei might skip batching altogether. Also, the value of batch size will depend on where we are running the model via the analyzer, i.e. a system with high or low memory.
Let's assume a ClassificationAnalyzer running on CPU and GPU: it would use different batch_size values but the same analyzer_config.
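The distinction argued here (batch size as an infra property, labels as task configuration) can be sketched with plain dataclasses. The class and field names below are simplified assumptions, not the actual Obsei definitions:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaseAnalyzerConfig:
    # Task-level configuration: what the analyzer should predict
    labels: List[str] = field(default_factory=list)


@dataclass
class BaseAnalyzer:
    # Infra-level property: tuned per deployment (CPU vs GPU, memory size)
    batch_size: int = 1


# The same analyzer_config is reused across environments,
# while batch_size differs between a CPU and a GPU deployment
config = BaseAnalyzerConfig(labels=["positive", "negative"])
cpu_analyzer = BaseAnalyzer(batch_size=4)
gpu_analyzer = BaseAnalyzer(batch_size=64)
print(cpu_analyzer.batch_size, gpu_analyzer.batch_size, config.labels)
```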
lalitpagaria left a comment
@akar5h Thank you for working on it.
I have added a few comments; can you please address them?
obsei/analyzer/ner_analyzer.py (Outdated)

```python
)
else:
    tokenizer = None
tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_name if self.tokenizer_name else self.model_name_or_path
```
Black refactor of NERAnalyzer (and other changes). Also removed batch_size from BaseAnalyzerConfig and moved it to BaseAnalyzer; changed the uses of batch_size wherever required.
@lalitpagaria I saw your comments and worked on them: batch size was moved to BaseAnalyzer, and I ran the Black refactoring. Any more suggestions? Also, I'll start looking into the workflow build error if it persists later tonight. This is the first time I'm working on open source, so it will take a bit of time to get the hang of the workflow and processes involved.
@akar5h Thank you very much for choosing Obsei for your first contribution. Don't worry about workflows; you can learn them slowly. They make PR reviewers' and maintainers' lives easy :) BTW I have reviewed your PR again and it is in very good shape. I left these two comments -
Please let me know once you resolve these; I will merge your PR.
@lalitpagaria I have made these changes. Thanks for pointing me in the right direction.
lalitpagaria left a comment
@akar5h Changes look good to me.
Thank you for working on it and choosing Obsei for your first open source contribution.
I will merge after CI jobs completion.
@akar5h It seems that with the batch_size removal from BaseAnalyzerConfig, all tests are failing. Is it possible for you to fix them?
Hi @lalitpagaria, I can try and fix them. Is it also possible for me to run the CI tests? It would be easier for me to debug that way, and it would not be such a back-and-forth process; right now I just test them on some simple examples locally.
I'll refer to the logs and fix them.
@akar5h It seems the contribution guideline is missing this information; I will add it when I find time. Meanwhile you can run the tests locally with pytest.
@lalitpagaria Thanks for the pytest tip; the whole testing part became much clearer. The same batch_size changes were required in the tests as well. With these changes all the tests are passing.
lalitpagaria left a comment
Great! LGTM!
Thanks for finding the issue, fixing it, and improving the current tests.
Apart from the tokenizer fix, I also observed one more thing: the documentation for configuring the NER analyzer says "# NER analyzer does not need configuration settings".
But while predicting NER tags using NERAnalyzer's analyze_input method, analyzer_config is required (below), hence the NER tagging fails.
So two things can be done: either we mention config settings for the analyzer, or fix the batch size to 1 in code.
I have done a quick fix here, but I need your comment so I can make a cleaner fix: either make the NER's analyze_input batch-free, or define a config and a batch size for NER.
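A rough sketch of the "batch free" option discussed above: let analyze_input default to batch size 1 when no config is given, instead of raising. The signature and the tagging placeholder below are hypothetical; the real method runs a transformers NER pipeline:

```python
from typing import List, Optional


class AnalyzerConfig:
    """Simplified stand-in for an analyzer config carrying a batch size."""

    def __init__(self, batch_size: int = 1):
        self.batch_size = batch_size


def analyze_input(texts: List[str], analyzer_config: Optional[AnalyzerConfig] = None) -> List[str]:
    # Default to batch size 1 instead of raising when analyzer_config is None,
    # so NER tagging no longer fails without configuration settings
    batch_size = analyzer_config.batch_size if analyzer_config else 1
    tagged: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Placeholder for the NER pipeline call on the whole batch
        tagged.extend(f"NER({text})" for text in batch)
    return tagged


print(analyze_input(["a", "b", "c"]))                     # works with no config
print(analyze_input(["a", "b", "c"], AnalyzerConfig(2)))  # works with an explicit batch size
```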