
ner analyzer changes, tokenizer initialization updated #152

Merged
lalitpagaria merged 5 commits into obsei:master from akar5h:akarsh_neranalyzer
Jun 30, 2021

Conversation

@akar5h
Contributor

@akar5h akar5h commented Jun 27, 2021

Apart from the tokenizer fix, I also observed one more thing: the documentation for configuring the NER analyzer says "# NER analyzer does not need configuration settings".
But when predicting the NER tags using NERAnalyzer's analyze_input method, analyzer_config is required (below), so the NER tagging fails.

        if analyzer_config is None:
            raise ValueError("analyzer_config can't be None")

So two things can be done: either we document config settings for the analyzer,
or we fix the batch size to 1 in code.
I have done a quick fix here, but I need your comment so I can make a cleaner fix: either make the NER analyzer's analyze_input batch-free, or define a config and a batch size for NER.
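The mismatch could be resolved roughly like this — a minimal sketch of a fallback to a default batch size when no config is passed, using hypothetical names rather than Obsei's actual classes:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AnalyzerConfig:
    # Hypothetical stand-in for Obsei's analyzer config.
    batch_size: int = 1

def analyze_input(texts: List[str], analyzer_config: Optional[AnalyzerConfig] = None) -> List[List[str]]:
    # Instead of `raise ValueError("analyzer_config can't be None")`,
    # fall back to a default config so NER tagging still works without one.
    config = analyzer_config or AnalyzerConfig()
    return [texts[i:i + config.batch_size] for i in range(0, len(texts), config.batch_size)]

print(analyze_input(["a", "b", "c"]))  # batches of size 1: [['a'], ['b'], ['c']]
```

Under this sketch, passing no config degrades gracefully to batch size 1 instead of failing.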

@akar5h
Contributor Author

akar5h commented Jun 27, 2021

fix for #151

tokenizer = None
tokenizer = AutoTokenizer.from_pretrained(
    self.model_name_or_path, use_fast=True
)
Collaborator


Let's simplify the whole code section as follows:

tokenizer = AutoTokenizer.from_pretrained(
    self.tokenizer_name if self.tokenizer_name else self.model_name_or_path, use_fast=True
)

Contributor Author


Yes, this is much cleaner.
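For context, the agreed simplification collapses a branchy initialization into one call, with a conditional expression choosing the checkpoint name. A self-contained sketch of just that fallback logic (the helper name is illustrative; the real code passes the result straight to AutoTokenizer.from_pretrained):

```python
from typing import Optional

def resolve_checkpoint(tokenizer_name: Optional[str], model_name_or_path: str) -> str:
    # Prefer an explicit tokenizer name; otherwise reuse the model checkpoint,
    # which is exactly what the one-liner in the review suggestion does.
    return tokenizer_name if tokenizer_name else model_name_or_path

print(resolve_checkpoint(None, "dslim/bert-base-NER"))               # falls back to model path
print(resolve_checkpoint("bert-base-cased", "dslim/bert-base-NER"))  # explicit name wins
```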

# if analyzer_config is None:
#     raise ValueError("analyzer_config can't be None")
ner_batch_size = 1
Collaborator


Interesting. Thanks for pointing it out.

Ideally batch_size should be part of the Analyzer, hence in this case BaseAnalyzer is the correct place for it.
Let's move batch_size from analyzer_config to BaseAnalyzer and correct this anomaly.

Contributor Author


Isn't BaseAnalyzerConfig inherited by the other AnalyzerConfig classes, which are meant for configuration like batch size or labels?
Moving batch_size to BaseAnalyzer would be less clean.

I suggest we add a BaseAnalyzerConfig as a class variable with batch size 1 in NERAnalyzer; this would be a cleaner way.

class NERAnalyzer(BaseAnalyzer):
    _pipeline: Pipeline = PrivateAttr()
    _max_length: int = PrivateAttr()
    TYPE: str = "NER"
    model_name_or_path: str
    tokenizer_name: Optional[str] = None
    grouped_entities: Optional[bool] = True

    ner_config: BaseAnalyzerConfig = BaseAnalyzerConfig(batch_size=1)  # This line

and use this ner_config in place of analyzer_config in analyze_input.
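A minimal sketch of this class-variable proposal, using plain dataclasses instead of Obsei's pydantic models (so the names and defaults here are illustrative, and the actual analysis logic is replaced by simple batching):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BaseAnalyzerConfig:
    batch_size: int = 1

@dataclass
class NERAnalyzer:
    model_name_or_path: str = "dslim/bert-base-NER"
    # Default config held by the analyzer itself, so callers need not pass
    # analyzer_config at all; NER runs with batch_size=1 by default.
    ner_config: BaseAnalyzerConfig = field(default_factory=lambda: BaseAnalyzerConfig(batch_size=1))

    def analyze_input(self, texts: List[str]) -> List[List[str]]:
        bs = self.ner_config.batch_size
        return [texts[i:i + bs] for i in range(0, len(texts), bs)]
```

A caller could then write `NERAnalyzer().analyze_input(["some text"])` without constructing any config object.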

Collaborator


batch_size is more of an infra property than a configuration (like labels), hence it is better suited near the model name and other parameters. For example, if the transformers model inherently provides batching, then Obsei might skip batching altogether. Also, the value of batch size will depend on where we are running the model via the analyzer, i.e. on a system with high or low memory.

Let's assume a ClassificationAnalyzer running on CPU and on GPU will use different batch_size values but the same analyzer_config.
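The split being argued for could look roughly like this — task settings stay on the config, while batch_size moves onto the analyzer itself (a sketch with hypothetical fields, not Obsei's exact classes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BaseAnalyzerConfig:
    # Task configuration only (e.g. labels); no batch_size here anymore.
    labels: List[str] = field(default_factory=list)

@dataclass
class BaseAnalyzer:
    # Infra property: chosen per deployment (CPU vs GPU, available memory),
    # so it lives next to the model parameters rather than in the config.
    batch_size: int = 1
```

Two deployments of the same ClassificationAnalyzer can then share one analyzer_config while each picks its own batch_size.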

Collaborator

@lalitpagaria lalitpagaria left a comment


@akar5h Thank you for working on it.
I have added a few comments. Can you please address them?

@lalitpagaria lalitpagaria linked an issue Jun 28, 2021 that may be closed by this pull request
else:
    tokenizer = None
tokenizer = AutoTokenizer.from_pretrained(
    self.tokenizer_name if self.tokenizer_name else self.model_name_or_path
)
Collaborator


Please run black formatting on this code.

black refactor of NERAnalyzer (& other changes)

Also removed batch_size from BaseAnalyzerConfig and moved it to BaseAnalyzer;
changed the uses of batch_size wherever required.
@akar5h
Contributor Author

akar5h commented Jun 29, 2021

@lalitpagaria I saw your comments and worked on them: batch size was moved to BaseAnalyzer, and I ran the black refactoring. Any more suggestions? Also, I'll start looking into the workflow build error later tonight if it persists.

Also, this is the first time I'm working on open source, so it will take a bit of time to get the hang of the workflow and processes involved.

@lalitpagaria
Collaborator

@akar5h Thank you very much for choosing Obsei for your first contribution. Don't worry about workflows; you can learn them gradually. They make PR reviewers' and maintainers' lives easy :)

BTW I have reviewed your PR again and it is in very good shape. I have just these two comments:

  1. Type checking is failing; refer to https://github.com/lalitpagaria/obsei/pull/152/checks?check_run_id=2945779483. I think you need to keep this line as it is: https://github.com/lalitpagaria/obsei/pull/152/files#diff-71ab2e5daf0fed8592831d6454f6e4087546f1684a41974aa09ece24f1cbfa06L78
  2. Can you please remove batch_size from BaseAnalyzerConfig?

Please let me know once you resolve these; I will merge your PR.

@akar5h
Contributor Author

akar5h commented Jun 30, 2021

@lalitpagaria I have made these changes. Thanks for pointing me in the right direction.

Collaborator

@lalitpagaria lalitpagaria left a comment


@akar5h Changes look good to me.
Thank you for working on it and choosing Obsei for your first open source contribution.

I will merge after CI jobs completion.

@lalitpagaria
Collaborator

@akar5h it seems that with the batch_size removal from BaseAnalyzerConfig, all tests are failing. Is it possible for you to fix them?

@akar5h
Contributor Author

akar5h commented Jun 30, 2021

Hi @lalitpagaria, I can try to fix them. Is it also possible for me to run the CI tests? It would be easier for me to debug that way, and it would not be such a back-and-forth process; right now I just test on some simple examples locally.

@akar5h
Contributor Author

akar5h commented Jun 30, 2021

I'll refer to the logs and fix them.

@lalitpagaria
Collaborator

lalitpagaria commented Jun 30, 2021

@akar5h it seems the contribution guideline is missing this information. I will add it when I find time. Meanwhile, you can run the tests locally as follows:

pip install pytest
cd test
pytest

@akar5h
Contributor Author

akar5h commented Jun 30, 2021

@lalitpagaria thanks for the pytest tip; the whole testing part is much clearer now.

The same batch_size changes were also required in translation_analyzer.
Also, can you review the changes in the test scripts? Earlier, batch_size was passed as a BaseAnalyzerConfig object in the test files; that was also changed. And in the NER test files,
one of the tests contained the name "Lalit", which the tokenizer split into Lal##, ##it, giving them both the "PER" tag. To make the test consistent across different tokenizers, I changed the test input. Can you review that too?

With these changes all the tests are passing.
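The subword issue described above can be illustrated without loading a model: a WordPiece-style tokenizer may split an out-of-vocabulary name into pieces, and grouping logic then has to merge pieces that share a tag back into one entity. This is a simplified sketch of that grouping, not the transformers pipeline's actual implementation:

```python
from typing import List, Tuple

def group_subword_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge WordPiece continuation pieces (prefixed with '##') into the
    preceding token, keeping one tag per merged word."""
    grouped: List[Tuple[str, str]] = []
    for token, tag in tagged:
        if token.startswith("##") and grouped:
            prev_token, prev_tag = grouped[-1]
            grouped[-1] = (prev_token + token[2:], prev_tag)
        else:
            grouped.append((token, tag))
    return grouped

# "Lalit" is out-of-vocabulary for some tokenizers and splits into two
# pieces, both tagged PER; grouping collapses them back into one entity.
print(group_subword_entities([("Lal", "PER"), ("##it", "PER")]))  # [('Lalit', 'PER')]
```

Choosing a test name that every candidate tokenizer keeps as a single token avoids depending on this grouping behavior at all, which is why changing the test input was the simpler fix.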

Collaborator

@lalitpagaria lalitpagaria left a comment


Great! LGTM!

Thanks for finding the issue, fixing it, and improving the current tests.

@lalitpagaria lalitpagaria merged commit 91d6f4c into obsei:master Jun 30, 2021
@akar5h akar5h deleted the akarsh_neranalyzer branch July 4, 2021 11:20
@lalitpagaria lalitpagaria added the enhancement New feature or request label Oct 5, 2021

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

[BUG] tokenizer loading error the NER Analyzer

2 participants