[BUG] Tokenizer loading error in the NER Analyzer #151

@akar5h

Describe the bug
The tokenizer is not loaded when the NER Analyzer is initialized, which raises the exception below:
Exception: Impossible to guess which tokenizer to use. Please provide a PreTrainedTokenizer class or a path/identifier to a pretrained tokenizer.

I also looked at the NERAnalyzer class and I might have a fix as well.

To Reproduce
I took this snippet from the documentation, under "Step 4: Configure Analyzer" >> "NER Analyzer":

from obsei.analyzer.ner_analyzer import NERAnalyzer

# NER analyzer does not need configuration settings
analyzer_config=None

# initialize ner analyzer
# For supported models refer https://huggingface.co/models?filter=token-classification
text_analyzer = NERAnalyzer(
    model_name_or_path="elastic/distilbert-base-cased-finetuned-conll03-english",
    device="auto",
)

Running it raises the exception stated above.

Running on Google Colab:
OS ="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"

Additional context
I think that when the NERAnalyzer class is initialized without a tokenizer_name, the tokenizer is never loaded and is set to None:

class NERAnalyzer(BaseAnalyzer):
    _pipeline: Pipeline = PrivateAttr()
    _max_length: int = PrivateAttr()
    TYPE: str = "NER"
    model_name_or_path: str
    tokenizer_name: Optional[str] = None
    grouped_entities: Optional[bool] = True

    def __init__(self, **data: Any):
        super().__init__(**data)

        model = AutoModelForTokenClassification.from_pretrained(self.model_name_or_path)
        if self.tokenizer_name:
            tokenizer = AutoTokenizer.from_pretrained(
                self.tokenizer_name, use_fast=True
            )
        else:
            tokenizer = None
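
One possible fix, sketched here as an assumption rather than as the project's actual patch, is to fall back to model_name_or_path whenever tokenizer_name is not given, so that a tokenizer is always loaded. The helper name resolve_tokenizer_source is hypothetical and not part of obsei, and "bert-base-cased" is only an example identifier. The sketch isolates the fallback logic so it can be run without downloading a model:

```python
from typing import Optional


def resolve_tokenizer_source(
    model_name_or_path: str, tokenizer_name: Optional[str] = None
) -> str:
    """Return the identifier the tokenizer should be loaded from.

    Hypothetical helper (not part of obsei): prefer an explicit
    tokenizer_name, otherwise fall back to the model identifier.
    """
    return tokenizer_name if tokenizer_name else model_name_or_path


# Inside NERAnalyzer.__init__, the if/else could then collapse to something like:
#     tokenizer = AutoTokenizer.from_pretrained(
#         resolve_tokenizer_source(self.model_name_or_path, self.tokenizer_name),
#         use_fast=True,
#     )

model_id = "elastic/distilbert-base-cased-finetuned-conll03-english"
print(resolve_tokenizer_source(model_id))                     # falls back to the model id
print(resolve_tokenizer_source(model_id, "bert-base-cased"))  # explicit name is kept
```

Until a change along these lines is made, passing tokenizer_name explicitly (for example, the same value as model_name_or_path) when constructing NERAnalyzer may work around the error, since the class already accepts that field.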

Labels: bug (Something isn't working)