add preprocessor/cleaner #110

Merged
lalitpagaria merged 24 commits into obsei:master from shahrukhx01:add_preprocessor
Jun 2, 2021
Conversation

@shahrukhx01
Collaborator

Could you review the structure of the code? Is it in line with your expectations and the style guide we have for this repo? I'll then start adding functions and keep updating this PR. Thanks.

issues: #75

@lalitpagaria
Collaborator

@shahrukhx01 Thank you for working on it.
There is a small comment about the design; otherwise the rest is fine.

@shahrukhx01
Collaborator Author

shahrukhx01 commented May 27, 2021

I have added the baseline pipeline. Below are my comments on what I did and what I skipped:

  1. Lowercasing (if possible, take extra care with named entities: for example, "Bush" is a person and "bush" is a common word, so there is no point in lowercasing "Bush" if it appears in a sentence)
    Done
  2. Stop word removal, stemming, punctuation removal (but it should keep the sentence as it is, i.e. not return tokens)
    Done
  3. Excessive whitespace removal
    Done
  4. Link removal or adding a filler
    not started
  5. Hashtags: remove the hashtag, remove only the #, or replace it with some filler
    Done -- removed #
  6. User tagging (@someuser): remove the user tag, remove only the @, or replace it with some filler
    Done -- removed @
  7. Spelling/grammar correction (if it needs a heavy model like transformers, it should go to the Analyzer)
    Could mislead the classifier -- most deep learning models are robust against such cases.
  8. User-provided list of regexes (and corresponding substitutes) in case they want to customise cleaning
    not started
  9. Decoding Unicode characters into a normalized form, such as UTF8
    Done
  10. Handling of domain-specific words, phrases, and acronyms
    not started
  11. Handling or removing numbers, such as dates and amounts
    not started
  12. Locating and correcting common typos and misspellings --> Point 10 will handle synonyms and acronyms; spell correction may end up slowing down the entire pipeline, as we'd have to look up each token in a dictionary
    Could mislead the classifier -- most deep learning models are robust against such cases.
  13. Transliteration of characters from other languages into one fixed language (if it needs a heavy model like transformers, it should go to the Analyzer)
    DL models generally aren't affected by such characters
  14. Cleaning up of alphanumeric words
    Alphanumeric tokens sometimes convey important info, such as in addresses.
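
A few of the steps marked Done above (Unicode normalization, removing only the # or @, and collapsing excess whitespace) can be sketched with the Python standard library alone. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and NFKC is an assumed normalization form, since the list does not name one.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    # NFKC is an assumed choice of normalization form; the PR does not name one.
    return unicodedata.normalize("NFKC", text)


def strip_tag_symbols(text: str) -> str:
    # Remove a leading '#' or '@' but keep the word that follows it.
    # The lookbehind leaves symbols inside a word (e.g. e-mail addresses) alone.
    return re.sub(r"(?<!\w)[#@](?=\w)", "", text)


def collapse_whitespace(text: str) -> str:
    # Split on any run of whitespace and re-join with single spaces;
    # this also trims leading and trailing whitespace.
    return " ".join(text.split())


def clean(text: str) -> str:
    # Apply the steps in a fixed order and return a sentence, not tokens.
    for step in (normalize_unicode, strip_tag_symbols, collapse_whitespace):
        text = step(text)
    return text


print(clean("Peter  likes tea   at 16:45 #datascience @shahrukh "))
```

Each step takes a string and returns a string, so the sentence is kept intact rather than tokenized, as item 2 asks.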

@shahrukhx01
Collaborator Author

I'll continue adding the rest of the functions. Let me know about the design and styling when you get time to look at it.

@shahrukhx01
Collaborator Author

I'm more or less done with the first iteration of the text cleaner. Could you please review the code now?

@lalitpagaria
Collaborator

@shahrukhx01 Your PR is in good shape. A few points:

  1. How about having one meeting so I can explain all the design considerations?
  2. Is it possible to add tests? I know that test coverage is not good; we need to build it up slowly
  3. Can you please rebase onto the latest master changes?

@shahrukhx01
Collaborator Author

  1. How about having one meeting so I can explain all the design considerations?

Already messaged you on Slack about that.

  2. Is it possible to add tests? I know that test coverage is not good; we need to build it up slowly

Sure, once you are happy with the final code, I'll start doing that.

  3. Can you please rebase onto the latest master changes?

Will do.

@shahrukhx01
Collaborator Author

shahrukhx01 commented May 30, 2021

Could you please have a look now? I was thinking about how the end user would pick the cleaning functions and ended up with the current design. By using the enum, the user won't make mistakes and can see which functions are available; it also makes the code look more elegant. The cleaning pipeline can now be executed something like this. What do you think?

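from obsei.preprocessor.base_text_cleaner import CleaningFunctions as clean_funcs

# AnalyzerRequest, BaseTextProcessorConfig and TextCleaner also need to be imported;
# their module paths are not shown in this comment.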

request = AnalyzerRequest(
    processed_text="Peter drinks likely likes to tea at 16:45 #datascience @shahrukh "
)

conf = BaseTextProcessorConfig(
    text_cleaning_functions=[
        clean_funcs.to_lower_case,
        clean_funcs.remove_stop_words,
        clean_funcs.stem_text,
    ]
)
print(TextCleaner().clean_input(config=conf, input_list=[request]))

@lalitpagaria
Collaborator

Hi @shahrukhx01
Apologies for the late reply.
I will check today and give feedback. I was thinking that we could do it during our meeting, but I understand it is better to do it before that.

@lalitpagaria
Collaborator

@shahrukhx01 Thank you very much. Your PR is in very good shape now. There are a few comments, which we can discuss on our call. Sometimes it is difficult to explain in text :)

@shahrukhx01
Collaborator Author

@lalitpagaria I have made the changes and added the tests. Please review when you get time.

@lalitpagaria
Collaborator

@shahrukhx01 Thanks, I will review it today and share feedback.

@lalitpagaria
Collaborator

@shahrukhx01 I added a few changes to your PR based on future design considerations.
Overall you did excellent work; I liked your test cases, as you have covered the major scenarios.
They even uncovered one bug in the code. I am merging your PR.
Thank you for your contribution. 🙏

Collaborator

@lalitpagaria lalitpagaria left a comment

LGTM

@lalitpagaria lalitpagaria merged commit ca1f8ec into obsei:master Jun 2, 2021
@lalitpagaria lalitpagaria linked an issue Jun 2, 2021 that may be closed by this pull request
@lalitpagaria lalitpagaria changed the title WIP: add preprocessor/cleaner add preprocessor/cleaner Jun 2, 2021
@lalitpagaria lalitpagaria mentioned this pull request Jun 2, 2021
@lalitpagaria lalitpagaria added the enhancement New feature or request label Oct 5, 2021
Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Add text cleaner node

2 participants