add preprocessor/cleaner by shahrukhx01 · Pull Request #110 · obsei/obsei

shahrukhx01 · 2021-05-27T15:27:02Z

Could you review the structure of the code? Is it in line with your expectations and the style guide we have for this repo? I'll then start adding functions and keep updating this PR. Thanks.

issues: #75

obsei/preprocessor/base_text_cleaner.py

lalitpagaria · 2021-05-27T17:59:19Z

@shahrukhx01 Thank you for working on it.
There is a small comment about the design; otherwise the rest is fine.

shahrukhx01 · 2021-05-27T23:34:15Z

I have added the baseline pipeline. The following are my comments about what I did and what I skipped:

1. Lower casing (if possible, take extra care of named entities; for example, "Bush" is a person and "bush" is a word, so there is no point in lowercasing "Bush" if it appears in a sentence)
   Done
2. Stop word removal, stemming, punctuation removal (but it should keep the sentence as it is, i.e. not return tokens)
   Done
3. Excessive whitespace removal
   Done
4. Link removal or adding a filler
   Not started
5. Hashtags: remove the hashtag, remove only the #, or replace it with some filler
   Done -- removed #
6. User tagging (@someuser): remove the user tag, remove only the @, or replace it with some filler
   Done -- removed @
7. Spelling/grammar correction (if it is a heavy model like transformers, then it should go to the Analyzer)
   Could mislead the classifier -- most deep learning models are robust against such cases.
8. User-provided list of regexes (and corresponding substitutes) in case the user wants to customise cleaning
   Not started
9. Decoding Unicode characters into a normalized form, such as UTF-8
   Done
10. Handling of domain-specific words, phrases, and acronyms
   Not started
11. Handling or removing numbers, such as dates and amounts
   Not started
12. Locating and correcting common typos and misspellings --> Point 10 will handle synonyms and acronyms; spell correction may end up slowing down the entire pipeline, as we'd have to look up each token in a dictionary
   Could mislead the classifier -- most deep learning models are robust against such cases.
13. Transliteration of characters from other languages into one fixed language (if it is a heavy model like transformers, then it should go to the Analyzer)
   DL models are generally not affected by such characters.
14. Cleaning up of alphanumeric words
   Alphanumeric tokens sometimes convey important info, such as in addresses.
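A few of the items above (removing only the # or @ sign, whitespace cleanup, Unicode normalization, and user-supplied regex substitutions) can be sketched with the standard library alone. This is only an illustration and not the code in this PR's text_cleaner.py: every function name below is hypothetical, NFKC is an assumed choice of normalization form, and the plain `lower()` call takes no care of named entities.

```python
import re
import unicodedata
from typing import Iterable, Tuple


# All names here are hypothetical; they are not taken from the PR.
def remove_tag_signs(text: str) -> str:
    """Drop a leading '#' or '@' but keep the word that follows it."""
    # The lookbehind leaves things like e-mail addresses (a@b.com) untouched.
    return re.sub(r"(?<!\w)[#@](\w+)", r"\1", text)


def collapse_whitespace(text: str) -> str:
    """Replace runs of whitespace with one space and trim both ends."""
    return " ".join(text.split())


def normalize_unicode(text: str) -> str:
    """Normalize to NFKC (an assumed choice of normalization form)."""
    return unicodedata.normalize("NFKC", text)


def apply_substitutions(text: str, patterns: Iterable[Tuple[str, str]]) -> str:
    """Apply user-provided (regex, replacement) pairs in order."""
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text


def clean(text: str, patterns: Iterable[Tuple[str, str]] = ()) -> str:
    """Run the steps in a fixed order and return a string, not tokens."""
    text = normalize_unicode(text)
    text = remove_tag_signs(text)
    text = apply_substitutions(text, patterns)
    text = collapse_whitespace(text)
    return text.lower()  # naive: does not preserve named entities


sample = "Peter likes  tea at 16:45 #datascience @shahrukh "
print(clean(sample, patterns=[(r"\d{1,2}:\d{2}", "<time>")]))
# prints: peter likes tea at <time> datascience shahrukh
```

Keeping each step as a small `str -> str` function makes it easy to reorder steps or skip one, which is what a configurable pipeline needs.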

shahrukhx01 · 2021-05-27T23:36:12Z

I'll continue adding the rest of the functions. Let me know about the design and styling when you get time to look at it.

obsei/preprocessor/text_cleaner.py

requirements.txt

… into add_preprocessor delete .vscode directory

shahrukhx01 · 2021-05-29T12:47:16Z

I'm more or less done with the first iteration of the text cleaner. Could you please review the code now?

obsei/preprocessor/text_cleaner.py

lalitpagaria · 2021-05-29T19:22:57Z

@shahrukhx01 Your PR is in good shape. A few points:

How about having one meeting so I can explain all design considerations?
Is it possible to add tests? I know that test coverage is not good; we need to improve it slowly.
Can you please rebase onto the latest master changes?

shahrukhx01 · 2021-05-29T23:35:16Z

> How about having one meeting so I can explain all design considerations?

Already texted on Slack about that.

> Is it possible to add tests? I know that test coverage is not good; we need to improve it slowly.

Sure, once you are happy with the final code, I'll start doing that.

> Can you please rebase onto the latest master changes?

Will do.

… into add_preprocessor rebase with master

shahrukhx01 · 2021-05-30T20:57:26Z

Could you please have a look now? I was thinking about how the end user would pick the cleaning functions and ended up with the current design, which uses an enum. With it, the user is less likely to make mistakes and can see which functions are available, and the code looks more elegant. The cleaning pipeline can now be executed something like this; what do you think?

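# AnalyzerRequest, BaseTextProcessorConfig and TextCleaner are assumed to be
# imported already; their import lines are not shown in this comment.
from obsei.preprocessor.base_text_cleaner import CleaningFunctions as clean_funcs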

request = AnalyzerRequest(
    processed_text="Peter drinks likely likes to tea at 16:45 #datascience @shahrukh "
)

conf = BaseTextProcessorConfig(
    text_cleaning_functions=[
        clean_funcs.to_lower_case,
        clean_funcs.remove_stop_words,
        clean_funcs.stem_text,
    ]
)
print(TextCleaner().clean_input(config=conf, input_list=[request]))

requirements.txt

lalitpagaria · 2021-05-31T06:13:08Z

Hi @shahrukhx01
Apologies for the late reply.
I will check today and give feedback. I was thinking that we could do it during our meeting, but I understand it is better to do it before that.

obsei/preprocessor/base_text_cleaner.py

obsei/preprocessor/text_cleaner.py

lalitpagaria · 2021-05-31T06:58:28Z

@shahrukhx01 Thank you very much. Your PR is in very good shape now. There are a few comments, which we can discuss on our call. Sometimes it is difficult to explain in text :)

… into add_preprocessor update requirements

shahrukhx01 · 2021-06-01T20:30:33Z

@lalitpagaria I have made the changes and added the tests. Please review when you get time.

lalitpagaria · 2021-06-02T07:10:42Z

@shahrukhx01 Thanks, I will review it today and share feedback.

…deration

lalitpagaria · 2021-06-02T20:00:36Z

@shahrukhx01 I added a few changes to your PR based on future design considerations.
Overall you did excellent work; I liked your test cases, as they cover the major scenarios.
They even uncovered one bug in the code. I am merging your PR.
Thank you for your contribution. 🙏

lalitpagaria

LGTM

add preprocessor boilerplate

e09b587

lalitpagaria reviewed May 27, 2021

obsei/preprocessor/base_text_cleaner.py

lalitpagaria reviewed May 27, 2021

obsei/preprocessor/base_text_cleaner.py

add text preprocessing baseline functions

eff24c6

lalitpagaria reviewed May 28, 2021

obsei/preprocessor/text_cleaner.py

lalitpagaria reviewed May 28, 2021

obsei/preprocessor/text_cleaner.py

lalitpagaria reviewed May 28, 2021

obsei/preprocessor/text_cleaner.py