👩‍💻 API next steps: checklist #85

This issue gives a high-level view of the next main enhancement steps. It will be kept up to date and improved frequently. The work will be conducted mainly by @mk2510 and @henrifroese as part of their Summer of Code project.

---

1. Version 1.10
    - [x] Every representation function to receive as input a `TokenSeries` #44
    - [x] Decouple TF-IDF L2-normalization and TF-IDF #76
    - [x] Rename `term_frequency` to `count()` + add function `term_frequency` #61
    - [x] Introduce `HeroSeries`
    - [x] Add ~ hero.norm(RepresentationSeries, "l1"/"l2")
    - [x] Can we avoid the use of `VectorSeries`/`TokenSeries`?
    - [x] All `representation` functions to deal with `HeroSeries` + (DocumentTermDF) #43  
    - [ ] Update README + getting-started.md
    - [ ] Push a new version to PyPI
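
To illustrate the split that the Version 1.10 items describe (raw counts, relative term frequency, and normalization as separate steps), here is a minimal standard-library sketch. The names `count_terms`, `relative_frequency`, and `normalize` are hypothetical and are not Texthero's API. The sketch also assumes that term frequency is computed per document, which may differ from what the library does.

```python
import math
from collections import Counter


def count_terms(tokens):
    """Raw number of occurrences of each token in one tokenized document."""
    return dict(Counter(tokens))


def relative_frequency(tokens):
    """Counts divided by the document length (assumption: per-document)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def normalize(vector, kind="l2"):
    """Scale a {term: weight} mapping to unit L1 or L2 length.

    Kept separate from the weighting step, so any representation
    (counts, term frequency, TF-IDF) can be normalized afterwards.
    """
    values = list(vector.values())
    if kind == "l1":
        length = sum(abs(v) for v in values)
    elif kind == "l2":
        length = math.sqrt(sum(v * v for v in values))
    else:
        raise ValueError(f"unknown norm: {kind!r}")
    if length == 0:
        # Nothing to scale; return an unchanged copy.
        return dict(vector)
    return {term: v / length for term, v in vector.items()}


if __name__ == "__main__":
    document = ["a", "document", "a"]
    print(count_terms(document))
    print(normalize(relative_frequency(document), "l2"))
```

Keeping normalization as its own step means the caller chooses whether and how to normalize, instead of the weighting function deciding for them.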

1. Performance: speed up the library
    - [ ] Most Texthero data structures are lists of lists (`[["a", "document"], ["another", "document"]]`); can we leverage parallelization? We can learn from spaCy. Mandatory read: [100-times-faster-nlp](https://medium.com/huggingface/100-times-faster-natural-language-processing-in-python-ee32033bdced); see [Modin](https://github.com/modin-project/modin) for parallelization
    - [ ] Make spaCy functions faster + Dask vs spaCy #65
    - [ ] Depending on the previous task, evaluate whether `spaCy` should be the default tokenizer: #131
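
As a rough illustration of the parallelization question above, the sketch below splits a list of tokenized documents into chunks and processes the chunks in worker processes using only the standard library. It is one possible approach under stated assumptions, not how Texthero, spaCy, Dask, or Modin do it. `parallel_apply`, `split_into_chunks`, and `lowercase_tokens` are hypothetical names introduced for this example.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def lowercase_tokens(document):
    """Example per-document function: lowercase every token."""
    return [token.lower() for token in document]


def _apply_to_chunk(args):
    """Apply func to every document in one chunk (runs inside a worker)."""
    func, chunk = args
    return [func(document) for document in chunk]


def split_into_chunks(documents, n_chunks):
    """Split documents into at most n_chunks contiguous slices."""
    size = max(1, -(-len(documents) // n_chunks))  # ceiling division
    return [documents[i:i + size] for i in range(0, len(documents), size)]


def parallel_apply(func, documents, n_workers=2,
                   executor_cls=ProcessPoolExecutor):
    """Apply func to each document in parallel, preserving input order.

    func must be picklable (a module-level function) when a process
    pool is used. Pass executor_cls=ThreadPoolExecutor for a quick
    check without spawning processes.
    """
    chunks = split_into_chunks(documents, n_workers)
    with executor_cls(max_workers=n_workers) as executor:
        # executor.map yields results in the order the chunks were given.
        results = executor.map(_apply_to_chunk,
                               [(func, chunk) for chunk in chunks])
        return [document for chunk in results for document in chunk]


if __name__ == "__main__":
    docs = [["A", "Document"], ["Another", "Document"]]
    print(parallel_apply(lowercase_tokens, docs))
```

Sending whole chunks rather than single documents keeps the inter-process overhead low. Whether this beats a single process depends on how expensive the per-document function is, so it would need benchmarking before any default is changed.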
    
1. Software development:
    - [x] Integrate checking for correct Series types (#60, #55, ...)
    - [ ] Check that hero functions work with `np.nan` #86

1. Support Embeddings through [Flair](https://github.com/flairNLP/flair)
    - [ ] Add hero.embed(s, flairEmbedding)

1. Add Topic Modeling
    - [ ] Add topic modeling support under representation #42.
          This also includes "topic modeling visualization" to get insights out of it
    - [ ] Add a blog article on how topic modeling with Texthero works

1. Extra
    1. Test coverage
    1. Expand multilingual support: more languages; recognize the language and select the correct one
    1. (low priority) Text summarization (#38) and characteristic terms (#2)
      
