350 NLP Projects with Code
The Most Powerful NLP-Weapon Arsenal
Himanshu Ramchandani
M.Tech | Data Science
NLP Migrant Workers' Paradise: Almost the most complete
Chinese NLP resource library
While getting started with NLP and becoming familiar with it, I used many packages from GitHub, so I organized them and am sharing them here.
Many of these packages are interesting and worth bookmarking, which should satisfy any collector's urge. If you find the list useful, please share and star it. Thanks! ⭐ ❤️❤️❤️
The list is updated irregularly over the long term; you are welcome to watch and fork. 🍆🍒🍐🍊 🌻🍓🍈🍅🍍
* Corpus
* Thesaurus and lexical tools
* Pre-trained language model
* Extraction
* Knowledge graph
* Text generation
* Text summarization
* Intelligent question answering
* Text error correction
* Document processing
* Table processing
* Text matching
* Text data augmentation
* Text retrieval
* Reading comprehension
* Sentiment analysis
* Common regular expressions
* Speech processing
* Event extraction
* Machine translation
* Number conversion
* Anaphora resolution
* Text clustering
* Text classification
* Knowledge reasoning
* Explainable NLP
* Text adversarial attack
* Text visualization
* Text annotation tool
* Comprehensive tool
* Fun and humorous tools
* Courses, reports, interviews, etc.
* Competition
* Financial NLP
* Medical NLP
* Legal NLP
* Text-to-image generation
* Others
corpus
Resource name (Name) | Description | Link
Corpus of names | | wainshine/Chinese-Names-Corpus
Chinese-Word-Vectors | Various Chinese word vectors | github repo
Chinese Chat Corpus | Includes the Douban multi-turn corpus, PTT gossip corpus, Qingyun corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, and Xiaohuangji (Little Yellow Chicken) corpus | link
Chinese rumor data | Each line of the data file contains one rumor record in JSON format | github
Chinese Question Answering Dataset | | link (extraction code: 2dva)
WeChat official account corpus | A 3 GB corpus of articles crawled from WeChat official accounts, with HTML removed so that only plain text remains. One article per line in JSON format: name is the name of the official account, account is its ID, title is the article title, and content is the body text | github
Chinese natural github
language processing
corpus, data set
Task-based dialogue English dataset | [The most complete task-based dialogue dataset] A comprehensive overview of task-based dialogue datasets, covering the key information of all commonly used datasets in the field. To help researchers follow progress in the field, state-of-the-art experimental results on several datasets are presented as a leaderboard | github
Speech Recognition Create an Automatic Speech github
Corpus Generation Recognition (ASR) corpus
Tool from online videos with
audio/subtitles
LitBank NLP dataset A corpus of 100 labeled github
English novels supporting
natural language processing
and computational humanities
tasks
ChineseULMFiT Sentiment Analysis Text github
Classification Corpus and
Model
Administrative division data for provinces, municipalities, and towns, annotated with pinyin | | github
Automated github
Summarization
Corpus of Education
Industry News
Chinese Natural github
Language Processing
Dataset
Baidu Zhidao Q&A Corpus | More than 5.8 million questions, 9.38 million answers, and 5,800 classification labels. The question-and-answer corpus can support a variety of applications, such as chat question answering and logic mining | github
Wikipedia Massively 85 languages, 1620 language github
Parallel Text Corpus pairs, 135M contrasting
sentences
Ancient Poetry github repo
Thesaurus
more complete
ancient poetry
lexicon
Low memory loading of Wikipedia data | Uses the new version of the nlp library to load the 17GB+ English Wikipedia corpus while occupying only 9MB of memory, with a traversal speed of 2-3 Gbit/s | github
Couplet data | More than 700,000 couplets | github
"Color Dictionary" github
dataset
42GB of JD github
Customer Service
Dialogue Data
(CSDD)
700,000 couplet data link
Username Blacklist github
List
Dependency parsing 40,000 high-quality labeled Homepage
corpus data
People's Daily github
Corpus Processing
Toolset
False news dataset github
fake news corpus
Poetry Quality github
Evaluation /
Fine-grained
Emotional Poetry
Corpus
Open tasks related to Dataset and current best github
Chinese natural results
language processing
Chinese abbreviation github
dataset
Chinese task Representative dataset - github
benchmarking benchmark (pretrained) model
- corpus - baseline - toolkit -
leaderboard
Chinese Rumor github
Database
CLUEDatasetSearch Chinese and English NLP github
datasets Search all Chinese
NLP datasets, with commonly
used English NLP datasets
attached
Multi-Document github
Summarization
Dataset
Politeness transfer task ("Make Everyone Courteous") | Transforms impolite sentences into polite ones while preserving meaning; provides a dataset with 1.39M+ instances | paper and code
Cantonese/English github
Conversational
Bilingual Corpus
List of Chinese NLP github
datasets
Named entity recognition dataset for person, place, and organization names | | github
Chinese Language Includes representative github
Comprehension datasets & benchmark models
Benchmark & corpora & leaderboards
OpenCLaP multi-domain open-source Chinese pre-trained language model repository | Civil documents, criminal documents, Baidu Encyclopedia | github
Chinese whole-word-masking BERT and two reading comprehension datasets | DRCD dataset: released by Delta Research Center in Taiwan, China; it has the same form as SQuAD and is an extractive reading comprehension dataset based on Traditional Chinese. CMRC 2018 dataset: Chinese machine reading comprehension data released by the Joint Laboratory of HIT and iFLYTEK. Given a question, the system must extract a span from the text as the answer, in the same form as SQuAD | github
Dakshina dataset Latin/native script parallel github
dataset for twelve South Asian
languages
OPUS-100 Multilingual (100 kinds) github
parallel corpus centered on
English
Chinese Reading github
Comprehension
Dataset
Chinese natural github
language processing
vector collection
Chinese Language Includes representative github
Comprehension datasets, benchmark
Benchmark (pretrained) models, corpora,
leaderboards
Large list of NLP github
datasets/benchmark
tasks
700,000 couplet data github
Parallel Corpus of Classical (Ancient) Chinese and Modern Chinese | The shorter texts include "The Analects of Confucius", "Mencius", "Zuo Zhuan", and other short ancient books, which have been merged with "Zi Zhi Tong Jian" | github
COLDataset, Chinese Offensive Language Detection Dataset | Covers topics such as race, gender, and region; the data will be released after the paper is published | paper
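The WeChat official account corpus listed above is described as one JSON article per line with the fields name, account, title, and content. A minimal Python sketch of reading data in that layout follows; the sample records and the `iter_articles` helper are hypothetical, and the real corpus may differ in details such as encoding.

```python
import io
import json

# Hypothetical sample in the described layout: one JSON object per line with
# the fields name, account, title, and content (all values are made up).
SAMPLE = (
    '{"name": "Example Account", "account": "example_id", '
    '"title": "First post", "content": "Plain text body."}\n'
    '\n'
    '{"name": "Example Account", "account": "example_id", '
    '"title": "Second post", "content": "More plain text."}\n'
)


def iter_articles(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


for article in iter_articles(io.StringIO(SAMPLE)):
    print(article["account"], article["title"])
```

Because the helper reads lazily, a multi-gigabyte file can be streamed by passing an open file object (for example one opened with `encoding="utf-8"`, which is an assumption about the corpus) instead of the in-memory sample.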
Thesaurus and Lexical Tools
Resource name Description Link
(Name)
textfilter Sensitive word observerss/textfilter
filtering in Chinese
and English
Name extraction Chinese (modern, cocoNLP
function ancient) names,
Japanese names,
Chinese surnames
and first names,
titles (big aunt,
little aunt, etc.),
English ->
Chinese name
(John Lee), idiom
dictionary
Chinese National People's github
Abbreviation Library Congress:
National People's
Congress; China:
People's Republic
of China;
Women's Tennis:
Women/n Tennis/n
Game/vn
Chinese character decomposition dictionary | Lists the ways a Chinese character can be split into components: decomposition (1), decomposition (2), decomposition (3) | kfcd/chaizi
Lexical sentiment value | Example entries: mountain spring water: 0.400704566541; sufficient: 0.37006739587 | rainarch/SentiBridge
Chinese thesaurus, dongxiexidian/Chinese
stop words,
sensitive words
python-pinyin Convert Chinese mozillazg/python-pinyin
characters to
Pinyin
zhtools Conversion skydark/nstools
between
Traditional and
Simplified Chinese
English simulation say wo i ni #say: I tinyfool/ChineseWithEnglish
Chinese love you
pronunciation
engine
chinese_dictionary | Synonym, antonym, and negation word dictionaries | guotong1988/chinese_dictionary
wordninja English string wordninja
segmentation and
word extraction
without spaces
Vocabulary related to automobile brands and automobile parts | | data
Thesaurus organized by THU | IT thesaurus, financial thesaurus, idiom thesaurus, place names, historical celebrity thesaurus, poetry thesaurus, medical thesaurus, diet thesaurus, legal thesaurus, automobile thesaurus, animal thesaurus | link
Crime Legal Terms and Classification Model | Contains 856 crime knowledge graphs, crime prediction based on a training set of 2.8 million crime records, 13-class question classification, and a legal question-answering function based on 200,000 legal question-answer pairs | github
Word segmentation corpus + code | | Baidu network disk link (extraction code: pea6)
Chinese word segmentation + part-of-speech tagging based on Bi-LSTM + CRF | keras implementation | link
Chinese word segmentation and part-of-speech tagging based on Universal Transformer + CRF | | link
Fast Neural Network Word Segmentation Package | | java version
chinese-xinhua Zhonghua Xinhua github
dictionary
database and api,
including
commonly used
Xiehouyu, idioms,
words and
Chinese
characters
SpaCy Chinese Contains Parser, github
model NER, syntax tree
and other
functions. Some
English packages
use spacy's
English model. If
you want to adapt
to Chinese, you
may need to use
spacy's Chinese
model.
Chinese character github
data
Synonyms Chinese github
Synonym Toolkit
HarvestText | Domain-adaptive text mining tools (new word discovery, sentiment analysis, entity linking, etc.) | github
word2word Easy-to-use github
multilingual
word-word pair set
62
languages/3,564
multilingual pairs
Polyphone github
dictionary data and
codes
Chinese characters, github
words, idioms query
interface
103976 English (sql version, csv github
vocabulary packs version, Excel
version)
Big list of swear github
words in English
word pinyin data github
Number calling github
library in 186
languages
Large-scale name github
database of
countries around the
world
Chinese character Extract the github
feature extractor features of
(featurizer) Chinese
characters
(pronunciation
features, font
features) for deep
learning features
char_featurizer - github
Chinese character
feature extraction
tool
Python interface github
library of mecab, the
CJK word
segmentation library
g2pC context-based github
Chinese
pronunciation
automatic marking
module
ssc, Sound Shape Code | Sound-shape code: a Chinese character string similarity calculation method based on the "sound-shape code" | version 1, version 2, blog/introduction
Acquisition of github
multiple
meanings/sense
items of Chinese
words and semantic
disambiguation of
specific sentences
based on the
encyclopedia
knowledge base
Tokenizer is a fast github
and customizable
text tokenization
library
Tokenizers State-of-the-art github
tokenizer with a
focus on
performance and
versatility
Realize text "face github
changing" through
synonym
replacement
token2index is a github
powerful lightweight
term index library
compatible with
PyTorch/Tensorflow
Traditional and github
Simplified
Conversion
Cantonese NLP github
Tools
domain dictionary Professional github
dictionary
knowledge base
covering 68 fields
with a total of 9.16
million words
Pre-trained language model & large model
Resource name (Name) Description Link
BMList Big Model Big List github
Chinese translation of bert link
papers
The slides of the original link
author of bert
Text Classification Practice github
bert tutorial text github
classification tutorial
Bert pytorch github
implementation
Bert pytorch github
implementation
BERT generates sentence github
vectors, BERT does text
classification and text
similarity calculation
Diagram of bert and ELMO github
BERT Pre-trained models github
and downstream
applications
Language/Knowledge github
Representation Tool BERT
& ERNIE
Using the gpt-2 language github
model in Kashgari
Facebook LAMA Probes for analyzing factual and github
commonsense knowledge contained
in pretrained language models.
Language model analysis, providing a
unified access interface for
Transformer-XL/BERT/ELMo/GPT
pre-trained language models
Chinese GPT2 training github
code
XLM: Facebook's cross-lingual pre-trained language model | | github
Massive Chinese pre-trained ALBERT model | | github
Transformers 2.0 | Supports natural language processing pre-trained language models (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) for TensorFlow 2.0 and PyTorch: 8 architectures, 33 pre-trained models, 102 languages | github
8 papers sort out the github
progress and reflection of
BERT related models
French RoBERTa French RoBERTa pre-trained link
pre-trained language language model trained with 138GB
model corpus
Chinese pre-trained ELECTRA model | Pre-trained Chinese model based on adversarial learning | github
albert-chinese-ner Use the pre-trained language model github
ALBERT to do Chinese NER
Open source pre-trained github
language model collection
Chinese ELECTRA github
pre-training model
Predicting Next Word with github
Transformers (BERT,
XLNet, Bart, Electra,
Roberta, XLM-Roberta)
(Model Comparison)
TensorFlow Hub New language models for 40+ link
languages (including Chinese)
UER Chinese pre-trained model github
warehouses based on different
corpora, encoders, and target tasks
(including BERT, GPT, ELMO, etc.)
Multilingual sentence github
vector package
Language Model as a Language Model as a Service github
Service (LMaaS)
Open source language 20 billion parameters, currently the github
model GPT-NeoX-20B largest publicly accessible pre-trained
general autoregressive language
model
Chinese Science Literature Contains 396,209 meta-information github
Dataset (CSL) (titles, abstracts, keywords,
disciplines, categories) of papers in
Chinese core journals. The CSL
dataset can be used as a pre-training
corpus, and can also be used to
construct many NLP tasks, such as
text summarization (title prediction),
keyword generation, and text
classification.
Large model development github
artifact
extract
Resource name (Name) Description Link
time extraction | Integrated into the Python package cocoNLP; welcome to try it | java version, python version
Neural network relationship Chinese is not supported yet github
extraction pytorch
Bert-based named entity Chinese is not supported yet github
recognition pytorch
Keyword (Keyphrase) extraction github
package pke
BLINK: a state-of-the-art entity linking library | | github
Named entity recognition github
implemented by BERT/CRF
Support batch parallel github
LatticeLSTM Chinese named
entity recognition
Building a Model for Medical Contains dictionaries and corpus github
Entity Recognition annotations, based on python
Pipeline Entity and Relationship Extraction Based on TensorFlow and BERT | Entity and relation extraction based on TensorFlow and BERT; the solution to the information extraction task of the 2019 Language and Intelligence Technology Competition. Schema based Knowledge Extraction, SKE 2019 | github
Chinese named entity github
recognition NeuroNER vs
BertNER
Chinese Named Entity github
Recognition Based on BERT
Chinese key phrase extraction github
tool
bert tensorflow version for Chinese github
named entity recognition
bert-Kashgari Kashgari, a keras-based github
encapsulation classification and
labeling framework, can build a
classification or sequence
labeling model in a few minutes
cocoNLP Extraction of information such as github
name, address, email address,
mobile phone number, mobile
phone attribution, etc., rake
phrase extraction algorithm.
Microsoft multilingual recognition package for numbers, units, and expressions such as dates and times | | github
Baidu open source benchmark github
information extraction system
Chinese address word github
segmentation (identification and
extraction of address elements),
NER through sequence
annotation
Open Domain Text Knowledge github
Triple Extraction and
Knowledge Base Construction
Based on Dependency Syntax
Chinese keyword extraction github
method based on pre-training
model
chinese_keyphrase_extractor (CKPE) | A tool for Chinese keyphrase extraction: quickly extracts and identifies keyphrases from natural language text | github
Simple resume parser to extract github
key information from resumes
BERT-NER-Pytorch three github
different modes of BERT
Chinese NER experiments
knowledge graph
Resource name (Name) | Description | Link
Tsinghua University XLORE: Chinese-English cross-lingual encyclopedia knowledge graph | Baidu, Chinese Wiki, English Wiki | link
Automatic generation of github
document maps
Question answering system based on knowledge graph in medical field | This repo draws on another github repo | github
Chinese character github
relationship knowledge
map project
AmpliGraph Knowledge github
Graph Representation
Learning (Python) Library
Knowledge Graph
Concept Link Prediction
Chinese knowledge map github
materials, data and tools
Chinese Knowledge Extract triplet information and github
Graph Based on Baidu build a Chinese knowledge map
Encyclopedia
Zincbase Knowledge github
Graph Construction Toolkit
Question answering github
system based on
knowledge graph
Collation of knowledge github
map deep learning related
materials
Southeast University github
"Knowledge Graph"
graduate course (data)
Knowledge map car audio github
work project
"One Piece" Knowledge github
Graph
A dataset of 132 Covers common sense, city, link
knowledge graphs finance, agriculture, geography,
weather, social networking,
Internet of Things, medical care,
entertainment, life, business,
travel, science and education
Large-scale, structured, link
Chinese-English bilingual
COVID-19 Knowledge
Graph (COKG-19)
Event Triple Extraction github
Based on Dependency
Syntax and Semantic Role
Labeling
Abstract Knowledge The current scale is 500,000, github
Graph supporting the abstraction of
nominal entities, state
descriptions, and event actions
Large-scale Chinese github
knowledge map data 1.4
billion entities
Jiagu natural language Based on models such as github
processing tool BiLSTM, it provides functions
such as knowledge graph
relationship extraction, Chinese
word segmentation,
part-of-speech tagging, named
entity recognition, sentiment
analysis, new word discovery,
keyword text summarization,
text clustering, etc.
medical_NER - Chinese github
Medical Knowledge Graph
Named Entity Recognition
A large list of learning github
materials/datasets/tool
resources related to
knowledge graphs
LibKGE is a knowledge github
graph embedding library
for reproducible research
Military field knowledge graph question answering project based on mongodb storage | Covers 8 categories, including aircraft and space equipment, with more than 100 subcategories and 5,800 items in total in a military weapons knowledge base. The project does not use a graph database for storage; it parses questions with jieba, identifies entity items in them, and answers multiple types of questions based on query templates. It mainly serves as a demo of the question-answering approach used in industry | github
Jingdong Commodity github
Knowledge Graph
Chinese Relation github
Extraction Based on
Distant Supervision
Intelligent Question github
Answering System Based
on Medical Knowledge
Graph
BLINK: a state-of-the-art entity linking library | | github
A small securities github
knowledge
graph/knowledge base
dstlr unstructured text github
scalable knowledge map
construction platform
Baidu Encyclopedia Using BERT-based fine-tuning github
character entry attribute and feature extraction methods
extraction for knowledge graphs
Data related to COVID-19 | Chinese medical dialogue dataset covering COVID-19 and other types of pneumonia; open data sources on COVID-19 from institutions such as Tsinghua University | github, github
DGL-KE Graph github
Embedding
Representation Learning
Algorithm
causality map method data
Causal Event Pairs Based link
on Multi-Domain Text
Datasets
text generation
Resource name (Name) Description Link
Texar Toolkit for Text github
Generation and
Beyond
Prof. Ehud Reiter's Blog | Professor Wan Xiaojun of Peking University strongly recommends this blog, which offers in-depth discussion and reflection on NLG technology, evaluation, and application | link
Large list of resources github
related to text generation
Open Domain Dialogue Natural language link
Generation and Its Practice generation allows
in Microsoft Xiaoice machines to
master the ability
of automatic
creation
Text Generation Control github
A large list of natural github
language generation related
resources
Evaluating Natural link
Language Generation with
BLEURT
Automatic couplet data and Code link
robots
700,000 couplet data
Automatically generate comments | Generates comments from Hacker News article titles using a Transformer encoder-decoder model | github
Natural language to SQL statement generation (English) | | github
Natural Language github
Generation Resource
Collection
Benchmarking Chinese github
Generation Tasks
Topic-specific text github
generation/text
augmentation based on
GPT2
Encoding, Tokenization, github
and Implementation of a
Controlled and Efficient Text
Generation Methodology
TextFooler's adversarial text github
generation module for text
classification/inference
SimBERT BERT model github
based on UniLM
idea, integrating
retrieval and
generation
New word generation and sentence making | Generates non-existent new words from scratch with GPT-2 variants, along with their definitions and example sentences | github
Automatically generate github
multiple choice questions
from text
Synthetic Data Generation github
Benchmark
text summary
Resource name (Name) | Description | Link
Chinese text summarization/keyword extraction | | github
Automatic summarization of resumes based on named entity recognition | | github
Automatic text summarization library TextTeaser | English only | github
Extractive summarization based on the latest language models such as BERT | | github
A Comprehensive Guide to Text Summarization with Deep Learning in Python | | link
(Colab) Abstractive text summarization implementation highlights (tutorial) | | github
Smart Q&A
Resource name (Name) Description Link
Chinese chatbot Train the chatbot you want github
according to your own corpus,
which can be used in scenarios
such as intelligent customer
service, online question and
answer, intelligent chat, etc.
Interesting robot qingyun Chinese chatbot trained by github
qingyun
Open dialogue robots, github
knowledge graphs, semantic
understanding, natural language
processing tools and data
QA pair robot | Amodel-for-Retrivalchatbot: customer service robot, Chinese retrieval chatbot | git
ConvLab open source github
multi-domain end-to-end
dialogue system platform
A dialog system based on the github
latest version of rasa
Chatbots based on the github
financial-judicial domain (with
the nature of small talk)
End-to-end closed-domain github
dialogue system
MiningZhiDaoQACorpus | Baidu Zhidao Q&A data mining project with a corpus of more than 5.8 million questions, each with a question label. The corpus can support a variety of applications, such as logic mining | github
GPT2-chitchat: GPT2 model for Chinese chit-chat | | github
Curated resource list (leaderboards, datasets, papers) on multi-turn response selection for retrieval-based chatbots | | github
Microsoft Conversational Bot github
Framework
chatbot-list Application and architecture of github
intelligent customer service and
chatbots, algorithm sharing and
introduction in the industry
Chinese medical dialogue dataset | | github
A Large-Scale Medical Dialogue Contains 1.1 million medical github
Dataset consultations and 4 million
doctor-patient dialogues
Large-scale cross-domain paper
Chinese task-oriented & data
multi-round dialogue dataset
and model CrossWOZ
Open source conversational github
information search platform
Contextual Interaction github
Multimodal Dialogue Challenge
2020 (DSTC9 2020)
T5 model trained on Quora questions for question paraphrasing (Paraphrase) | | github
Google releases Taskmaster-2 github
natural language task dialogue
dataset
Haystack's flexible, powerful, github
and extensible Question
Answering (QA) framework
Amazon releases github
knowledge-based
human-human open domain
dialogue dataset
Albert Large QA model trained github
based on Baidu webqa and
dureader dataset
CommonsenseQA link
Commonsense-Oriented
English QA Challenge
MedQuAD (English) Medical github
Question Answering Dataset
A Q&A engine using Wikipedia github
text as context, based on Albert
and Electra
A question answering attempt based on a knowledge base of 140,000 songs | Functions include lyrics solitaire (lyric chaining), finding songs from known lyrics, and question answering over the song-artist-lyrics triangle | github
text error correction
Resource name (Name) | Description | Link
Chinese text error correction module code | | github
English spell checking library | | github
Python spell checking library | | github
GitHub Typo Corpus | Large-scale multilingual spelling/grammar error dataset from GitHub | github
BertPunc | BERT-based state-of-the-art punctuation restoration model | github
Chinese writing proofreading tool | | github
Text Error Correction Literature List | Chinese Spell Checking (CSC) and Grammatical Error Correction (GEC) | github
Winner of the Text Smart Proofreading Contest | Already deployed in applications; from a team of Soochow University and DAMO Academy | link
multimodal
Resource name (Name) | Description | Link
Chinese multimodal dataset "Wukong" | Large-scale dataset open-sourced by Huawei's Noah's Ark Lab, containing 100 million image-text pairs | github
Chinese image-text representation pre-training model Chinese-CLIP | The Chinese version of the CLIP pre-trained model, open-sourced at multiple model scales; a few lines of code handle Chinese image-text representation extraction and image-text retrieval | github
speech processing
Resource name (Name) Description Link
ASR Speech Dataset + github
Chinese Speech
Recognition System Based
on Deep Learning
Tsinghua University THCHS30 Chinese Speech Dataset | | data_thchs30.tgz (OpenSLR domestic mirror), data_thchs30.tgz; test-noise.tgz (OpenSLR domestic mirror), test-noise.tgz; resource.tgz (OpenSLR domestic mirror), resource.tgz; Free ST Chinese Mandarin Corpus, Free ST Chinese Mandarin Corpus; AIShell-1 open source version dataset (OpenSLR domestic mirror), AIShell-1 open source version dataset; Primewords Chinese Corpus Set 1 (OpenSLR domestic mirror), Primewords Chinese Corpus Set 1
laughter detector github
Common Voice Speech Includes over 1,400 link
Recognition Dataset New hours of speech
Version samples from 42,000
contributors, covering
github
speech-aligner A tool for generating github
phoneme-level
time-aligned
annotations from
"human voice speech"
and its "language text"
ASR Speech github
Dictionary/Dictionary
Speech Sentiment Analysis github
masr Chinese speech github
recognition, providing
pre-training model, high
recognition rate
Chinese Text Normalization github
for Speech Recognition
Voice quality evaluation github
indicators (MOSNet,
BSSEval, STOI, PESQ,
SRMR)
Chinese/English github
Pronunciation Dictionary
for Speech Recognition
CoVoST: multilingual speech-to-text translation corpus released by Facebook | Includes audio, text transcriptions, and English translations for 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, and Chinese) | github
Parakeet text-to-speech github
synthesis based on
PaddlePaddle
(Java) Accurate Speech github
Natural Language
Detection Library
Text-to-Speech Synthesis github
Implemented in TensorFlow
2
Python audio feature github
extraction package
ViSQOL | An objective, full-reference metric for perceived audio quality, with two modes: audio and speech | github
zhrtvc Easy-to-use Chinese github
voice clone and
Chinese speech
synthesis system
aukit An easy-to-use speech github
processing toolbox,
including speech noise
reduction, audio format
conversion, feature
spectrum generation
and other modules
phkit An easy-to-use github
phoneme processing
toolbox, including
Chinese phonemes,
English phonemes,
text-to-pinyin, text
regularization and other
modules
zhvoice Chinese speech corpus, github
the speech is clearer
and more natural,
including 8 open source
data sets, 3200
speakers, 900 hours of
speech, 13 million
words
audio | Audio annotation tools for speech activity detection, binarization, speaker recognition, automatic speech recognition, emotion recognition, and other tasks | github
Deep Learning Emotional github
Text-to-Speech Synthesis
Python audio data github
augmentation library
Audio Enhancement Based github
on Large-Scale Audio
Dataset Audioset
voice transfer github
document processing
Resource name Description Link
(Name)
LayoutLM-v3 github
Document
Understanding
Model
PyLaia Deep github
Learning Toolkit
for Handwritten
Document
Analysis
Single-document github
unsupervised
keyword
extraction
DocSearch Free github
Documentation
Search Engine
fdfgen Ability to automatically create pdf link
documents and fill in information
pdfx Automatically extract cited references link
and download the corresponding pdf file
invoice2data | Invoice pdf information extraction | invoice2data
PDF document github
information
extraction
PDFMiner | Gets the exact position of text on a page, as well as other information such as fonts or lines. It also includes a PDF converter that can convert PDF files to other text formats such as HTML, and an extensible PDF parser that can be used for purposes other than text analysis | link
PyPDF2 PyPDF 2 is a python PDF library capable link
of splitting, merging, cropping and
converting pages of PDF files. It can also
add custom data, viewing options and
passwords to PDF files. It can retrieve
text and metadata from PDFs, and can
also merge entire files together.
ReportLab ReportLab can quickly create PDF link
documents. A time-proven,
super-easy-to-use open source project
for creating complex, data-driven PDF
documents and custom vector graphics.
It's free, open source, and written in
Python. With more than 50,000
downloads per month, the package is
part of standard Linux distributions,
embedded in many products, and was
chosen to power Wikipedia's print/export
functionality.
SIMPdf: a simple PDF file text editor written in Python | | github
pdf-diff PDF file diff tool can display the github
difference between two pdf documents
table processing
Resource name (Name) | Description | Link
Use U-Net to automatically detect tables in documents and reconstruct them | | github
pdftabextract | Parses table information after OCR recognition; very powerful | link
tabula-py | Directly converts table information in PDFs to pandas DataFrames; available in Java and Python versions |
camelot | PDF table parsing | link
pdfplumber | PDF table parsing |
PubLayNet | Can segment paragraphs and identify tables and pictures | link
Extract tabular data from papers | | github
Finding answers in tables with BERT | | github
Series of articles on table question answering | | Introduction to the end of the model
Generate tabular data using GAN (English only) | | github
carefree-learn | (PyTorch) Automated Machine Learning (AutoML) package for tabular datasets | github
Closed-domain fine-tuning table detection | | github
PDF table data extraction tool | | github
TaBERT | A new model for understanding queries over tabular data | paper
Table processing | Awesome-Table-Recognition | github
text match
Resource name (Name) Description Link
Sentence, QA similarity A collection of text similarity matching github
matching MatchZoo algorithms, including multiple deep
learning methods, worth trying.
Chinese Question Sentence github
Similarity Calculation
Competition and Scheme
Summary
similarity similarity Written in java, it is used for similarity github
calculation toolkit calculations related to words,
phrases, sentences, lexical analysis,
sentiment analysis, semantic
analysis, etc.
Chinese word similarity Combines the word similarity github
calculation method calculation methods of Synonyms Cilin
Extended Edition and HowNet, giving
broader vocabulary coverage and more
accurate results.
Python string similarity github
algorithm library
Similar sentence judgment 100,000 training samples provided github
model based on Siamese
bilstm model, providing
training data set and test
data set
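Several entries above score how close two strings are. As a rough illustration of one classic measure (not the API of any listed library), here is a minimal sketch of Levenshtein edit distance with a normalized similarity score. The function names `levenshtein` and `similarity` are hypothetical and chosen only for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Because the comparison is character based, it works on Chinese text without word segmentation, although it captures surface overlap rather than meaning.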
Text Data Augmentation
Resource name (Name) Description Link
Chinese NLP Data Augmentation (EDA) Tool github
English NLP data augmentation tool github
One-click Chinese data augmentation tool github
The application and effect of data augmentation in link
machine translation and other NLP tasks
NLP Data Augmentation Resource Collection github
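EDA (Easy Data Augmentation) is usually described as four token-level operations: synonym replacement, random insertion, random swap, and random deletion. The sketch below covers only the two that need no thesaurus. It is an illustration under stated assumptions, not the code of any tool listed above; the function names are hypothetical, and the input is assumed to be already tokenized (for Chinese, by a word segmenter).

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of `tokens` with `n_swaps` random position swaps."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability `p`,
    always keeping at least one token of a non-empty input."""
    tokens = list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so runs are repeatable
    sentence = ["我", "喜欢", "自然", "语言", "处理"]
    print(random_swap(sentence, 1, rng))
    print(random_deletion(sentence, 0.2, rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing models trained on augmented data.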
Common regular expressions
Resource name Description Link
(Name)
Regular expression to extract email Integrated into the Python package cocoNLP
Extract phone_number Integrated into the Python package cocoNLP
Regular expression for extracting ID number
IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'
IDs = re.findall(IDCards_pattern, text, flags=0)
IP address regular expression
(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
Tencent QQ [1-9]([0-9]{5,11})
number regular
expression
Domestic [0-9-()()]{7,18}
fixed-line number
regular expression
username regex [A-Za-z0-9_\-\u4e00-\u9fa5]+
Regular matching github
of domestic phone
numbers (three
major operators +
virtual, etc.)
Regular github
Expression
Tutorial
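The patterns in this table can be tried with Python's standard `re` module. The sketch below is a hedged illustration: the helper names are hypothetical, the patterns are anchored with `fullmatch` so they validate a whole string rather than search inside a longer one, the dots in the IP pattern are escaped, and the hyphen in the username class is escaped so that it is read as a literal character.

```python
import re

# 18-character resident ID: region, birth date (YYYYMMDD), sequence, check char.
ID_CARD = re.compile(
    r'[1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX]'
)

# One IPv4 octet, repeated four times and joined by literal dots.
OCTET = r'(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'
IPV4 = re.compile(r'\.'.join([OCTET] * 4))

# QQ number as given in the table: a non-zero digit plus 5 to 11 digits.
QQ = re.compile(r'[1-9][0-9]{5,11}')

# Letters, digits, underscore, hyphen, and common CJK ideographs.
USERNAME = re.compile(r'[A-Za-z0-9_\-\u4e00-\u9fa5]+')


def is_id_card(text: str) -> bool:
    return ID_CARD.fullmatch(text) is not None


def is_ipv4(text: str) -> bool:
    return IPV4.fullmatch(text) is not None


def is_qq(text: str) -> bool:
    return QQ.fullmatch(text) is not None


def is_username(text: str) -> bool:
    return USERNAME.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_ipv4("192.168.0.1"))
    print(is_ipv4("256.1.1.1"))
    print(is_username("user_名字-1"))
```

These checks only validate the shape of the input. For example, the ID pattern does not verify the checksum character, and the date part accepts impossible dates such as February 31.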
text search
Resource name (Name) Description Link
Efficient Fuzzy Search Tool github
Large list/search engine of link
BERT models for various
languages/tasks
Deepmatch's deep matching github
model library for
recommendation, advertising
and search
wwsearch is a full-text search github
engine developed by the
enterprise WeChat background
aili - the fastest in-memory github
index in the East
Efficient string matching tool a fast string matching library for github
RapidFuzz Python and C++, which is using
the string similarity calculations
from FuzzyWuzzy
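For a feel of what fuzzy string search does, the standard library's `difflib` can rank candidates by similarity to a query. This is a stand-in sketch only: it does not use RapidFuzz or any other tool listed above, and the `fuzzy_search` wrapper and its default parameters are assumptions made for illustration.

```python
import difflib


def fuzzy_search(query, candidates, limit=3, cutoff=0.6):
    """Return up to `limit` candidates whose similarity ratio to `query`
    is at least `cutoff`, best match first."""
    return difflib.get_close_matches(query, candidates, n=limit, cutoff=cutoff)


if __name__ == "__main__":
    names = ["apple", "peach", "banana", "grape"]
    print(fuzzy_search("appel", names))
```

`difflib` is convenient for small lists; dedicated libraries are usually preferred for large collections because they are implemented in faster compiled code.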
reading comprehension
Resource name (Name) Description Link
Efficient Fuzzy Search Tool github
Large list/search engine of BERT models for various link
languages/tasks
Deepmatch's deep matching model library for github
recommendation, advertising and search
allennlp reading comprehension supports a variety of github
data and models
sentiment analysis
Resource name (Name) Description Link
aspect sentiment analysis github
package
awesome-nlp-sentiment-analysis Sentiment analysis, emotional github
cause identification, evaluation
object and evaluation word
extraction
Sentiment analysis technology github
enables intelligent customer
service to better understand
human emotions
event extraction
Resource name (Name) Description Link
Chinese event extraction github
List of Literature Resources for NLP Event Extraction github
BERT event extraction implemented by PyTorch github
(ACE 2005 corpus)
News Event Clue Extraction github
machine translation
Resource Description Link
name (Name)
Wudao The command line version of Youdao Dictionary; github
Dictionary supports English-Chinese lookup in both directions and online
search
NLLB Language model NLLB that supports arbitrary link
inter-translation of 200+ languages
Easy-Translate Script to translate large text files locally, based on github
Facebook/Meta AI's M2M100 model and NLLB200
model, supports 200+ languages
number conversion
Resource name (Name) Description Link
The best conversion tool between Chinese numerals and github
Arabic numerals
Quickly convert "Chinese numerals" and "Arabic github
numerals"
Parse and convert natural language numeric strings github
to integers and floating point numbers
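To show the kind of parsing these conversion tools perform, here is a minimal sketch that turns a Chinese numeral string into an integer. It is not taken from any listed repository; the name `chinese_to_int` is hypothetical, and it assumes well-formed input made of the digits 零 to 九 (plus 〇 and 两) and the units 十, 百, 千, 万, 亿. Colloquial short forms such as 一百二 (meaning 120) are not handled.

```python
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    total = 0    # value of sections already closed by 万 or 亿
    section = 0  # value accumulated since the last 万 or 亿
    digit = 0    # most recent digit, still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 (as in 十五) counts as one ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            # 万 scales only the current section.
            total += (section + digit) * 10**4
            section = digit = 0
        elif ch == "亿":
            # 亿 scales everything read so far, so 两万亿 works.
            total = (total + section + digit) * 10**8
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


if __name__ == "__main__":
    print(chinese_to_int("一千零五"))
    print(chinese_to_int("一亿二千万"))
```

The three running values mirror how the numerals are read aloud: digits attach to the next small unit, small units add up inside a section, and 万 or 亿 closes the section.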
anaphora resolution
Resource name (Name) Description Link
Chinese anaphora resolution github
data
Baidu link, code a0qq
text clustering
Resource name (Name) Description Link
TextCluster short text clustering preprocessing github
module Short text cluster
Text Categorization
Resource name (Name) Description Link
NeuralNLP-NeuralClassifier Tencent open source github
deep learning text classification tool
knowledge reasoning
Resource name (Name) Description Link
Graphbrain: an open source AI software library and github
research tool designed to facilitate automatic
meaning extraction and text understanding as well as
knowledge exploration and inference
(Harvard) free book on causal reasoning pdf
Interpretable Natural Language Processing
Resource name (Name) Description Link
State-of-the-art interpreter library for textual machine github
learning models
text attack
Resource name (Name) Description Link
TextAttack natural github
language processing
model adversarial attack
framework
OpenBackdoor: Text OpenBackdoor is developed based on github
backdoor attack and Python and PyTorch, which can be used
defense toolkit to reproduce, evaluate and develop
related algorithms for text backdoor
attack and defense
text visualization
Resource name (Name) Description Link
Scattertext text github
visualization (python)
whatlies word vector spacytools
interactive visualization
PySS3 machine github
visualization tool for SS3
text classifiers for
explainable AI
Render 3D images with github
Notepad
attnvis: interactive attention github
visualization for GPT2, BERT
and other transformer
language models
Texthero text data efficient Including preprocessing, keyword github
processing package extraction, named entity
recognition, vector space
analysis, text visualization, etc.
text annotation tool
Resource name (Name) Description Link
Overview of NLP annotation platform github
brat rapid annotation tool sequence annotation tool link
Poplar web version natural language annotation tool github
LIDA is a lightweight interactive dialogue annotation github
tool
doccano is a web-based open source collaborative github
multilingual text annotation tool
Datasaur.ai online data labeling workflow link
management tool
language detection
Resource Description Link
name
(Name)
langid 97 languages detected https://github.com/saffsd/langid.py
langdetect language detection https://code.google.com/archive/p/language-detection/
comprehensive tool
Resource name Description Link
(Name)
jieba jieba
hanlp hanlp
nlp4han Chinese natural language processing tool set github
(sentence segmentation/word
segmentation/part-of-speech
tagging/chunking/syntax analysis/semantic
analysis/NER/N-gram/HMM/pronoun
resolution/sentiment analysis/spelling check)
Progress in Hate link
Speech Detection
Bert application Including named entity recognition, github
based on Pytorch sentiment analysis, text classification and
text similarity, etc.
Some basic models github
of natural language
Template code for github
sequence tagging
and text
classification with
BERT
jieba_fast github
accelerated version
of jieba
Stanford NLP Pure Python version of natural language link
processing package
Python Spoken github
Natural Language
Processing Toolset
(English)
PreNLP natural github
language
preprocessing
library
Some papers and Including topic models, word vectors (Word github
code related to NLP Embedding), named entity recognition
(NER), text classification (Text Classification),
text generation (Text Generation), text
similarity (Text Similarity) calculation, etc.,
involving various NLP-related algorithms,
based on Keras and TensorFlow
Python text github
mining/NLP practical
example
Forte's flexible and github
powerful natural
language
processing pipeline
toolset
stanza Stanford Can handle more than sixty languages github
team NLP tools
Fancy-NLP is a text github
knowledge mining
tool for building
product portraits
Comprehensive and github
easy Chinese NLP
toolkit
Reproduction of github
vectorized recall
pipelines commonly
used in the industry,
based on DSSM
Texthero text data Including preprocessing, keyword extraction, github
efficient processing named entity recognition, vector space
package analysis, text visualization, etc.
nlpgnn graph neural github
network natural
language
processing toolbox
Macadam Based on Tensorflow (Keras) and github
bert4keras, a natural language processing
toolkit focusing on text classification,
sequence labeling and relation extraction
LineFlow is an github
efficient NLP data
loader for all deep
learning frameworks
Arabica: Python text github
data exploratory
analysis toolkit
Python stress github
testing tool:
SMSBoom
funny tool
Resource name Description Link
(Name)
Wang Feng Lyric phunterlau/wangfeng-rnn
Generator
Analysis of github
girlfriend's
emotional
fluctuations
NLP is too github
difficult series
Variable naming github link
artifact
Image text github
removal, can be
used for manga
translation
CoupletAI - Automatic couplet system github
couplet based on
generation CNN+Bi-LSTM+Attention
Solving Complex github
Mathematical
Equations Using
Neural Network
Symbolic
Reasoning
Question Functions include a lyric chain github
answering robot game, finding songs from known
based on a lyrics, and question answering
knowledge base about the relations among
of 140,000 songs songs, artists, and lyrics
COPE - Metric github
Poem Editor
Paper2GUI An AI desktop APP toolbox for github
ordinary people. It can be used
immediately without installation.
It already supports 18+ AI
models, covering speech
synthesis, video frame
interpolation, video
super-resolution, object
detection, image stylization,
OCR, etc.
Politeness github paper
estimator (trained
using Sina Weibo
data)
Grass python Chinese programming homepage gitee
(Python Chinese language
version) getting
started guide
course report interview
Resource name Description Link
(Name)
Natural Language link
Processing Report
Knowledge Graph link
Report
Data Mining Report link
autonomous driving link
report
Machine translation link
report
blockchain report link
robot report link
Computer Graphics link
Report
3D printing report link
Facial Recognition link
Report
Artificial Intelligence link
Chip Report
cs224n deep learning course link; PyTorch
natural language implementation of the
processing course models in the course: link
Natural Language github
Processing by
Example Tutorial for
Deep Learning
Researchers
"Natural Language github
Processing" by Jacob
Eisenstein
ML-NLP Machine learning (Machine github
Learning), knowledge
points and code
implementation often
tested in NLP interviews
NLP task example github
project code set
2019 NLP Highlights download
Review
nlp-recipes produced github
by Microsoft--best
practices and
examples of natural
language processing
Transfer Learning in youtube
Natural Language
Processing (NLP)
Machine Learning link github
Systems book
Contest
Resource name (Name) Description Link
Review the TOP solutions of all NLP competitions github
2019 Baidu Triple Extraction Competition, "Scientific github
Space Team" source code (7th place)
Financial Natural Language Processing
Resource name (Name) Description Link
BDCI2019 Financial Negative Information Judgment github
Open source financial investment data extraction tool github
A large list of natural language processing research github
resources in the financial field
Chatbots based on the financial-judicial domain (with github
the nature of small talk)
Demonstration of small-scale financial knowledge github
graph construction process
Medical Natural Language Processing
Resource name Description Link
(Name)
Chinese medical NLP github
public resources
arrangement
spaCy Medical Text github
Mining and
Information Extraction
Building a Model for Contains dictionaries and corpus github
Medical Entity annotations, based on python
Recognition
Question answering github This
system based on repo refers to
knowledge graph in github
medical field
Chinese medical github
dialogue data
Chinese medical
dialogue data set
A Large-Scale Contains 1.1 million medical github
Medical Dialogue consultations and 4 million
Dataset doctor-patient dialogues
Data related to COVID-19 and other types of github
COVID-19 pneumonia: Chinese medical
dialogue dataset; open data github
sources from institutions such as
Tsinghua University (COVID-19)
Legal Natural Language Processing
Resource name Description Link
(Name)
Blackstone's spaCy github
pipeline and NLP
model for
unstructured legal
text
List of Legal github
Intelligence Literature
Resources
Chatbots based on github
the financial-judicial
domain (with the
nature of small talk)
Crime Legal Terms Contains a knowledge graph of 856 crimes, github
and Classification crime prediction based on a training database
Model of 2.8 million crimes, 13-class question
classification, and a legal information question
answering function based on 200,000 legal
question-answer pairs
text to image
Resource name Description Link
(Name)
Dalle-mini A mini version of DALL·E that generates github
pictures based on text prompts
other
Resource name (Name) Description Link
phone China mobile phone ls0f/phone
attribution query
phone International mobile AfterShip/phone
phone and telephone
attribution inquiry
ngender gender based on observers/ngender
name
A summary of the differences link
between Chinese and English
natural language processing
NLP
Technical documents (PDF or github
PPT) shared by experts at
major companies
comparxiv is used to compare pypi
the difference between two
submitted versions on arXiv
Meta-architecture of github
CHAMELEON deep learning
news recommendation system
Automatic Resume Screening github
System
A variety of text readability github
evaluation indicators
implemented by Python
Data Science ML Full Stack Roadmap
https://github.com/hemansnation/Data-Science-ML-Full-Stack-2022
Connect with me on these platforms:
LinkedIn: https://www.linkedin.com/in/hemansnation/
Twitter: https://twitter.com/hemansnation
GitHub: https://github.com/hemansnation
Instagram: https://www.instagram.com/masterdexter.ai/
Are you a professional?
DM for One-on-One sessions for Python, Data Science, Machine Learning,
and Data Engineering.
Here: https://bit.ly/3U6zQvQ
Python Notion Template
https://hemansnation.gumroad.com/l/god-level-python-with-himanshu-ramchandani