What is Data Preprocessing in NLP?
In Natural Language Processing (NLP), preprocessing means transforming raw text
data so that a machine can understand it and analyze it effectively. It is a
crucial step, because without proper preprocessing a model's performance can
suffer.
Let us now walk through, step by step, how data preprocessing is done in NLP:
1. Tokenization
👉 Meaning: Breaking a sentence or paragraph into small units (words or phrases).
🔹 Example:
💬 Input: "Mujhe robotics pasand hai."
🔹 Tokenized Output: ["Mujhe", "robotics", "pasand", "hai", "."]
Code: see the implementation section at the end of this document ("CODE For Data Preprocessing").
2. Stopword Removal
👉 Meaning: Removing words that are not very useful for NLP, such as "ka", "ke", "hai",
"ko", etc.
🔹 Example:
💬 Input: "Mujhe robotics pasand hai."
🔹 Stopword Removed: ["Mujhe", "robotics", "pasand"]
3. Stemming and Lemmatization
👉 Meaning: Converting a word to its root form (its base shape).
🔹 Example:
💬 Input: "ladkiyan khel rahi hain"
🔹 Stemming: ["ladki", "khel", "rah"]
🔹 Lemmatization: ["ladki", "khel", "rahi", "hai"] (this is generally more accurate)
4. Lowercasing
👉 Meaning: Converting all words to lowercase so that they remain consistent.
🔹 Example:
💬 Input: "AI Aur Robotics Bohot Interesting Hain!"
🔹 Lowercased Output: "ai aur robotics bohot interesting hain!"
5. Removing Punctuation and Special Characters
👉 Meaning: Removing extra punctuation marks or special symbols that the model does not
need.
🔹 Example:
💬 Input: "Hello!!! Aap kaisay hain???"
🔹 Output: "Hello Aap kaisay hain"
6. Spelling Correction
👉 Meaning: Converting misspelled words into their correct form.
🔹 Example:
💬 Input: "Mujhy robtiks pasnd hai."
🔹 Output: "Mujhe robotics pasand hai."
7. Text Encoding (Vectorization)
👉 Meaning: Converting words into numbers so that the machine can understand them. Some
common techniques are:
1️⃣ Bag of Words (BoW)
2️⃣ TF-IDF (Term Frequency-Inverse Document Frequency)
3️⃣ Word Embeddings (Word2Vec, GloVe, BERT, etc.)
🔹 Example:
💬 Sentence: "Robotics interesting hai"
🔹 Vector Form: [0.5, 0.8, 0.2, ...]
Conclusion
Data preprocessing in NLP is a crucial step that converts raw text into a clean, structured
form so that AI models can understand it properly. If you are working on an NLP project,
following these steps is essential! 😃🔥
If you need more detail on any specific step, just ask! 🚀
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1. LSTM (Long Short-Term Memory)
The Concept of LSTM:
An LSTM is a special type of Recurrent Neural Network (RNN) that can handle long-term
dependencies well while processing sequence data. RNNs feed their output back in as the
input to the next step, but when sequences get very long (such as a long sentence), RNNs
forget older information (this is called the vanishing gradient problem). LSTMs solve
this problem.
LSTM Structure (Gates):
An LSTM has 3 main gates, which decide which information to remember, which to forget,
and which to update:
1. Forget Gate:
o This gate does the job of "forgetting" old memory. That is, when a word's
importance drops, this gate decides how much of the old memory to erase.
o Formula:
f_t = σ(W_f * [h_(t-1), x_t] + b_f)
Here f_t is the forget gate's output, x_t is the current input, and h_(t-1) is
the previous hidden state.
2. Input Gate:
o This gate is responsible for adding new information. That is, when the model
needs to learn from new data, the input gate decides how much of the input to
write into memory.
o Formula:
i_t = σ(W_i * [h_(t-1), x_t] + b_i)
3. Output Gate:
o This gate is responsible for generating the final output, which represents the
model's decision at that step.
o Formula:
o_t = σ(W_o * [h_(t-1), x_t] + b_o)
Together these gates let the LSTM maintain its internal memory and filter the important
information at every time step, as shown in the sketch below.
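To make these formulas concrete, here is a minimal NumPy sketch of one LSTM cell step. It is an illustration under stated assumptions, not production code: the weights are random, and it also includes the candidate cell state c̃_t and the cell-state update c_t = f_t * c_(t-1) + i_t * c̃_t, standard LSTM components that the prose above does not spell out.
Code (sketch):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    # Concatenate previous hidden state and current input: [h_(t-1), x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: how much old memory to keep
    i_t = sigmoid(W_i @ z + b_i)        # input gate: how much new info to write
    o_t = sigmoid(W_o @ z + b_o)        # output gate: how much memory to expose
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state (new information)
    c_t = f_t * c_prev + i_t * c_tilde  # internal memory update
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t
# Tiny demo with random weights: input size 4, hidden size 3
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
Ws = [rng.normal(size=(n_h, n_h + n_in)) for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]
h_t, c_t = lstm_cell_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), *Ws, *bs)
print("h_t:", h_t)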
2. GRU (Gated Recurrent Unit)
The Concept of GRU:
Like the LSTM, a GRU is a Recurrent Neural Network, but its structure is simpler. It also
handles long-term dependencies, while being somewhat more efficient and faster.
GRU Structure (Gates):
A GRU has only 2 gates, which update information and retain old information.
1. Update Gate:
o This gate decides how much of the old memory to update and how much new
information to add.
o Formula:
z_t = σ(W_z * [h_(t-1), x_t] + b_z)
2. Reset Gate:
o This gate does the job of resetting old memory. That is, when the old
information is no longer needed, this gate helps to discard it.
o Formula:
r_t = σ(W_r * [h_(t-1), x_t] + b_r)
A GRU differs from an LSTM in that it has fewer parameters, which makes it faster and less
computationally expensive. It is generally quite effective on smaller datasets. A minimal
sketch of one GRU cell step follows.
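As with the LSTM, here is a minimal NumPy sketch of one GRU cell step, with random weights purely for illustration. The candidate state h̃_t and the blend h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t are standard GRU components not written out in the prose above.
Code (sketch):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def gru_cell_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    z = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z + b_z)   # update gate: how much to refresh the memory
    r_t = sigmoid(W_r @ z + b_r)   # reset gate: how much old memory to ignore
    # Candidate hidden state: old memory is scaled by the reset gate first
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    # Update gate blends old memory with the candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde
# Tiny demo: input size 4, hidden size 3, random weights
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
Ws = [rng.normal(size=(n_h, n_h + n_in)) for _ in range(3)]
bs = [np.zeros(n_h) for _ in range(3)]
h_t = gru_cell_step(rng.normal(size=n_in), np.zeros(n_h), *Ws, *bs)
print("h_t:", h_t)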
3. Transformers
The Concept of Transformers:
Transformers are based on the self-attention mechanism, in which every word learns its
relationship to all the other words in its sequence. Their key advantage is that they can
process in parallel, whereas LSTMs and GRUs process one word at a time.
Attention Mechanism (Self-Attention):
The goal of the attention mechanism is to assign an importance to each word, which helps in
understanding its context. In a Transformer, self-attention means that each word measures its
relevance against all the other words in order to understand its own context.
1. Query, Key, and Value:
o Query (Q), Key (K), and Value (V) are tensors computed for every word.
o The Query is compared with the other words' Keys to calculate the relevance
(a NumPy sketch of this computation follows the list below).
o Formula:
Attention = softmax((QK^T) / sqrt(d_k)) * V
where d_k is the dimension of the keys.
2. Multi-Head Attention:
o In a Transformer, several attention mechanisms are applied at the same time
(multiple "heads"), which helps the model build richer representations.
o The benefit is that the model understands the context from several different
aspects at once.
3. Positional Encoding:
o A Transformer must understand the order of the sequence, so a positional
encoding is added to tell the model in which order the words appear in the
sequence.
4. Feed-Forward Neural Networks:
o After each attention layer there is a feed-forward neural network that
transforms each word's representation.
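As promised above, here is a minimal NumPy sketch of scaled dot-product self-attention, together with the sinusoidal positional encoding from the original Transformer paper. Q, K, and V are random here purely for illustration; in a real model they come from learned projections of the (position-encoded) word embeddings.
Code (sketch):
import numpy as np
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V):
    # Attention = softmax(QK^T / sqrt(d_k)) * V, as in the formula above
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of each query to each key
    weights = softmax(scores, axis=-1)  # one probability row per word
    return weights @ V                  # weighted mix of the value vectors
def sinusoidal_positional_encoding(seq_len, d_model):
    # sin on even dimensions, cos on odd dimensions, per position
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
# Toy example: 3 words ("Robotics interesting hai"), model dimension 4
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4)) + sinusoidal_positional_encoding(3, 4)
Q = K = V = emb   # self-attention: all three come from the same sequence
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per word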
Advantages of Transformers:
Parallelization: Compared with LSTMs and GRUs, Transformers process the whole
sequence at once, which is faster.
Handling Long Dependencies: They handle long-term dependencies efficiently.
Popular Transformer Models:
BERT (Bidirectional Encoder Representations from Transformers): This model understands
context in both directions (left-to-right and right-to-left).
GPT (Generative Pre-trained Transformer): This model is unidirectional, i.e. left-to-right.
Summary Comparison
LSTM: Handles long-term memory, more complex, 3 gates (forget, input, output).
GRU: Simpler and faster than LSTM, 2 gates (update, reset).
Transformers: Use the self-attention mechanism to handle long dependencies efficiently
and benefit from parallel processing.
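To try the recurrent layers in practice, here is a minimal Keras sketch (this assumes TensorFlow is installed; the layer sizes are arbitrary choices for illustration) showing that LSTM and GRU are drop-in replacements for each other, with the GRU carrying fewer parameters:
Code (sketch):
import tensorflow as tf
def make_model(cell="lstm", vocab_size=10000, seq_len=50):
    # Choose the recurrent layer; everything else stays the same
    rnn = tf.keras.layers.LSTM(64) if cell == "lstm" else tf.keras.layers.GRU(64)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len,), dtype="int32"),
        tf.keras.layers.Embedding(vocab_size, 32),   # word IDs -> vectors
        rnn,                                         # LSTM: 3 gates, GRU: 2 gates
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
lstm_model = make_model("lstm")
gru_model = make_model("gru")
# The GRU model has fewer parameters for the same hidden size:
print(lstm_model.count_params(), ">", gru_model.count_params())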
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CODE For Data Preprocessing
1. Tokenization
Tokenization means breaking text into small units. We will use the nltk library.
Code:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# Sample Input
text = "Mujhe robotics pasand hai."
# Tokenization
tokens = word_tokenize(text)
print("Tokenized Output:", tokens)
Output:
Tokenized Output: ['Mujhe', 'robotics', 'pasand', 'hai', '.']
2. Stopword Removal
Stopwords are words that are not very important for text analysis, such as "ka",
"ke", "hai", etc.
Code:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Sample Input
text = "Mujhe robotics pasand hai."
# Tokenization
tokens = word_tokenize(text)
# Stopword Removal
# Note: NLTK's stopword corpus has no Urdu list, so we use a small custom set
# here (matching the examples above); for English you could instead use
# stop_words = set(stopwords.words('english'))
stop_words = {"ka", "ke", "hai", "ko"}
# Keep words that are not stopwords, dropping punctuation tokens as well
filtered_tokens = [word for word in tokens
                   if word.lower() not in stop_words and word.isalnum()]
print("Stopword Removed:", filtered_tokens)
Output:
Stopword Removed: ['Mujhe', 'robotics', 'pasand']
3. Stemming and Lemmatization
The goal of both stemming and lemmatization is to bring a word to its base form. Stemming
is less accurate, lemmatization more accurate.
Code (Stemming):
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Initialize the stemmer
stemmer = PorterStemmer()
# Sample Input
text = "ladkiyan khel rahi hain"
# Tokenization
tokens = word_tokenize(text)
# Stemming (PorterStemmer targets English suffixes, so Roman Urdu words mostly
# pass through unchanged; an Urdu-aware stemmer would be needed to get roots
# like 'ladki' or 'rah')
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Stemming Output:", stemmed_tokens)
Output (PorterStemmer typically leaves these Roman Urdu words unchanged):
Stemming Output: ['ladkiyan', 'khel', 'rahi', 'hain']
Code (Lemmatization):
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Sample Input
text = "ladkiyan khel rahi hain"
# Tokenization
tokens = word_tokenize(text)
# Lemmatization (WordNetLemmatizer only covers English; words not found in
# WordNet are returned unchanged)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatization Output:", lemmatized_tokens)
Output:
Lemmatization Output: ['ladkiyan', 'khel', 'rahi', 'hain']
4. Lowercasing
Lowercasing means converting all of the words in the text to lowercase.
Code:
# Sample Input
text = "AI Aur Robotics Bohot Interesting Hain!"
# Lowercasing
lowercased_text = text.lower()
print("Lowercased Output:", lowercased_text)
Output:
Lowercased Output: ai aur robotics bohot interesting hain!
5. Removing Punctuation and Special Characters
Punctuation and special characters usually carry no useful signal, so removing them is an
important cleaning step.
Code:
import string
# Sample Input
text = "Hello!!! Aap kaisay hain???"
# Removing Punctuation: str.maketrans maps every punctuation character to None
clean_text = text.translate(str.maketrans("", "", string.punctuation))
print("Text without Punctuation:", clean_text)
Output:
Text without Punctuation: Hello Aap kaisay hain
6. Spelling Correction
To correct misspelled words, we use the TextBlob library.
Code:
from textblob import TextBlob
# Sample Input
text = "Mujhy robtiks pasnd hai."
# Spelling Correction (note: TextBlob's corrector is trained on English, so
# its suggestions on Roman Urdu text may differ from the intended words)
corrected_text = TextBlob(text).correct()
print("Corrected Text:", corrected_text)
Intended Output (the English-trained corrector may not reproduce this exactly):
Corrected Text: Mujhe robotics pasand hai.
7. Text Encoding (Vectorization)
Text has to be converted into numerical form for the machine to understand it. This can be
done in several ways:
Bag of Words (BoW):
from sklearn.feature_extraction.text import CountVectorizer
# Sample Input
corpus = ['Robotics interesting hai']
# Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Display the result
print("BoW Output:", [Link]())
print("Feature Names:", vectorizer.get_feature_names_out())
Output:
BoW Output: [[1 1 1]]
Feature Names: ['hai' 'interesting' 'robotics']
TF-IDF (Term Frequency-Inverse Document Frequency):
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample Input
corpus = ['Robotics interesting hai']
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
# Display the result
print("TF-IDF Output:", X_tfidf.toarray())
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())
Output:
TF-IDF Output: [[0.57735027 0.57735027 0.57735027]]
Feature Names: ['hai' 'interesting' 'robotics']
Word Embeddings (Word2Vec):
To implement word embeddings we can use the gensim library. The example below trains a
tiny model from scratch on one sentence; in practice, pre-trained models are often used instead.
from gensim.models import Word2Vec
# Sample Input (lowercased so that the lookup key below matches)
sentences = [['robotics', 'interesting', 'hai']]
# Train the model (min_count=1 keeps words that occur only once)
model = Word2Vec(sentences, min_count=1)
# Word Embedding Example
vector = model.wv['robotics']
print("Word2Vec Embedding for 'robotics':", vector)
Conclusion:
All of these steps are essential for preprocessing in NLP. You should now understand the
Python implementations of tokenization, stopword removal, stemming & lemmatization,
lowercasing, punctuation removal, spelling correction, and vectorization. You can apply
these techniques in your own NLP projects! 😊
If you have any other doubts or need any specific piece of code, do ask! 🚀