What is Data Preprocessing in NLP?
In Natural Language Processing (NLP), preprocessing means transforming raw text
data so that a machine can understand it and analyze it effectively. It is a
crucial step, because without proper preprocessing a model's performance can
suffer.
Let us now walk through, step by step, how data preprocessing is done in NLP:
1. Tokenization
👉 Meaning: Breaking a sentence or paragraph into small units (words or phrases).
🔹 Example:
💬 Input: "Mujhe robotics pasand hai."
🔹 Tokenized Output: ["Mujhe", "robotics", "pasand", "hai", "."]
Code: see the implementation section at the end of this document ("CODE For Data Preprocessing").
2. Stopword Removal
👉 Meaning: Removing words that are not very useful for NLP, such as "ka", "ke", "hai",
"ko", etc.
🔹 Example:
💬 Input: "Mujhe robotics pasand hai."
🔹 Stopword Removed: ["Mujhe", "robotics", "pasand"]
3. Stemming and Lemmatization
👉 Meaning: Converting a word to its root form (its base shape).
🔹 Example:
💬 Input: "ladkiyan khel rahi hain"
🔹 Stemming: ["ladki", "khel", "rah"]
🔹 Lemmatization: ["ladki", "khel", "rahi", "hai"] (this is generally more accurate)
4. Lowercasing
👉 Meaning: Converting all words to lowercase so that they remain consistent.
🔹 Example:
💬 Input: "AI Aur Robotics Bohot Interesting Hain!"
🔹 Lowercased Output: "ai aur robotics bohot interesting hain!"
5. Removing Punctuation and Special Characters
👉 Meaning: Removing extra punctuation marks or special symbols that the model does not
need.
🔹 Example:
💬 Input: "Hello!!! Aap kaisay hain???"
🔹 Output: "Hello Aap kaisay hain"
6. Spelling Correction
👉 Meaning: Converting misspelled words into their correct form.
🔹 Example:
💬 Input: "Mujhy robtiks pasnd hai."
🔹 Output: "Mujhe robotics pasand hai."
7. Text Encoding (Vectorization)
👉 Meaning: Converting words into numbers so that the machine can understand them. Some
common techniques are:
1️⃣ Bag of Words (BoW)
2️⃣ TF-IDF (Term Frequency-Inverse Document Frequency)
3️⃣ Word Embeddings (Word2Vec, GloVe, BERT, etc.)
🔹 Example:
💬 Sentence: "Robotics interesting hai"
🔹 Vector Form: [0.5, 0.8, 0.2, ...]
Conclusion
Data preprocessing in NLP is a crucial step that converts raw text into a clean, structured
form so that AI models can understand it properly. If you are working on an NLP project,
following these steps is essential! 😃🔥
If you need more detail on any specific step, just ask! 🚀
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1. LSTM (Long Short-Term Memory)
The Concept of LSTM:
An LSTM is a special type of Recurrent Neural Network (RNN) that can handle long-term
dependencies well while processing sequence data. RNNs feed their output back in as the
input to the next step, but when sequences get very long (such as a long sentence), RNNs
forget older information (this is called the vanishing gradient problem). LSTMs solve
this problem.
LSTM Structure (Gates):
An LSTM has 3 main gates, which decide which information to remember, which to forget,
and which to update:
1. Forget Gate:
o This gate does the job of "forgetting" old memory. That is, when a word's
importance drops, this gate decides how much of the old memory to erase.
o Formula:
f_t = σ(W_f * [h_(t-1), x_t] + b_f)
Here f_t is the forget gate's output, x_t is the current input, and h_(t-1) is
the previous hidden state.
2. Input Gate:
o This gate is responsible for adding new information. That is, when the model
needs to learn from new data, the input gate decides how much of the input to
write into memory.
o Formula:
i_t = σ(W_i * [h_(t-1), x_t] + b_i)
3. Output Gate:
o This gate is responsible for generating the final output, which represents the
model's decision at that step.
o Formula:
o_t = σ(W_o * [h_(t-1), x_t] + b_o)
Together these gates let the LSTM maintain its internal memory and filter the important
information at every time step, as shown in the sketch below.
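To make these formulas concrete, here is a minimal NumPy sketch of one LSTM cell step. It is an illustration under stated assumptions, not production code: the weights are random, and it also includes the candidate cell state c̃_t and the cell-state update c_t = f_t * c_(t-1) + i_t * c̃_t, standard LSTM components that the prose above does not spell out.
Code (sketch):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    # Concatenate previous hidden state and current input: [h_(t-1), x_t]
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)        # forget gate: how much old memory to keep
    i_t = sigmoid(W_i @ z + b_i)        # input gate: how much new info to write
    o_t = sigmoid(W_o @ z + b_o)        # output gate: how much memory to expose
    c_tilde = np.tanh(W_c @ z + b_c)    # candidate cell state (new information)
    c_t = f_t * c_prev + i_t * c_tilde  # internal memory update
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t
# Tiny demo with random weights: input size 4, hidden size 3
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
Ws = [rng.normal(size=(n_h, n_h + n_in)) for _ in range(4)]
bs = [np.zeros(n_h) for _ in range(4)]
h_t, c_t = lstm_cell_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), *Ws, *bs)
print("h_t:", h_t)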
2. GRU (Gated Recurrent Unit)
The Concept of GRU:
Like the LSTM, a GRU is a Recurrent Neural Network, but its structure is simpler. It also
handles long-term dependencies, while being somewhat more efficient and faster.
GRU Structure (Gates):
A GRU has only 2 gates, which update information and retain old information.
1. Update Gate:
o This gate decides how much of the old memory to update and how much new
information to add.
o Formula:
z_t = σ(W_z * [h_(t-1), x_t] + b_z)
2. Reset Gate:
o This gate does the job of resetting old memory. That is, when the old
information is no longer needed, this gate helps to discard it.
o Formula:
r_t = σ(W_r * [h_(t-1), x_t] + b_r)
A GRU differs from an LSTM in that it has fewer parameters, which makes it faster and less
computationally expensive. It is generally quite effective on smaller datasets. A minimal
sketch of one GRU cell step follows.
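As with the LSTM, here is a minimal NumPy sketch of one GRU cell step, with random weights purely for illustration. The candidate state h̃_t and the blend h_t = (1 - z_t) * h_(t-1) + z_t * h̃_t are standard GRU components not written out in the prose above.
Code (sketch):
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def gru_cell_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    z = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z + b_z)   # update gate: how much to refresh the memory
    r_t = sigmoid(W_r @ z + b_r)   # reset gate: how much old memory to ignore
    # Candidate hidden state: old memory is scaled by the reset gate first
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    # Update gate blends old memory with the candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde
# Tiny demo: input size 4, hidden size 3, random weights
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
Ws = [rng.normal(size=(n_h, n_h + n_in)) for _ in range(3)]
bs = [np.zeros(n_h) for _ in range(3)]
h_t = gru_cell_step(rng.normal(size=n_in), np.zeros(n_h), *Ws, *bs)
print("h_t:", h_t)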
3. Transformers
The Concept of Transformers:
Transformers are based on the self-attention mechanism, in which every word learns its
relationship to all the other words in its sequence. Their key advantage is that they can
process in parallel, whereas LSTMs and GRUs process one word at a time.
Attention Mechanism (Self-Attention):
The goal of the attention mechanism is to assign an importance to each word, which helps in
understanding its context. In a Transformer, self-attention means that each word measures its
relevance against all the other words in order to understand its own context.
1. Query, Key, and Value:
o Query (Q), Key (K), and Value (V) are tensors computed for every word.
o The Query is compared with the other words' Keys to calculate the relevance
(a NumPy sketch of this computation follows the list below).
o Formula:
Attention = softmax((QK^T) / sqrt(d_k)) * V
where d_k is the dimension of the keys.
2. Multi-Head Attention:
o In a Transformer, several attention mechanisms are applied at the same time
(multiple "heads"), which helps the model build richer representations.
o The benefit is that the model understands the context from several different
aspects at once.
3. Positional Encoding:
o A Transformer must understand the order of the sequence, so a positional
encoding is added to tell the model in which order the words appear in the
sequence.
4. Feed-Forward Neural Networks:
o After each attention layer there is a feed-forward neural network that
transforms each word's representation.
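As promised above, here is a minimal NumPy sketch of scaled dot-product self-attention, together with the sinusoidal positional encoding from the original Transformer paper. Q, K, and V are random here purely for illustration; in a real model they come from learned projections of the (position-encoded) word embeddings.
Code (sketch):
import numpy as np
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V):
    # Attention = softmax(QK^T / sqrt(d_k)) * V, as in the formula above
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # relevance of each query to each key
    weights = softmax(scores, axis=-1)  # one probability row per word
    return weights @ V                  # weighted mix of the value vectors
def sinusoidal_positional_encoding(seq_len, d_model):
    # sin on even dimensions, cos on odd dimensions, per position
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
# Toy example: 3 words ("Robotics interesting hai"), model dimension 4
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4)) + sinusoidal_positional_encoding(3, 4)
Q = K = V = emb   # self-attention: all three come from the same sequence
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per word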
Advantages of Transformers:
Parallelization: Compared with LSTMs and GRUs, Transformers process the whole
sequence at once, which is faster.
Handling Long Dependencies: They handle long-term dependencies efficiently.
Popular Transformer Models:
BERT (Bidirectional Encoder Representations from Transformers): This model understands
context in both directions (left-to-right and right-to-left).
GPT (Generative Pre-trained Transformer): This model is unidirectional, i.e. left-to-right.
Summary Comparison
LSTM: Handles long-term memory, more complex, 3 gates (forget, input, output).
GRU: Simpler and faster than LSTM, 2 gates (update, reset).
Transformers: Use the self-attention mechanism to handle long dependencies efficiently
and benefit from parallel processing.
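To try the recurrent layers in practice, here is a minimal Keras sketch (this assumes TensorFlow is installed; the layer sizes are arbitrary choices for illustration) showing that LSTM and GRU are drop-in replacements for each other, with the GRU carrying fewer parameters:
Code (sketch):
import tensorflow as tf
def make_model(cell="lstm", vocab_size=10000, seq_len=50):
    # Choose the recurrent layer; everything else stays the same
    rnn = tf.keras.layers.LSTM(64) if cell == "lstm" else tf.keras.layers.GRU(64)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len,), dtype="int32"),
        tf.keras.layers.Embedding(vocab_size, 32),   # word IDs -> vectors
        rnn,                                         # LSTM: 3 gates, GRU: 2 gates
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
lstm_model = make_model("lstm")
gru_model = make_model("gru")
# The GRU model has fewer parameters for the same hidden size:
print(lstm_model.count_params(), ">", gru_model.count_params())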
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CODE For Data Preprocessing
1. Tokenization
Tokenization means breaking text into small units. We will use the nltk library.
Code:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# Sample Input
text = "Mujhe robotics pasand hai."
# Tokenization
tokens = word_tokenize(text)
print("Tokenized Output:", tokens)
Output:
Tokenized Output: ['Mujhe', 'robotics', 'pasand', 'hai', '.']
2. Stopword Removal
Stopwords are words that are not very important for text analysis, such as "ka",
"ke", "hai", etc.
Code:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Sample Input
text = "Mujhe robotics pasand hai."
# Tokenization
tokens = word_tokenize(text)
# Stopword Removal
# Note: NLTK's stopword corpus has no Urdu list, so we use a small custom set
# here (matching the examples above); for English you could instead use
# stop_words = set(stopwords.words('english'))
stop_words = {"ka", "ke", "hai", "ko"}
# Keep words that are not stopwords, dropping punctuation tokens as well
filtered_tokens = [word for word in tokens
                   if word.lower() not in stop_words and word.isalnum()]
print("Stopword Removed:", filtered_tokens)
Output:
Stopword Removed: ['Mujhe', 'robotics', 'pasand']
3. Stemming and Lemmatization
The goal of both stemming and lemmatization is to bring a word to its base form. Stemming
is less accurate, lemmatization more accurate.
Code (Stemming):
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Initialize the stemmer
stemmer = PorterStemmer()
# Sample Input
text = "ladkiyan khel rahi hain"
# Tokenization
tokens = word_tokenize(text)
# Stemming (PorterStemmer targets English suffixes, so Roman Urdu words mostly
# pass through unchanged; an Urdu-aware stemmer would be needed to get roots
# like 'ladki' or 'rah')
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Stemming Output:", stemmed_tokens)
Output (PorterStemmer typically leaves these Roman Urdu words unchanged):
Stemming Output: ['ladkiyan', 'khel', 'rahi', 'hain']
Code (Lemmatization):
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Sample Input
text = "ladkiyan khel rahi hain"
# Tokenization
tokens = word_tokenize(text)
# Lemmatization (WordNetLemmatizer only covers English; words not found in
# WordNet are returned unchanged)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatization Output:", lemmatized_tokens)
Output:
Lemmatization Output: ['ladkiyan', 'khel', 'rahi', 'hain']
4. Lowercasing
Lowercasing means converting all of the words in the text to lowercase.
Code:
# Sample Input
text = "AI Aur Robotics Bohot Interesting Hain!"
# Lowercasing
lowercased_text = text.lower()
print("Lowercased Output:", lowercased_text)
Output:
Lowercased Output: ai aur robotics bohot interesting hain!
5. Removing Punctuation and Special Characters
Punctuation and special characters usually carry no useful signal, so removing them is an
important cleaning step.
Code:
import string
# Sample Input
text = "Hello!!! Aap kaisay hain???"
# Removing Punctuation: str.maketrans maps every punctuation character to None
clean_text = text.translate(str.maketrans("", "", string.punctuation))
print("Text without Punctuation:", clean_text)
Output:
Text without Punctuation: Hello Aap kaisay hain
6. Spelling Correction
To correct misspelled words, we use the TextBlob library.
Code:
from textblob import TextBlob
# Sample Input
text = "Mujhy robtiks pasnd hai."
# Spelling Correction (note: TextBlob's corrector is trained on English, so
# its suggestions on Roman Urdu text may differ from the intended words)
corrected_text = TextBlob(text).correct()
print("Corrected Text:", corrected_text)
Intended Output (the English-trained corrector may not reproduce this exactly):
Corrected Text: Mujhe robotics pasand hai.
7. Text Encoding (Vectorization)
Text has to be converted into numerical form for the machine to understand it. This can be
done in several ways:
Bag of Words (BoW):
from sklearn.feature_extraction.text import CountVectorizer
# Sample Input
corpus = ['Robotics interesting hai']
# Bag of Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# Display the result
print("BoW Output:", [Link]())
print("Feature Names:", vectorizer.get_feature_names_out())
Output:
BoW Output: [[1 1 1]]
Feature Names: ['hai' 'interesting' 'robotics']
TF-IDF (Term Frequency-Inverse Document Frequency):
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample Input
corpus = ['Robotics interesting hai']
# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
# Display the result
print("TF-IDF Output:", X_tfidf.toarray())
print("Feature Names:", tfidf_vectorizer.get_feature_names_out())
Output:
TF-IDF Output: [[0.57735027 0.57735027 0.57735027]]
Feature Names: ['hai' 'interesting' 'robotics']
Word Embeddings (Word2Vec):
To implement word embeddings we can use the gensim library. The example below trains a
tiny model from scratch on one sentence; in practice, pre-trained models are often used instead.
from gensim.models import Word2Vec
# Sample Input (lowercased so that the lookup key below matches)
sentences = [['robotics', 'interesting', 'hai']]
# Train the model (min_count=1 keeps words that occur only once)
model = Word2Vec(sentences, min_count=1)
# Word Embedding Example
vector = model.wv['robotics']
print("Word2Vec Embedding for 'robotics':", vector)
Conclusion:
All of these steps are essential for preprocessing in NLP. You should now understand the
Python implementations of tokenization, stopword removal, stemming & lemmatization,
lowercasing, punctuation removal, spelling correction, and vectorization. You can apply
these techniques in your own NLP projects! 😊
If you have any other doubts or need any specific piece of code, do ask! 🚀