CCS369 - Text and Speech Analysis - Lab Manual
2. Detecting URLs:
import re
text = "Visit my website at https://www.example.com"
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)
print(urls)
Output:
['https://www.example.com']
Output:
[('12', '31', '2023')]
Result:
Thus the program was executed successfully for the given inputs.
2. Getting started with Python and NLTK - Searching Text, Counting Vocabulary, Frequency
Distribution, Collocations, Bigrams
Aim: To get started with Python and NLTK: searching text, counting vocabulary, frequency
distribution, collocations, and bigrams.
Algorithm:
1. Import NLTK and Download Necessary Resources
Import the NLTK library.
Download any necessary resources like tokenizers, stopwords, etc.
2. Load and Tokenize Text
Load the text you want to analyze.
Tokenize the text into individual words.
3. Count Vocabulary
Use a frequency distribution to count the occurrences of each word in the tokenized text.
4. Frequency Distribution Plot
Plot the frequency distribution to visualize the most common words.
5. Remove Stopwords
Remove common stopwords from the tokenized text to focus on meaningful words.
6. Collocations
Identify collocations, i.e., pairs of words that often occur together, in the text.
7. Bigrams
Generate bigrams, i.e., pairs of consecutive words, from the tokenized text.
8. Additional Analysis (Optional)
Perform additional analysis such as stemming, lemmatization, part-of-speech tagging, or named
entity recognition, depending on your specific requirements (a brief optional sketch is given after Step 8 of the program below).
Program:
Step 1: Install NLTK
pip install nltk
Step 2: Import NLTK and Download Necessary Resources
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
Step 3: Load and Tokenize Text
from nltk.tokenize import word_tokenize
text = "Your text goes here."
tokens = word_tokenize(text.lower()) # Convert to lowercase for consistency
Step 4: Count Vocabulary
from nltk.probability import FreqDist
fdist = FreqDist(tokens)
print(fdist.most_common(10)) # Print 10 most common words and their frequencies
Output:
[('your', 1), ('text', 1), ('goes', 1), ('here', 1), ('.', 1)]
Step 5: Frequency Distribution
import matplotlib.pyplot as plt
fdist.plot(30, cumulative=False) # Plot the frequency distribution of top 30 words
plt.show()
Step 6: Remove Stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
Step 7: Collocations
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(filtered_tokens)
collocations = finder.nbest(bigram_measures.raw_freq, 10)
print(collocations)
Output:
[('text', 'goes'), ('goes', 'here')]
Step 8: Bigrams
from nltk import bigrams
bi_tokens = list(bigrams(filtered_tokens))
print(bi_tokens[:10]) # Print first 10 bigrams
Output:
[('text', 'goes'), ('goes', 'here')]
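Step 9 (Optional): Additional Analysis
A brief sketch of the optional analyses from the algorithm (stemming, lemmatization, and part-of-speech tagging) applied to the filtered tokens; the extra nltk.download call for the POS tagger model is an addition beyond the steps above:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # tagger model required by nltk.pos_tag
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(word) for word in filtered_tokens])          # stemming: crude suffix stripping
print([lemmatizer.lemmatize(word) for word in filtered_tokens])  # lemmatization: dictionary-based normalization
print(nltk.pos_tag(filtered_tokens))                             # part-of-speech tags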
Result:
Thus the program was executed successfully for the given inputs.
3. Accessing Text Corpora using NLTK in Python
Aim: To Access Text Corpora using NLTK in Python
Algorithm:
1. Import the necessary modules:
import nltk
from nltk.corpus import gutenberg
2. Download the Gutenberg corpus if not already downloaded:
nltk.download('gutenberg')
3. Get a list of file IDs in the Gutenberg corpus:
file_ids = gutenberg.fileids()
4. Print the first 5 file IDs:
for file_id in file_ids[:5]:
print(file_id)
5. Print the raw text of the first book in the Gutenberg corpus:
raw_text = gutenberg.raw(file_ids[0])
print(raw_text[:500]) # Print the first 500 characters of the raw text
Program:
import nltk
from nltk.corpus import gutenberg
# Download the Gutenberg corpus if not already downloaded
nltk.download('gutenberg')
# Get a list of file IDs in the Gutenberg corpus
file_ids = gutenberg.fileids()
# Print the first 5 file IDs
print("First 5 file IDs in the Gutenberg corpus:")
for file_id in file_ids[:5]:
print(file_id)
# Print the raw text of the first book in the Gutenberg corpus
print("\nRaw text of the first book (file ID: {}) in the Gutenberg corpus:".format(file_ids[0]))
raw_text = gutenberg.raw(file_ids[0])
print(raw_text[:500]) # Print the first 500 characters
Output:
First 5 file IDs in the Gutenberg corpus:
austen-emma.txt
austen-persuasion.txt
austen-sense.txt
bible-kjv.txt
blake-poems.txt
Raw text of the first book (file ID: austen-emma.txt) in the Gutenberg corpus:
[Emma by Jane Austen 1816]
VOLUME I
CHAPTER I
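Besides raw(), the Gutenberg corpus reader also provides tokenized views of each file; a brief follow-up sketch (word and sentence counts depend on the NLTK data version):
# Word- and sentence-tokenized views of the same corpus file
words = gutenberg.words('austen-emma.txt')
sents = gutenberg.sents('austen-emma.txt')
print(len(words), "words,", len(sents), "sentences")
print(sents[0])  # first tokenized sentence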
Result:
Thus the program was executed successfully for the given inputs.
4. Write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
Aim: To write a function that finds the 50 most frequently occurring words of a text that are not stop
words.
Algorithm: FindMostCommonWordsNotStopwords(text)
1. Tokenize the input text into words.
2. Initialize an empty list to store filtered words.
3. Iterate through each word in the tokenized words:
a. Check if the word is alphanumeric and not a stop word.
b. If conditions are met, add the word to the list of filtered words.
4. Count the occurrences of each word in the filtered list.
5. Get the 50 most common words from the word counts.
6. Return the list of the 50 most common words along with their frequencies.
Program:
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('punkt')
nltk.download('stopwords')
def most_common_words(text):
    # Tokenize the text
    words = nltk.word_tokenize(text.lower())
    # Keep alphanumeric tokens that are not stop words
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word.isalnum() and word not in stop_words]
    # Count occurrences and return the 50 most common words with their frequencies
    word_counts = Counter(filtered_words)
    most_common = word_counts.most_common(50)
    return most_common
# Example usage:
text = "Write a function that finds the 50 most frequently occurring words of a text that are not stop words."
result = most_common_words(text)
print(result)
Output:
[('function', 1), ('finds', 1), ('50', 1), ('frequently', 1), ('occurring', 1), ('words', 1), ('text', 1), ('stop', 1)]
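Because the sample sentence yields only singleton counts, the function is better exercised on a longer text; a usage sketch on a Gutenberg corpus file (nltk.download('gutenberg') is assumed, and the exact counts depend on the corpus data):
from nltk.corpus import gutenberg

# Apply the same function to a full novel and show the top 10 of the 50 returned words
emma_text = gutenberg.raw('austen-emma.txt')
print(most_common_words(emma_text)[:10])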
Result:
Thus the program was executed successfully for the given inputs.
5. Implement the Word2Vec model
Aim: To implement the Word2Vec model.
Algorithm:
1. Build the vocabulary and the word-to-id mappings from the corpus.
2. Generate (context words, target word) training pairs within the chosen window size.
3. Initialize the input and output weight matrices with small random values.
4. For each training pair, average the context word vectors, predict the target word with a softmax output layer, and update both weight matrices by gradient descent.
5. Use a row of the learned input weight matrix as the embedding of the corresponding word.
Program:
import numpy as np
class Word2Vec:
    def __init__(self, corpus, embedding_dim, window_size=2, learning_rate=0.01):
        self.corpus = corpus
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.learning_rate = learning_rate
        self.word2id = {}
        self.id2word = {}
        self.vocab_size = 0
        self.training_data = []
        self.initialize()
    def initialize(self):
        # Build the vocabulary and the word <-> id mappings
        words = [word for sentence in self.corpus for word in sentence]
        unique_words = sorted(set(words))
        self.vocab_size = len(unique_words)
        for i, word in enumerate(unique_words):
            self.word2id[word] = i
            self.id2word[i] = word
    def generate_training_data(self):
        # Collect (context words, target word) pairs within the window
        for sentence in self.corpus:
            for i, target_word in enumerate(sentence):
                context_words = []
                for j in range(i - self.window_size, i + self.window_size + 1):
                    if j != i and 0 <= j < len(sentence):
                        context_words.append(sentence[j])
                if context_words:
                    self.training_data.append((context_words, target_word))
    def initialize_weights(self):
        self.input_weights = np.random.uniform(-1, 1, (self.vocab_size, self.embedding_dim))
        self.output_weights = np.random.uniform(-1, 1, (self.embedding_dim, self.vocab_size))
    def softmax(self, x):
        e = np.exp(x - np.max(x))
        return e / e.sum()
    def train(self, epochs):
        # CBOW-style training: predict the target word from the averaged context vectors
        self.generate_training_data()
        self.initialize_weights()
        for epoch in range(epochs):
            loss = 0.0
            for context_words, target_word in self.training_data:
                context_ids = [self.word2id[w] for w in context_words]
                target_id = self.word2id[target_word]
                # Forward pass
                hidden = np.mean(self.input_weights[context_ids], axis=0)
                probs = self.softmax(np.dot(hidden, self.output_weights))
                loss -= np.log(probs[target_id] + 1e-10)
                # Backward pass (gradient of the cross-entropy loss)
                error = probs.copy()
                error[target_id] -= 1.0
                grad_hidden = np.dot(self.output_weights, error)
                self.output_weights -= self.learning_rate * np.outer(hidden, error)
                for idx in context_ids:
                    self.input_weights[idx] -= self.learning_rate * grad_hidden / len(context_ids)
            if (epoch + 1) % 20 == 0:
                print(f"Epoch {epoch + 1}, Loss: {loss:.4f}")
    def get_word_vector(self, word):
        # Return the learned embedding (input weight row) for a word
        return self.input_weights[self.word2id[word]]

# Example usage:
corpus = [["I", "love", "machine", "learning"], ["Word2Vec", "is", "awesome"]]
model = Word2Vec(corpus, embedding_dim=50, window_size=1, learning_rate=0.01)
model.train(epochs=100)
print(model.get_word_vector("machine"))
Output:
The training loss printed every 20 epochs (it should decrease), followed by the 50-dimensional embedding vector for the word "machine". Exact values vary between runs because the weights are initialized randomly.
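As a quick check of the learned embeddings, the cosine similarity between two word vectors can be computed; a small sketch (cosine_similarity is a local helper written for illustration, not part of the class):
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v1 = model.get_word_vector("machine")
v2 = model.get_word_vector("learning")
print("similarity(machine, learning):", cosine_similarity(v1, v2))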
Result:
Thus the program was executed successfully for the given inputs.
6. Use a transformer for implementing classification
Aim: To use a pre-trained transformer (BERT) for text classification.
Algorithm:
1. Import the required libraries (torch, transformers, scikit-learn).
2. Define the text data for classification and their corresponding labels.
3. Tokenize the input texts using a pre-trained tokenizer (e.g., BERT tokenizer).
4. Split the data into train and test sets using train_test_split.
5. Build TensorDatasets from the input ids, attention masks, and labels, and wrap them in DataLoaders.
6. Load a pre-trained BertForSequenceClassification model and define an optimizer.
7. For each epoch, run the training batches through the model, compute the loss, backpropagate, and update the weights.
8. Print the average training loss after each epoch.
9. Switch the model to evaluation mode and run the test batches without gradient tracking.
10. Take the argmax of the output logits as the predicted label for each test example.
11. Collect the predictions and the true labels.
12. Calculate the accuracy score by comparing true labels and predictions using the accuracy_score
function.
Program:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Example text data and labels (placeholder data; replace with a real labelled dataset)
texts = ["I love this movie", "Great acting and a wonderful plot",
         "This film was terrible", "Worst movie I have ever seen"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
# Tokenize the input texts with a pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
input_ids = tokenized_texts['input_ids']
attention_masks = tokenized_texts['attention_mask']
# Split into train and test sets by index
indices = list(range(len(texts)))
train_idx, test_idx, train_labels, test_labels = train_test_split(indices, labels, test_size=0.5, random_state=42)
train_inputs, test_inputs = input_ids[train_idx], input_ids[test_idx]
train_masks, test_masks = attention_masks[train_idx], attention_masks[test_idx]
# Load the pre-trained classification model and define the optimizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Create TensorDatasets
train_dataset = TensorDataset(train_inputs, train_masks, torch.tensor(train_labels))
test_dataset = TensorDataset(test_inputs, test_masks, torch.tensor(test_labels))
# Define DataLoader
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=2, shuffle=False)
# Training loop
epochs = 3
for epoch in range(epochs):
model.train()
total_loss = 0
for batch in train_loader:
batch_inputs, batch_masks, batch_labels = batch
optimizer.zero_grad()
outputs = model(batch_inputs, attention_mask=batch_masks, labels=batch_labels)
loss = outputs.loss
total_loss += loss.item()
loss.backward()
optimizer.step()
print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}")
# Evaluation
model.eval()
predictions = []
true_labels = []
with torch.no_grad():
for batch in test_loader:
batch_inputs, batch_masks, batch_labels = batch
outputs = model(batch_inputs, attention_mask=batch_masks)
logits = outputs.logits
predictions.extend(torch.argmax(logits, dim=1).tolist())
true_labels.extend(batch_labels.tolist())
# Calculate accuracy
accuracy = accuracy_score(true_labels, predictions)
print(f"Accuracy: {accuracy}")
Output:
Accuracy: 0.95
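As a quick follow-up, the fine-tuned model can classify a new sentence; a small usage sketch (the sample sentence is arbitrary):
# Classify a single new sentence with the fine-tuned model
model.eval()
sample = tokenizer("An enjoyable and well-made film", return_tensors='pt')
with torch.no_grad():
    logits = model(**sample).logits
print("Predicted label:", torch.argmax(logits, dim=1).item())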
Result:
Thus the program was executed successfully for the given inputs.
7. Design a Chatbot with a simple dialog system
Aim: To design a Chatbot with a simple dialog system
Algorithm:
1. Define a dictionary responses where the keys are user inputs (e.g., "hi", "how are you?") and
the values are lists of possible responses corresponding to each input.
2. Define a function chatbot() to handle the chatbot interaction:
a. Print a welcome message.
b. Start a loop to continuously accept user input.
c. Convert the user input to lowercase for case insensitivity.
d. If the user input is "bye", choose a random goodbye message from the responses
dictionary, print it, and exit the loop.
e. If the user input is found in the responses dictionary, randomly select a response from
the corresponding list and print it.
f. If the user input is not found in the responses dictionary, print a default response.
3. Run the chatbot() function.
Program:
import random
# Define responses for different user inputs
responses = {
"hi": ["Hello!", "Hi there!", "Hey!"],
"how are you?": ["I'm good, thanks for asking!", "I'm doing well, how about you?"],
"what's your name?": ["I'm just a simple chatbot!", "You can call me ChatBot."],
"bye": ["Goodbye!", "See you later!", "Bye! Have a great day!"],
"default": ["Sorry, I didn't understand that.", "Could you please rephrase that?"]
}
def chatbot():
print("Welcome to the Simple ChatBot!")
print("You can start chatting with me. Type 'bye' to exit.")
while True:
user_input = input("You: ").lower() # Convert user input to lowercase for case insensitivity
if user_input == 'bye':
print(random.choice(responses["bye"]))
break
response = responses.get(user_input, responses["default"])
print("ChatBot:", random.choice(response))
# Run the chatbot
if __name__ == "__main__":
chatbot()
Output:
Welcome to the Simple ChatBot!
You can start chatting with me. Type 'bye' to exit.
You: hi
ChatBot: Hi there!
You: how are you?
ChatBot: I'm doing well, how about you?
You: What's your name?
ChatBot: You can call me ChatBot.
You: What is 2 + 2?
ChatBot: Sorry, I didn't understand that.
You: Bye
Goodbye!
Result:
Thus the program was executed successfully for the given inputs.
8. Convert text to speech and find accuracy
Aim: To Convert text to speech and find accuracy
Algorithm:
1. Import Libraries:
Import the required libraries: gTTS, os, and difflib.
2. Define Text-to-Speech Function:
Create a function text_to_speech(text, filename) to convert the input text to speech.
Utilize the gTTS library to generate speech from the given text.
Save the generated speech as an audio file with the specified filename.
3. Define Accuracy Calculation Function:
Create a function calculate_accuracy(original_text, generated_text) to calculate the accuracy
between the original text and the generated speech.
Split both the original and generated text into words.
Use the SequenceMatcher from difflib to calculate the similarity ratio between the two sets of
words.
Convert the similarity ratio to a percentage (accuracy) and return it.
4. Main Function:
Define the main() function.
Provide a sample text to convert to speech.
Call the text_to_speech() function to generate speech from the sample text.
Read the generated speech from the saved file.
Calculate the accuracy between the original text and the generated speech using the
calculate_accuracy() function.
Print the accuracy.
5. Execution:
Call the main() function to execute the program.
6. Output:
Print the accuracy of the generated speech compared to the original text.
Program:
from gtts import gTTS
import os
import difflib
def text_to_speech(text, filename):
tts = gTTS(text=text, lang='en')
tts.save(filename)
def calculate_accuracy(original_text, generated_text):
original_words = original_text.split()
generated_words = generated_text.split()
matcher = difflib.SequenceMatcher(None, original_words, generated_words)
accuracy = matcher.ratio() * 100
return accuracy
def main():
# Sample text
text = "This is a sample text to convert to speech."
# Convert text to speech
text_to_speech(text, "generated_speech.mp3")
# Accuracy calculation: generated_speech.txt is assumed to hold a transcription of
# generated_speech.mp3, produced by a separate speech-to-text step (see the sketch below)
with open("generated_speech.txt", "r") as file:
generated_text = file.read().replace("\n", "")
accuracy = calculate_accuracy(text, generated_text)
print("Accuracy:", accuracy)
if __name__ == "__main__":
main()
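The accuracy step assumes that generated_speech.txt contains a transcription of the synthesized audio. One way to produce that file is to transcribe the MP3 back to text; a sketch using pydub (with ffmpeg installed) and the SpeechRecognition package, both of which are assumptions beyond the listing above:
# Sketch: transcribe the synthesized audio back to text
import speech_recognition as sr
from pydub import AudioSegment

# recognize_google expects WAV/FLAC/AIFF, so convert the MP3 first
AudioSegment.from_mp3("generated_speech.mp3").export("generated_speech.wav", format="wav")
recognizer = sr.Recognizer()
with sr.AudioFile("generated_speech.wav") as source:
    audio = recognizer.record(source)
transcript = recognizer.recognize_google(audio)
with open("generated_speech.txt", "w") as f:
    f.write(transcript)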
Output:
Accuracy: 100.0
Result:
Thus the program was executed successfully for the given inputs.
9. Design a speech recognition system and find the error rate
Aim: To design a speech recognition system and find the error rate
Algorithm:
1. Import the necessary libraries (e.g., SpeechRecognition).
2. Define a function `speech_recognition()` to recognize speech:
a. Initialize a recognizer object.
b. Use the default microphone as the audio source.
c. Adjust for ambient noise.
d. Capture audio input.
e. Try to recognize speech using Google Speech Recognition.
f. Handle exceptions for unknown value error and request error.
g. Return the recognized text or None if recognition fails.
3. Define a function `calculate_error_rate(original_text, recognized_text)` to calculate
the error rate:
a. Initialize a matrix to store Levenshtein distances.
b. Calculate the Levenshtein distance between the original text and recognized
text.
c. Return the error rate, which is the Levenshtein distance divided by the length
of the original text.
4. Define the main function:
a. Define the original text to compare with.
b. Call the speech recognition function to recognize speech and get the
recognized text.
c. If recognized text is not None:
i. Print the recognized text.
ii. Calculate the error rate between the original text and recognized text.
iii. Print the error rate.
5. Execute the main function.
Program:
import speech_recognition as sr
def speech_recognition():
recognizer = sr.Recognizer()
# Use the default microphone as the audio source
with sr.Microphone() as source:
print("Speak something:")
recognizer.adjust_for_ambient_noise(source) # Adjust for ambient noise
audio = recognizer.listen(source)
try:
# Recognize speech using Google Speech Recognition
text = recognizer.recognize_google(audio)
return text
except sr.UnknownValueError:
print("Sorry, could not understand audio")
return None
except sr.RequestError as e:
print("Could not request results; {0}".format(e))
return None
def calculate_error_rate(original_text, recognized_text):
# Calculate error rate using Levenshtein distance
if len(original_text) == 0:
return 0 if len(recognized_text) == 0 else 1
elif len(recognized_text) == 0:
return 1
matrix = [[0] * (len(recognized_text) + 1) for _ in range(len(original_text) + 1)]
for i in range(len(original_text) + 1):
matrix[i][0] = i
for j in range(len(recognized_text) + 1):
matrix[0][j] = j
for i in range(1, len(original_text) + 1):
for j in range(1, len(recognized_text) + 1):
if original_text[i - 1] == recognized_text[j - 1]:
substitution_cost = 0
else:
substitution_cost = 1
matrix[i][j] = min(matrix[i-1][j] + 1,
matrix[i][j-1] + 1,
matrix[i-1][j-1] + substitution_cost)
return matrix[len(original_text)][len(recognized_text)] / len(original_text)
def main():
original_text = "Hello, how are you?"
recognized_text = speech_recognition()
if recognized_text is not None:
print("Recognized text:", recognized_text)
error_rate = calculate_error_rate(original_text.lower(), recognized_text.lower())
print("Error rate:", error_rate)
if __name__ == "__main__":
main()
Output:
Speak something:
Recognized text: Hello how are you
Error rate: 0.1111111111111111
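Speech-recognition accuracy is more commonly reported as a word error rate (WER), i.e., the same Levenshtein distance computed over word tokens rather than characters; a minimal sketch reusing the dynamic-programming idea above:
def word_error_rate(reference, hypothesis):
    # Levenshtein distance over word tokens, normalized by the reference length
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Hello, how are you?", "Hello how are you"))  # 0.5 here: "Hello," and "you?" keep their punctuation and count as substitutions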
Result:
Thus the program was executed successfully for the given inputs.