Introduction to Speech Recognition
• Converts spoken language into text
• Also called Automatic Speech Recognition (ASR)
• Bridges the gap between human speech and computers
Real-World Use Cases
• Voice assistants: Siri, Alexa, Google Assistant
• Speech-to-text transcription (e.g., YouTube captions)
• Voice search and commands
• Automated customer service
• Accessibility for people with disabilities
Typical Speech Recognition Pipeline
1. Audio Input
2. Preprocessing (noise removal, normalization)
3. Feature Extraction (MFCCs, spectrograms)
4. Acoustic Model
5. Language Model
6. Decoder
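The feature-extraction step above can be sketched in a few lines. This is a minimal illustration, assuming NumPy: it splits a waveform into overlapping windowed frames and takes the magnitude of each frame's FFT, which is the basic spectrogram that MFCCs are built on. The frame and hop sizes (25 ms and 10 ms at 16 kHz) are common defaults, not requirements.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Split a waveform into overlapping Hann-windowed frames and
    take the magnitude of each frame's FFT (a basic spectrogram)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201): 98 frames, 201 frequency bins
```

Real systems go further (mel filterbanks, log compression, cepstral transform), but every step operates on a frame-by-frame representation like this one.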
Why Speech Recognition is Hard
• Accents and dialects
• Background noise
• Speaker variations
• Homophones (e.g., write vs. right)
• Real-time performance needs
Components of a Speech
Recognition System
• • Acoustic Model: Maps audio to phonemes
• • Language Model: Predicts word sequence
• • Lexicon: Phonemes to words
• • Decoder: Combines models for transcription
Tools for Speech Recognition
• • Google Speech-to-Text API
• • CMU Sphinx (PocketSphinx)
• • Mozilla DeepSpeech
• • Facebook Wav2Vec 2.0
• • Python’s SpeechRecognition library
Example in Python
• import speech_recognition as sr
• r = sr.Recognizer()
• with sr.Microphone() as source:
• print('Say something...')
• audio = r.listen(source)
• try:
• print('You said: ' +
Speech Recognition – Key Points
• • Converts voice to text using ML
• • Involves acoustic and language modeling
• • Many real-world applications
• • Python and APIs make it accessible