#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
LEO Text Processor
This module processes text files for intent generation.
"""
import os
import logging
import re
from collections import Counter


class TextProcessor:
    """Processes text files for intent generation."""

    def __init__(self):
        """Initialize the text processor."""
        # Progress (0-100) and status-message callbacks; no-op defaults that
        # callers may override.
        self.on_progress = lambda p: None
        self.on_status = lambda s: None

    def process(self, file_path):
        """
        Process a text file.

        Args:
            file_path (str): Path to the text file

        Returns:
            dict: Processed result with 'text', 'sentences' and 'key_phrases' keys
        """
        try:
            self.on_status(f"Processing text file: {os.path.basename(file_path)}")
            self.on_progress(10)

            # Read file
            with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
                text = f.read()
            self.on_progress(30)

            # Basic preprocessing
            self.on_status("Cleaning text...")

            # Remove extra whitespace
            text = re.sub(r'\s+', ' ', text)

            # Split into sentences
            self.on_status("Splitting into sentences...")
            sentences = self._split_into_sentences(text)
            self.on_progress(70)

            # Extract key phrases
            self.on_status("Extracting key phrases...")
            key_phrases = self._extract_key_phrases(sentences)
            self.on_progress(90)

            # Combine results
            result = {
                'text': text,
                'sentences': sentences,
                'key_phrases': key_phrases
            }

            self.on_progress(100)
            self.on_status("Text processing complete")
            return result
        except Exception as e:
            logging.error(f"Error processing text file: {str(e)}", exc_info=True)
            raise

    def _split_into_sentences(self, text):
        """
        Split text into sentences.

        Args:
            text (str): Text to split

        Returns:
            list: List of sentences
        """
        # Simple sentence splitting on terminal punctuation followed by whitespace
        sentences = re.split(r'(?<=[.!?])\s+', text)

        # Filter out empty sentences
        sentences = [s.strip() for s in sentences if s.strip()]
        return sentences

    def _extract_key_phrases(self, sentences):
        """
        Extract key phrases from sentences.

        Args:
            sentences (list): List of sentences

        Returns:
            list: List of key phrases
        """
        # Try to use spaCy if available
        try:
            import spacy

            # Load spaCy model
            nlp = spacy.load("en_core_web_sm")

            key_phrases = []
            for sentence in sentences:
                doc = nlp(sentence)

                # Extract noun phrases
                for chunk in doc.noun_chunks:
                    if len(chunk.text.split()) > 1:  # Only multi-word phrases
                        key_phrases.append(chunk.text)

                # Extract verb phrases: a verb plus its direct/prepositional objects
                for token in doc:
                    if token.pos_ == "VERB":
                        phrase = token.text
                        for child in token.children:
                            if child.dep_ in ["dobj", "pobj"]:
                                phrase += " " + child.text
                        key_phrases.append(phrase)
            return key_phrases
        except (ImportError, OSError):
            # Fallback to simple approach if spaCy (or its model) is not available
            logging.warning("spaCy not available, using simple key phrase extraction")

            # Tokenize
            words = []
            for sentence in sentences:
                words.extend(sentence.lower().split())

            # Count word frequencies (not used by the bigram ranking below)
            word_counts = Counter(words)

            # Get common bigrams
            bigrams = []
            for i in range(len(words) - 1):
                bigrams.append(words[i] + " " + words[i + 1])
            bigram_counts = Counter(bigrams)

            # Return top phrases
            return [phrase for phrase, count in bigram_counts.most_common(20)]
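

# A minimal usage sketch (illustration only): "example.txt" is a placeholder
# path, and wiring the callbacks to print() is just one possible choice.
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    processor = TextProcessor()
    processor.on_status = lambda s: print(f"[status] {s}")
    processor.on_progress = lambda p: print(f"[progress] {p}%")

    result = processor.process("example.txt")
    print(f"Sentences: {len(result['sentences'])}")
    print(f"Key phrases: {result['key_phrases'][:10]}")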