Sentiment Analysis Project Report
Amr Khaled (21100834), Amira Ali (21100789)
1. Introduction:
Sentiment analysis is the process of analyzing text data to determine the
sentiment or emotion it conveys, such as positive, negative, or neutral. This
project uses a transformer-based architecture, specifically DistilBERT, to
classify tweets into three categories: positive, neutral, and negative.
2. Dataset Details:
Columns:
• id: Unique identifier for each tweet.
• label: Sentiment label (0 for neutral, 1 for positive, -1 for negative).
• tweet: The text of the tweet.
3. Methodology
3.1. Text Preprocessing
• Removed URLs, HTML tags, and special characters using regular expressions.
• Stripped extra whitespace.
• Converted all text to lowercase.
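The cleaning steps above could be sketched as follows. This is a minimal illustration, not the project's actual code: the function name clean_tweet and the exact regular expressions are assumptions, and "special characters" is taken here to mean anything other than letters, digits, and whitespace.

```python
import re

# Assumed patterns; the report does not list the exact expressions used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) links and bare www links
TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <b> or <br/>
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")         # anything but lowercase letters, digits, whitespace
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, HTML tags, special characters, extra spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(clean_tweet("Loving the <b>NEW</b> phone!!! https://t.co/abc123  #happy"))
```

Lowercasing is done first so that the special-character pattern only needs to allow a-z. Note that this sketch also drops the "#" and "@" symbols while keeping the words that follow them.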
3.2. Tokenization
• Tokenized the text with DistilBertTokenizer, padding or truncating each tweet
to a fixed length of 128 tokens.
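To make the effect of padding and truncation concrete, the sketch below reproduces it in plain Python on a list of token ids. It is an illustration only: in the project this work is done by DistilBertTokenizer itself (in the Hugging Face library, typically by passing padding="max_length", truncation=True, max_length=128). The helper name pad_or_truncate is hypothetical, and the special-token ids are assumed to be those of the uncased BERT vocabulary.

```python
# Assumed ids of [PAD], [CLS] and [SEP] in the uncased BERT vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def pad_or_truncate(token_ids, max_length=128):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    body = token_ids[: max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks real tokens
    padding = max_length - len(input_ids)
    # 0 in the mask tells the model to ignore the padded positions.
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding


ids, mask = pad_or_truncate([7592, 2088])     # a short, two-token tweet
print(len(ids), sum(mask))
```

Short tweets are filled with padding ids up to 128 positions, while long tweets lose their trailing tokens; in both cases every example has the same shape, which is what allows them to be batched.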
3.3. Data Splitting
• Split the dataset into training (70%) and testing (30%) sets using train_test_split().
• Re-encoded the labels with LabelEncoder, which maps the original values (-1, 0, 1)
to consecutive class indices (0, 1, 2).
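A minimal sketch of the split and label encoding with scikit-learn is shown below. The tweets and labels are placeholders, and random_state=42 is an assumption added only to make the example repeatable; the report does not state a seed.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder data standing in for the real tweet and label columns.
tweets = [f"tweet {i}" for i in range(10)]
labels = [-1, 0, 1, 0, 1, -1, 1, 0, -1, 0]

# 70% training, 30% testing.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    tweets, labels, test_size=0.3, random_state=42
)

# Fit on the full label column so both splits share one mapping.
encoder = LabelEncoder()
encoder.fit(labels)
train_y = encoder.transform(train_labels)
test_y = encoder.transform(test_labels)

print(len(train_texts), len(test_texts), list(encoder.classes_))
```

LabelEncoder sorts the distinct label values, so -1, 0, and 1 become 0, 1, and 2. Predictions can be mapped back to the original labels with encoder.inverse_transform().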
3.4. Model Selection
• Selected DistilBERT (distilbert-base-uncased) for its efficiency and accuracy in text
classification tasks.
• Added a classification head with three output neurons for multi-class classification.
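With the Hugging Face transformers library, a three-class DistilBERT classifier could be built roughly as below. This is a sketch under assumptions: the helper build_classifier and its pretrained flag are hypothetical, added so the architecture can also be created without downloading weights.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification


def build_classifier(num_labels=3, pretrained=True):
    """Return DistilBERT with a classification head of `num_labels` outputs."""
    if pretrained:
        # Downloads distilbert-base-uncased on first use; the encoder weights
        # are pre-trained, while the classification head starts untrained.
        return DistilBertForSequenceClassification.from_pretrained(
            "distilbert-base-uncased", num_labels=num_labels
        )
    # Same architecture with random weights (no download), useful for quick checks.
    return DistilBertForSequenceClassification(DistilBertConfig(num_labels=num_labels))
```

The num_labels argument sets the size of the final linear layer, so the model emits one logit per sentiment class.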
3.5. Training
• Optimizer: AdamW with a learning rate of 2e-5.
• Batch Size: 16.
• Epochs: 3.
• Fine-tuned the pre-trained DistilBERT model on the training set.
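The settings above could be wired into a PyTorch training loop along these lines. This is a sketch, not the project's actual training script: the function name fine_tune is hypothetical, the inputs are assumed to be already tokenized tensors, and the model is assumed to be a Hugging Face sequence-classification model that returns a loss when labels are passed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def fine_tune(model, input_ids, attention_mask, labels,
              epochs=3, batch_size=16, lr=2e-5):
    """Fine-tune `model` and return the mean training loss of each epoch."""
    dataset = TensorDataset(input_ids, attention_mask, labels)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    epoch_losses = []
    for _ in range(epochs):
        running = 0.0
        for ids, mask, y in loader:
            optimizer.zero_grad()
            # Passing labels makes the model compute the cross-entropy loss itself.
            out = model(input_ids=ids, attention_mask=mask, labels=y)
            out.loss.backward()
            optimizer.step()
            running += out.loss.item()
        epoch_losses.append(running / len(loader))
    return epoch_losses
```

The labels passed in should be the encoded class indices (0, 1, 2) as a tensor of integer type. A full run would normally also move the model and batches to a GPU and evaluate on the test set after training.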
Conclusion:
This project built a sentiment analysis model using DistilBERT that achieved
an accuracy of 85%. With further refinement and dataset augmentation, the
model’s performance could be improved for real-world applications.