Gemma 3 1B Comprehensive Model Evaluation

This project conducts a multi-dimensional evaluation of Google's Gemma 3 1B model through five distinct assessment streams, validated using the larger Gemma 3 27B model as a reference.

Overview

We're investigating the capabilities and limitations of the Gemma 3 1B model (approximately 815MB on disk) through a comprehensive evaluation framework that tests different aspects of its performance.

Evaluation Focus

Knowledge Compression: How effectively can a 1B model retain and utilize complex knowledge?
Hallucination Resistance: How well does it handle false premises and uncertain information?
Problem-Solving: Can it tackle mathematical and logical challenges effectively?
Reasoning: How robust is its analytical and inferential capability?
Consistency: Does it maintain stable outputs across different phrasings of the same query?

Project Structure

.
├── streams/
│   ├── knowledge/           # Factual knowledge tests
│   ├── hallucination/       # False premise detection
│   ├── problem_solving/     # Mathematical and logical problems
│   ├── reasoning/           # Analysis and inference
│   └── consistency/         # Answer stability tests
├── query_gemma.py          # Stream-aware query script
├── validate_answers.py      # Multi-stream validation
├── generate_final_assessment.py  # Cross-stream analysis
└── requirements.txt         # Python dependencies

Evaluation Streams

1. Knowledge Stream

Tests factual knowledge across diverse domains
Evaluates depth and breadth of understanding
Measures knowledge compression efficiency

2. Hallucination Stream

Presents questions with false premises
Tests ability to detect and reject misinformation
Evaluates uncertainty handling

3. Problem-Solving Stream

Mathematical reasoning challenges
Logic puzzles and algorithmic problems
Step-by-step solution evaluation

4. Reasoning Stream

Complex analytical scenarios
Causal and inferential reasoning
System thinking and pattern recognition

5. Consistency Stream

Paired questions testing same knowledge
Cross-reference answer stability
Evaluates contextual awareness

Evaluation Framework

Query Phase

Uses Ollama to run Gemma 3 1B locally
Stream-specific prompting strategies
Temperature set to 0 for deterministic answers
Structured JSON output format
Efficient batch processing

Validation Phase

Gemma 3 27B as evaluator via OpenRouter
Stream-specific evaluation metrics
Detailed qualitative feedback
Comprehensive statistical analysis

Assessment Metrics

Common Metrics (All Streams)

Accuracy (0-10)
Reasoning (0-10)
Completeness (0-10)

Stream-Specific Metrics

Knowledge Stream
- Factual correctness
- Source alignment
Hallucination Stream
- Uncertainty awareness
- False premise detection
- Invention score
Problem-Solving Stream
- Methodology
- Step clarity
- Solution correctness
Reasoning Stream
- Logical coherence
- Analysis depth
- Assumption awareness
Consistency Stream
- Fact stability
- Context awareness
- Uncertainty disclosure

Analysis Framework

Per-stream statistical analysis
Cross-stream performance metrics
Strength/weakness identification
Practical usage recommendations

Setup and Usage

Prerequisites

Python 3.x
Ollama with Gemma 3 1B model installed
OpenRouter API key (for validation)

Installation

pip install -r requirements.txt

Running Evaluations

Query specific stream:

python3 query_gemma.py --stream knowledge

Query all streams:

python3 query_gemma.py --all

Validate specific stream:

export OPENROUTER_API_KEY='your_key_here'
python3 validate_answers.py --stream knowledge

Validate all streams:

python3 validate_answers.py --all

Generate final assessment:

python3 generate_final_assessment.py

Output Structure

streams/
├── knowledge/
│   ├── answers/          # Raw model responses
│   └── validated/        # Validation results
│       └── stream_assessment.json
├── hallucination/
├── problem_solving/
├── reasoning/
└── consistency/
final_assessment.json     # Cross-stream analysis

Expected Insights

Capability Profile
- Strengths and weaknesses across different tasks
- Task-specific performance characteristics
- Reliability in different contexts
Operational Guidelines
- Best-fit use cases
- Task-specific confidence levels
- Resource optimization strategies
Model Understanding
- Knowledge compression patterns
- Reasoning capabilities
- Limitation boundaries
Practical Applications
- Edge deployment recommendations
- Task suitability guidelines
- Integration best practices

Technical Details

Stream-specific prompt engineering
Multi-dimensional evaluation metrics
Cross-stream analysis methodology
Statistical validation framework

Key Findings

Strong Performance Areas (>8.5/10)

Scientific & Technical Knowledge
- Biology (9.0/10)
- Physical Laws (10/10)
- Basic Mathematics (10/10)
- Scientific Reasoning (9.0/10)
Consistency & Clarity
- Fact Stability (9.2/10)
- Context Awareness (8.5/10)
- Step-by-Step Clarity (8.5/10)
- Logical Coherence (8.9/10)
Core Reasoning
- Deductive Reasoning (9.0/10)
- Logical Flow (8.9/10)
- Assumption Awareness (7.9/10)

Areas Needing Larger Model (27B)

Complex Problem Solving
- Solution Correctness (5.2/10)
- Multi-step Calculations
- Advanced Proofs
Historical & Cultural
- Ancient History (6.0/10)
- Cultural Nuances
- Historical Causation
Edge Cases
- Novel Problems
- High-stakes Decisions
- Interdisciplinary Analysis

For detailed evaluation results and deployment recommendations, see ASSESSMENT.md.

Resource Requirements

1B Model

Memory: 861MB (INT4) to 4GB (FP32)
Suitable for edge deployment
Excellent for high-throughput, well-defined tasks

27B Model

Memory: 19.9GB (INT4) to 108GB (FP32)
Requires substantial compute
Better for complex, quality-critical tasks

Implications

This comprehensive evaluation provides clear guidelines for deploying the Gemma 3 1B model effectively:

Ideal Use Cases
- Scientific documentation
- Technical knowledge bases
- Fact verification systems
- Step-by-step guides
Caution Areas
- Complex problem solving
- Historical analysis
- Novel/edge cases
- High-stakes decisions

Understanding these performance boundaries helps optimize deployment strategies and guides effective model selection.

License

This project is open source and available under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
bin		bin
streams		streams
visualization		visualization
.gitignore		.gitignore
ASSESSMENT.md		ASSESSMENT.md
README.md		README.md
gemma3.png		gemma3.png
generate_final_assessment.py		generate_final_assessment.py
pyvenv.cfg		pyvenv.cfg
query_gemma.py		query_gemma.py
query_gemma_old.py		query_gemma_old.py
requirements.txt		requirements.txt
validate_answers.py		validate_answers.py
validate_answers_old.py		validate_answers_old.py

u1i/gemma-3-1b-eval

Folders and files

Latest commit

History

Repository files navigation

Gemma 3 1B Comprehensive Model Evaluation

Overview

Evaluation Focus

Project Structure

Evaluation Streams

1. Knowledge Stream

2. Hallucination Stream

3. Problem-Solving Stream

4. Reasoning Stream

5. Consistency Stream

Evaluation Framework

Query Phase

Validation Phase

Assessment Metrics

Common Metrics (All Streams)

Stream-Specific Metrics

Analysis Framework

Setup and Usage

Prerequisites

Installation

Running Evaluations

Output Structure

Expected Insights

Technical Details

Key Findings

Strong Performance Areas (>8.5/10)

Areas Needing Larger Model (27B)

Resource Requirements

1B Model

27B Model

Implications

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages