This project conducts a multi-dimensional evaluation of Google's Gemma 3 1B model through five distinct assessment streams, validated using the larger Gemma 3 27B model as a reference.
We're investigating the capabilities and limitations of the Gemma 3 1B model (approximately 815MB on disk) through a comprehensive evaluation framework that tests different aspects of its performance.
- Knowledge Compression: How effectively can a 1B model retain and utilize complex knowledge?
- Hallucination Resistance: How well does it handle false premises and uncertain information?
- Problem-Solving: Can it tackle mathematical and logical challenges effectively?
- Reasoning: How robust is its analytical and inferential capability?
- Consistency: Does it maintain stable outputs across different phrasings of the same query?
```
.
├── streams/
│   ├── knowledge/                # Factual knowledge tests
│   ├── hallucination/            # False premise detection
│   ├── problem_solving/          # Mathematical and logical problems
│   ├── reasoning/                # Analysis and inference
│   └── consistency/              # Answer stability tests
├── query_gemma.py                # Stream-aware query script
├── validate_answers.py           # Multi-stream validation
├── generate_final_assessment.py  # Cross-stream analysis
└── requirements.txt              # Python dependencies
```
- **Knowledge Compression Stream**
  - Tests factual knowledge across diverse domains
  - Evaluates depth and breadth of understanding
  - Measures knowledge compression efficiency
- **Hallucination Resistance Stream**
  - Presents questions with false premises
  - Tests ability to detect and reject misinformation
  - Evaluates uncertainty handling
- **Problem-Solving Stream**
  - Mathematical reasoning challenges
  - Logic puzzles and algorithmic problems
  - Step-by-step solution evaluation
- **Reasoning Stream**
  - Complex analytical scenarios
  - Causal and inferential reasoning
  - Systems thinking and pattern recognition
- **Consistency Stream**
  - Paired questions testing the same knowledge
  - Cross-referenced answer stability
  - Evaluates contextual awareness
- Uses Ollama to run Gemma 3 1B locally
- Stream-specific prompting strategies
- Temperature set to 0 for deterministic answers
- Structured JSON output format
- Efficient batch processing
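As a rough illustration of the querying setup, here is a hypothetical sketch of how `query_gemma.py` might build a request for Ollama's local `/api/generate` endpoint. The function name, model tag, and prompt are assumptions for illustration, not taken from the actual script; only the payload shape follows Ollama's documented API.

```python
import json

# Default local Ollama endpoint (assumption: standard install, default port).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "gemma3:1b") -> dict:
    """Build one deterministic, JSON-formatted query for the Ollama API."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,               # return a single complete response
        "format": "json",              # request structured JSON output
        "options": {"temperature": 0}  # greedy decoding for determinism
    }

payload = build_request("What is the boiling point of water at sea level?")
print(json.dumps(payload, indent=2))
```

The payload would then be POSTed to `OLLAMA_URL` with a `Content-Type: application/json` header; batching is just a loop over prompts reusing the same connection.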
- Gemma 3 27B as evaluator via OpenRouter
- Stream-specific evaluation metrics
- Detailed qualitative feedback
- Comprehensive statistical analysis
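The validation step can be sketched as follows. This is a hypothetical example of how `validate_answers.py` might build a grading request for OpenRouter's chat-completions API; the model slug, rubric wording, and function name are illustrative assumptions, not taken from the actual script.

```python
import json
import os

# OpenRouter's OpenAI-compatible chat-completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_validation_request(question: str, answer: str, stream: str = "knowledge") -> dict:
    """Build a grading request that asks the 27B judge for 0-10 scores as JSON."""
    rubric = (
        f"You are grading a small model's answer for the '{stream}' stream. "
        "Score accuracy, reasoning, and completeness from 0 to 10 "
        "and reply as a JSON object."
    )
    return {
        "model": "google/gemma-3-27b-it",  # assumed slug for the 27B judge
        "messages": [
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    }

# The API key is read from the environment, as in the usage instructions below.
headers = {"Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}"}
req = build_validation_request("What causes tides?", "Mainly the Moon's gravity.")
print(json.dumps(req, indent=2))
```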
- Accuracy (0-10)
- Reasoning (0-10)
- Completeness (0-10)
- **Knowledge Stream**
  - Factual correctness
  - Source alignment
- **Hallucination Stream**
  - Uncertainty awareness
  - False premise detection
  - Invention score
- **Problem-Solving Stream**
  - Methodology
  - Step clarity
  - Solution correctness
- **Reasoning Stream**
  - Logical coherence
  - Analysis depth
  - Assumption awareness
- **Consistency Stream**
  - Fact stability
  - Context awareness
  - Uncertainty disclosure
- Per-stream statistical analysis
- Cross-stream performance metrics
- Strength/weakness identification
- Practical usage recommendations
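The cross-stream aggregation can be sketched in a few lines. This is a hypothetical example of the kind of computation `generate_final_assessment.py` might perform; the per-stream scores below are made up for illustration and do not come from the actual results.

```python
from statistics import mean

# Illustrative per-stream scores on the common 0-10 dimensions (made-up data).
stream_scores = {
    "knowledge":       {"accuracy": 9.0, "reasoning": 8.8, "completeness": 8.5},
    "hallucination":   {"accuracy": 8.2, "reasoning": 8.0, "completeness": 7.9},
    "problem_solving": {"accuracy": 6.5, "reasoning": 7.4, "completeness": 7.0},
    "reasoning":       {"accuracy": 8.9, "reasoning": 9.0, "completeness": 8.6},
    "consistency":     {"accuracy": 9.2, "reasoning": 8.7, "completeness": 8.8},
}

# Cross-stream average for each metric, and one overall score per stream.
per_metric = {
    m: round(mean(s[m] for s in stream_scores.values()), 2)
    for m in ("accuracy", "reasoning", "completeness")
}
per_stream = {name: round(mean(s.values()), 2) for name, s in stream_scores.items()}

print(per_metric)
print(per_stream)
```

Sorting `per_stream` then surfaces the strongest and weakest streams, which is the basis for the strength/weakness identification above.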
- Python 3.x
- Ollama with Gemma 3 1B model installed
- OpenRouter API key (for validation)
```bash
pip install -r requirements.txt
```

- Query a specific stream:

```bash
python3 query_gemma.py --stream knowledge
```

- Query all streams:

```bash
python3 query_gemma.py --all
```

- Validate a specific stream:

```bash
export OPENROUTER_API_KEY='your_key_here'
python3 validate_answers.py --stream knowledge
```

- Validate all streams:

```bash
python3 validate_answers.py --all
```

- Generate the final assessment:

```bash
python3 generate_final_assessment.py
```

```
streams/
├── knowledge/
│   ├── answers/               # Raw model responses
│   └── validated/             # Validation results
│       └── stream_assessment.json
├── hallucination/
├── problem_solving/
├── reasoning/
└── consistency/
final_assessment.json          # Cross-stream analysis
```
- **Capability Profile**
  - Strengths and weaknesses across different tasks
  - Task-specific performance characteristics
  - Reliability in different contexts
- **Operational Guidelines**
  - Best-fit use cases
  - Task-specific confidence levels
  - Resource optimization strategies
- **Model Understanding**
  - Knowledge compression patterns
  - Reasoning capabilities
  - Limitation boundaries
- **Practical Applications**
  - Edge deployment recommendations
  - Task suitability guidelines
  - Integration best practices
- Stream-specific prompt engineering
- Multi-dimensional evaluation metrics
- Cross-stream analysis methodology
- Statistical validation framework
- **Scientific & Technical Knowledge**
  - Biology (9.0/10)
  - Physical Laws (10/10)
  - Basic Mathematics (10/10)
  - Scientific Reasoning (9.0/10)
- **Consistency & Clarity**
  - Fact Stability (9.2/10)
  - Context Awareness (8.5/10)
  - Step-by-Step Clarity (8.5/10)
  - Logical Coherence (8.9/10)
- **Core Reasoning**
  - Deductive Reasoning (9.0/10)
  - Logical Flow (8.9/10)
  - Assumption Awareness (7.9/10)
- **Complex Problem Solving**
  - Solution Correctness (5.2/10)
  - Multi-step Calculations
  - Advanced Proofs
- **Historical & Cultural**
  - Ancient History (6.0/10)
  - Cultural Nuances
  - Historical Causation
- **Edge Cases**
  - Novel Problems
  - High-stakes Decisions
  - Interdisciplinary Analysis
For detailed evaluation results and deployment recommendations, see ASSESSMENT.md.
- **Gemma 3 1B**
  - Memory: 861MB (INT4) to 4GB (FP32)
  - Suitable for edge deployment
  - Excellent for high-throughput, well-defined tasks
- **Gemma 3 27B**
  - Memory: 19.9GB (INT4) to 108GB (FP32)
  - Requires substantial compute
  - Better for complex, quality-critical tasks
This comprehensive evaluation provides clear guidelines for deploying the Gemma 3 1B model effectively:
- **Ideal Use Cases**
  - Scientific documentation
  - Technical knowledge bases
  - Fact verification systems
  - Step-by-step guides
- **Caution Areas**
  - Complex problem solving
  - Historical analysis
  - Novel/edge cases
  - High-stakes decisions
Understanding these performance boundaries helps optimize deployment strategies and guides effective model selection.
This project is open source and available under the MIT License.
