This project implements advanced agentic systems for complex inference tasks, focusing on Deep Research capabilities and Self-Refining generation strategies.
The repository focuses on two main agentic workflows:
The first is a deep research agent that executes multi-step research tasks (code in deep_research_agent/).
- Core capabilities:
  - Automated information gathering and synthesis
  - Multi-step reasoning and planning
  - Integration with external tools (Search, Browser)
- Key components:
  - react_agent.py: Implementation of the ReAct (Reasoning + Acting) paradigm.
  - mcp_agents/: Modular Component Protocol (MCP) agents for extensible tool use.
  - graph/: Graph-based reasoning utilities.
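The ReAct paradigm named above alternates model steps (reasoning and actions) with tool observations. The following is a minimal sketch of that loop, not the code in react_agent.py: the `react_loop` function, the `Action: name[arg]` / `Final: answer` text format, and the scripted policy standing in for an LLM are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def react_loop(
    question: str,
    policy: Callable[[List[str]], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate policy steps with tool observations until a final answer."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        step = policy(transcript)  # hypothetical format, see lead-in
        transcript.append(step)
        if step.startswith("Final:"):
            return step[len("Final:"):].strip()
        if step.startswith("Action:"):
            # Parse "Action: tool_name[argument]".
            name, _, arg = step[len("Action:"):].strip().partition("[")
            name, arg = name.strip(), arg.rstrip("]")
            tool = tools.get(name)
            observation = tool(arg) if tool else f"unknown tool: {name}"
            transcript.append(f"Observation: {observation}")
    return "no answer within step budget"


def scripted_policy(transcript: List[str]) -> str:
    """Stand-in for an LLM call so the sketch runs without API keys."""
    if not any(line.startswith("Observation:") for line in transcript):
        return "Action: search[capital of France]"
    return "Final: Paris"


def fake_search(query: str) -> str:
    """Stand-in for a real search tool."""
    return "Paris is the capital of France."


answer = react_loop(
    "What is the capital of France?", scripted_policy, {"search": fake_search}
)
print(answer)
```

In a real agent the policy would be a model call that receives the transcript as context, and the step budget guards against loops that never reach a final answer.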
The second is a self-refining system that iteratively improves its own outputs through self-correction (code in self_refine/).
- Core capabilities:
  - Self-evaluation of generated content
  - Iterative refinement loops
  - Performance analysis on benchmarks (MMLU, Graph tasks)
- Key components:
  - self_refine.py: Main logic for the self-refining loop.
  - refine_modal.py: Modal integration for scalable execution.
  - analyze_accuracy.py: Tools for evaluating refinement performance.
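A self-refining loop of the kind described above can be sketched as generate, critique, refine, repeated until the critique passes or an iteration budget runs out. This is an illustrative sketch under assumptions, not the logic in self_refine.py: the `self_refine` function and its three callables are hypothetical, and the toy callables below replace model calls.

```python
from typing import Callable, Tuple


def self_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    refine: Callable[[str, str, str], str],
    max_iters: int = 3,
) -> Tuple[str, int]:
    """Return the final draft and the number of refinement rounds used."""
    draft = generate(prompt)
    for i in range(max_iters):
        ok, feedback = critique(prompt, draft)
        if ok:
            return draft, i
        draft = refine(prompt, draft, feedback)
    return draft, max_iters


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_generate(prompt: str) -> str:
    return "4 + 4 = 9"


def toy_critique(prompt: str, draft: str) -> Tuple[bool, str]:
    if draft.endswith("8"):
        return True, ""
    return False, "The sum is wrong; recompute 4 + 4."


def toy_refine(prompt: str, draft: str, feedback: str) -> str:
    return "4 + 4 = 8"


result, rounds = self_refine("What is 4 + 4?", toy_generate, toy_critique, toy_refine)
print(result, rounds)
```

Returning the round count alongside the draft makes it easy to analyze how often refinement was needed on a benchmark.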
The repository also includes reranking tools for evaluating and selecting the best generation from multiple candidates.
- Implements various scoring mechanisms:
  - Scalar Reward Models using Skywork/Reward-Llama
  - Pairwise Reward Models using LLM-Blender (PairRM)
  - MBR (Minimum Bayes Risk) decoding with BLEU and BERTScore
  - Log-probability analysis using Qwen models
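To illustrate MBR decoding from the list above: each candidate is scored by its average utility against the other candidates, and the candidate with the highest expected utility is selected. The sketch below swaps in a simple unigram F1 as the utility so it runs with the standard library only; the repository is described as using BLEU and BERTScore, and the function names here are assumptions.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(hyp: str, ref: str) -> float:
    """Unigram-overlap F1; a lightweight stand-in for BLEU or BERTScore."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: List[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> int:
    """Return the index of the candidate with the highest mean utility
    against all other candidates (ties go to the earliest index)."""
    best_idx, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


candidates = [
    "the cat sat on the mat",
    "the cat sat on the mat today",
    "a cat sat on the mat",
    "dogs bark loudly",
]
best = mbr_select(candidates)
print(candidates[best])
```

The selection needs no reference answer or reward model: the candidate pool itself acts as the set of pseudo-references, so an outlier such as the last candidate scores poorly.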
- Install Dependencies:
      pip install -r requirements.txt
- Environment Variables: Create a .env file with the following keys:
      OPENAI_API_KEY=your_key
      ANTHROPIC_API_KEY=your_key
      # Add other provider keys as needed
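Before running anything, it can help to confirm that the keys listed above are actually set. The following is a small standard-library sketch, assuming a plain `KEY=value` file format; `parse_env` and `missing_keys` are hypothetical helpers, not part of this repository, and how the project itself loads the .env file is not specified here.

```python
from typing import Dict, Iterable, List, Mapping

# Keys named in the setup instructions; extend for other providers.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(
    env: Mapping[str, str], required: Iterable[str] = REQUIRED_KEYS
) -> List[str]:
    """Return required keys that are absent or empty in env."""
    return [key for key in required if not env.get(key)]


sample = "OPENAI_API_KEY=your_key\n# Add other provider keys as needed\n"
env = parse_env(sample)
print(missing_keys(env))
```

To check a real file, pass the contents of `.env` to `parse_env`, or pass `os.environ` directly to `missing_keys` once the variables are exported.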
Navigate to the deep_research_agent directory and configure the agent in react_agent.yaml.
    python deep_research_agent/react_agent.py
To run the self-refinement experiments:
    python self_refine/self_refine.py --task mmlu --model qwen3-4b
To evaluate generated outputs using the reranking system:
    python rerank_outputs.py
This will process all_results_processed.json and compute scores for all candidates.
Use calculate_stats_reranking.py to generate statistical analysis and plots comparing different reranking strategies against gold-standard evaluations.
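One simple way to compare reranking strategies against gold-standard labels is top-1 accuracy: the fraction of prompts where a strategy's highest-scored candidate is marked correct. The sketch below assumes a hypothetical record layout (`gold_correct` and `scores` fields with made-up toy values); it does not reflect the actual schema of all_results_processed.json or the statistics computed by calculate_stats_reranking.py.

```python
from typing import Dict, List


def top1_accuracy(records: List[dict], strategy: str) -> float:
    """Fraction of records where the strategy's top-scored candidate is correct."""
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        scores = rec["scores"][strategy]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(rec["gold_correct"][best])
    return hits / len(records)


# Toy records in an assumed layout: one gold label and one score per candidate.
records = [
    {
        "gold_correct": [True, False],
        "scores": {"scalar_rm": [0.9, 0.1], "mbr": [0.2, 0.8]},
    },
    {
        "gold_correct": [False, True],
        "scores": {"scalar_rm": [0.3, 0.7], "mbr": [0.4, 0.6]},
    },
]

summary: Dict[str, float] = {
    name: top1_accuracy(records, name) for name in ("scalar_rm", "mbr")
}
print(summary)
```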