
Introduction

Output Drift in Financial LLMs Workshop

Welcome to the Output Drift in Financial LLMs Workshop! This hands-on workshop teaches you how to measure and analyze non-determinism in large language model (LLM) outputs for financial applications.

Why This Matters

Financial institutions deploying AI systems must ensure:

  • Regulatory Compliance: Consistent, auditable AI decisions
  • Risk Management: Predictable behavior in production
  • Trust & Reliability: Stakeholder confidence in AI-driven recommendations

This workshop is based on peer-reviewed research demonstrating that even at temperature=0.0, LLMs exhibit output drift (up to 35% variance on some tasks), which threatens compliance workflows.

What You'll Learn

By the end of this workshop, you will:

  • Understand output drift and its implications for financial AI systems
  • Set up and run reproducible LLM experiments across multiple providers
  • Measure drift using industry-standard metrics (consistency, Jaccard similarity, schema violations)
  • Analyze cross-provider reliability patterns
  • Implement best practices for deterministic AI deployments
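Two of these metrics can be sketched in a few lines of Python. The function names below are illustrative, not the framework's actual API: consistency rate is the fraction of repeated runs that match the most common output, and Jaccard similarity compares the token sets of two outputs.

```python
from collections import Counter

def consistency_rate(outputs):
    """Fraction of runs that produced the modal (most common) output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

def jaccard_similarity(a, b):
    """Token-level Jaccard similarity between two output strings."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Three runs of the same prompt: two identical, one with a casing difference.
runs = [
    "SELECT sum(amount) FROM trades",
    "SELECT sum(amount) FROM trades",
    "SELECT SUM(amount) FROM trades",
]
print(consistency_rate(runs))                    # 2 of 3 runs match the modal output
print(jaccard_similarity(runs[0], runs[2]))      # 3 shared tokens / 5 total tokens
```

A fully deterministic model would score 1.0 on both; Lab 4 covers how the framework reports these numbers across tasks.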

Tip

This workshop is hands-on and collaborative. We encourage you to experiment, ask questions, and share your findings with other participants. The framework is designed to be extensible—feel free to add your own tasks and providers!

Workshop Structure

Lab                                   Description                                          Duration
Lab 0: Workshop Pre-work              Install prerequisites and set up your environment    15 min
Lab 1: Understanding Output Drift     Learn the theory and see real examples of drift      20 min
Lab 2: Setting Up Your Environment    Configure API keys and run environment tests         15 min
Lab 3: Running Your First Experiment  Execute experiments and understand the framework     30 min
Lab 4: Analyzing Drift Metrics        Interpret results and generate visualizations        25 min
Lab 5: Cross-Provider Testing         Compare reliability across different AI providers    30 min
Lab 6: Extending the Framework        Add custom tasks and integrate with your workflows   30 min
Lab 7: Replayable Financial Agents    Run agent benchmarks from the ICLR 2026 paper        30 min

Total Duration: Approximately 3-3.5 hours

Research Foundation

This workshop is based on two peer-reviewed papers:

"Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents" Accepted at the ICLR 2026 FinAI Workshop (The 2nd ICLR Workshop on Advances in Financial AI) | arXiv:2601.15322

"LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows" Presented at the ACM ICAIF 2025 AI4F Workshop | arXiv:2511.07585

Key Findings (v1, Output Drift):

  • Even at temperature=0.0, frontier models exhibit 5.5-35% output variance
  • 7-20B models (Granite-3-8B, Qwen2.5-7B, GPT-OSS-20B) achieve 100% determinism at T=0.0
  • RAG tasks show the highest drift (56.25% consistency at temperature=0.2)
  • Structured output tasks (SQL, summarization) maintain better determinism

Key Findings (v2, Replayable Agents, 4,705 runs):

  • Decision determinism and accuracy are not detectably correlated (r = -0.11, p = 0.63)
  • Small models achieve high determinism via pattern matching, not reasoning
  • Frontier models show "same conclusion, different reasoning" (decision determinism > signature determinism)
  • No model simultaneously achieves both high determinism and high accuracy

Community Validation (Paul Merrison, FINOS):

  • Determinism is model-specific, not size-based
  • Gemma2-9B: 100% deterministic (new Tier 1 candidate)
  • Mistral-7B: task-dependent (33% RAG, 100% SQL)
  • Architecture and training matter more than parameter count

Prerequisites

Required:

  • Python 3.11+
  • Basic command line proficiency
  • Understanding of APIs and environment variables

Recommended:

  • Familiarity with LLMs and prompt engineering
  • Basic knowledge of financial concepts
  • Experience with data analysis (pandas, visualization)

API Access (at least one):

  • Ollama (free, local)
  • IBM watsonx.ai (trial available)
  • OpenAI, Anthropic, or other providers
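Provider credentials are supplied through environment variables. The sketch below is a pre-flight check you could adapt; the variable names are common conventions, not guaranteed to match what the labs use, so confirm the exact names in Lab 2:

```python
import os

# Example variable names only -- check Lab 2 for the exact names your
# provider setup expects.
CANDIDATES = {
    "Ollama": [],                      # local server, no API key required
    "watsonx.ai": ["WATSONX_APIKEY"],
    "OpenAI": ["OPENAI_API_KEY"],
    "Anthropic": ["ANTHROPIC_API_KEY"],
}

def available_providers(env=os.environ):
    """Return providers whose required variables are all set.

    Ollama always qualifies because it needs no credentials.
    """
    return [name for name, keys in CANDIDATES.items()
            if all(env.get(k) for k in keys)]

print(available_providers())
```

Running this before Lab 2 tells you which providers your shell session can already reach; at least one entry besides Ollama means you can run the cross-provider labs.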

Target Audience

This workshop is designed for:

  • AI/ML Engineers building production LLM systems
  • Risk & Compliance Officers evaluating AI deployments
  • Financial Technologists integrating AI into workflows
  • Researchers studying LLM reliability and non-determinism
  • Product Managers planning AI-powered financial products

Getting Help

If you encounter issues or have questions:

  1. Check the Troubleshooting Guide
  2. Review the API Reference
  3. Ask workshop facilitators or teaching assistants
  4. Open an Issue on GitHub
  5. Submit a Pull Request with improvements

Repository Structure

output-drift-financial-llms/
├── run_evaluation.py       # Main experiment orchestrator (v1 Output Drift)
├── make_tables.py          # Generate LaTeX tables from results
├── plot_results.py         # Generate drift visualizations
├── COMMUNITY_FINDINGS.md   # Independent validation results
├── docs/                   # Workshop documentation (labs 0-7)
├── harness/                # Core framework code
│   ├── deterministic_retriever.py
│   ├── task_definitions.py
│   └── cross_provider_validation.py
├── providers/              # LLM providers (watsonx, anthropic, gemini)
├── econometrics/           # Replayable Agents (v2) - benchmarks & metrics
│   ├── benchmarks/         # 3 financial agent benchmarks (50 cases each)
│   └── agentic/            # Trajectory determinism & faithfulness metrics
├── scripts/                # Data fetching & utilities
├── data/                   # Test datasets & generators
├── examples/               # Sample audit trails
└── requirements.txt        # Python dependencies

Reproducibility & Citations

All experiments use release v0.1.0 (commit c19dac5) for reproducibility:

git clone https://github.com/ibm-client-engineering/output-drift-financial-llms
git checkout v0.1.0

If you use this framework in your research, please cite:

@inproceedings{khatchadourian2026replayable,
  title={Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents},
  author={Khatchadourian, Raffi},
  booktitle={The 2nd ICLR Workshop on Advances in Financial AI (FinAI)},
  year={2026},
  url={https://arxiv.org/abs/2601.15322}
}

@inproceedings{khatchadourian2025output,
  title={LLM Output Drift: Cross-Provider Validation \& Mitigation for Financial Workflows},
  author={Khatchadourian, Raffi and Franco, Rolando},
  booktitle={ACM International Conference on AI in Finance (ICAIF), AI4F Workshop},
  year={2025},
  url={https://arxiv.org/abs/2511.07585}
}

Replayable Agents: arXiv:2601.15322 | Output Drift: arXiv:2511.07585

License

This project is licensed under the MIT License. See LICENSE for details.

Contributors & Acknowledgments

This workshop and framework were developed by Raffi Khatchadourian and Rolando Franco of IBM Financial Services, in collaboration with researchers focused on responsible AI deployment in regulated industries.

Special thanks to the open-source community and the contributors who helped build and test this framework.


Ready to Begin?

Start with Lab 0: Workshop Pre-work to set up your environment!