Lab 1: Understanding Output Drift
Overview
In this lab, you'll learn what output drift is and why it matters for financial AI systems, and you'll see examples of non-deterministic behavior in large language models.
Duration: ~20 minutes
Learning Objectives
By the end of this lab, you will:
- Understand what output drift is and how it differs from data drift
- Learn why temperature=0.0 doesn't guarantee determinism
- See real examples of drift in financial tasks
- Understand the regulatory implications for financial services
- Know which tasks are most susceptible to drift
What is Output Drift?
Output drift is variation in an LLM's outputs across repeated queries with identical inputs and settings. Even at temperature 0.0, which is nominally deterministic, a model can return different responses to the same prompt.
Why Does This Happen?
Several factors contribute to output drift:
- Non-deterministic Operations: GPU floating-point arithmetic, parallel processing
- Model Updates: Provider-side model changes that ship without a visible version change
- Infrastructure Variability: Load balancing, server selection
- Sampling Strategies: Implementations handle temperature 0.0 differently, so decoding details vary between inference stacks
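The first factor in the list above can be illustrated without a GPU. The sketch below shows that floating-point addition is not associative, so two reduction orders over the same numbers can round differently. The token labels and logit values are hypothetical and chosen only to produce a near-tie; real inference kernels are far more complex than this toy.

```python
# Floating-point addition is not associative: regrouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3
sum_left = (a + b) + c    # one reduction order
sum_right = a + (b + c)   # another order over the same inputs

print(sum_left == sum_right)      # False
print(abs(sum_left - sum_right))  # tiny, but not zero


def greedy_pick(logits):
    """Return the token with the largest logit (first one wins on a tie)."""
    return max(logits, key=logits.get)


# Hypothetical near-tie: the "APPROVE" logit differs between runs only by
# the rounding error above, which is enough to flip the greedy choice.
picked_left = greedy_pick({"DENY": 0.6, "APPROVE": sum_left})
picked_right = greedy_pick({"DENY": 0.6, "APPROVE": sum_right})
print(picked_left, picked_right)
```

Once a single token flips, every later token is conditioned on a different prefix, so a rounding-level difference can grow into a visibly different response.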
Drift vs. Data Drift
| Concept | Definition | Scope |
|---|---|---|
| Output Drift | Inconsistent model responses for identical inputs | Model behavior |
| Data Drift | Changes in input data distribution over time | Input data |
This workshop focuses on output drift: inconsistency in the model's own behavior, not in its input data.
Financial Impact: Real-World Scenarios
Scenario 1: Loan Approval Recommendations
Input: "Analyze credit risk for applicant with 680 credit score, $75K income,
20% debt-to-income ratio. Recommend approval decision."
Run 1 (temp=0.0): "APPROVE - Low risk profile"
Run 2 (temp=0.0): "DENY - Moderate risk, recommend manual review"
Run 3 (temp=0.0): "APPROVE with conditions - Reduce credit limit to $10K"
Regulatory Risk
Inconsistent decisions can violate fair lending and consumer credit laws (ECOA, FCRA) and lead to:
- Discrimination claims
- Regulatory fines
- Reputational damage
- Loss of consumer trust
Scenario 2: Financial Document Analysis
Input: SEC 10-K filing, Question: "What is the company's total debt?"
Run 1: "$2.4 billion"
Run 2: "$2.4B in long-term debt, excluding short-term obligations"
Run 3: "Total debt: $2.4 billion (page 42, footnote 7)"
Issue: All three answers are factually correct, but the inconsistent formatting breaks downstream automation.
Scenario 3: Regulatory Compliance Queries
Input: "Is this transaction reportable under FinCEN SAR requirements?"
Run 1: "Yes, meets threshold for suspicious activity reporting"
Run 2: "Insufficient information to determine. Request additional details."
Run 3: "No, transaction appears routine"
Compliance Failure
Missed Suspicious Activity Reports (SARs) can result in:
- Multi-million dollar fines
- Criminal liability
- License revocation
Research Findings: By the Numbers
Our research quantified drift across multiple dimensions using 480 total runs (n=16 concurrent runs per condition):
Overall Drift Rates (Temperature = 0.0)
| Model | Size | Consistency | Tier | Compliance Status |
|---|---|---|---|---|
| Qwen2.5-7B | 7B | 100% | Tier 1 | ✅ Audit-ready |
| IBM Granite-3-8B | 8B | 100% | Tier 1 | ✅ Audit-ready |
| Meta Llama-3.3-70B | 70B | 56-100% | Tier 2 | ⚠️ Task-specific |
| Mistral Medium | 40B | 56-100% | Tier 2 | ⚠️ Task-specific |
| GPT-OSS-120B | 120B | 12.5% [CI: 3.5–36.0%] | Tier 3 | ❌ Non-compliant |
Counterintuitive finding: the 7-20B models tested were 100% consistent, while the 120B model was consistent in only 12.5% of runs!
Understanding Statistical Notation
Throughout this workshop, we report 95% Confidence Intervals (CI) for our findings. For example, "12.5% [CI: 3.5–36.0%]" means we measured 12.5% consistency, and the range from 3.5% to 36.0% covers the true values that are plausible given that measurement at the 95% confidence level.
All Tier 1 vs Tier 3 comparisons showed p < 0.0001, meaning differences this large are highly statistically significant and very unlikely to arise by chance alone.
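To make the notation concrete, here is a minimal sketch of one common way to compute a 95% interval for a proportion, the Wilson score interval. It assumes the 12.5% figure corresponds to 2 consistent runs out of 16; the workshop does not state which interval method was used, so treat the method and the function name `wilson_interval` as assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 12.5% consistency means 2 of 16 runs.
low, high = wilson_interval(successes=2, n=16)
print(f"{2 / 16:.1%} [CI: {low:.1%}-{high:.1%}]")
```

Under these assumptions the result matches the reported 3.5% to 36.0% range. The interval is wide because 16 runs is a small sample; more runs per condition would narrow it.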
Drift by Task Type (Temperature = 0.0)
| Task | Consistency | Why? |
|---|---|---|
| SQL Generation | 100% | Structured output, deterministic syntax |
| Summarization | 100% | Well-defined task, narrow output space |
| RAG (Text-to-SQL) | 93.75% | Retrieval adds complexity |
| RAG (General) | 75-87.5% | Context-dependent, broader output space |
Impact of Temperature
At temperature = 0.2 (common in production):
| Task | Consistency | Mean Drift | Factual Drift Range |
|---|---|---|---|
| RAG | 56.25% | 0.081 | 0.000 - 0.375 |
| SQL | 100% | 0.000 | 0.000 |
| Summarization | 100% | 0.000 | 0.000 |
Key Takeaway
Even a small temperature increase (0.0 → 0.2) can roughly double drift rates for retrieval-augmented tasks (RAG consistency falls from 75-87.5% at T=0.0 to 56.25% at T=0.2).
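The sensitivity to temperature follows from how sampling works. The sketch below uses the textbook temperature-scaled softmax to show how much probability a close runner-up token receives at different temperatures. The logit values are hypothetical, and production inference stacks may apply further steps (top-p, top-k) that are not modeled here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to next-token probabilities at a given temperature."""
    if temperature == 0.0:
        # Limit case: greedy decoding puts all probability on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens; the top two are close.
logits = [2.0, 1.9, 0.5]
probs_by_temperature = {
    t: softmax_with_temperature(logits, t) for t in (0.0, 0.2, 1.0)
}
for t, probs in probs_by_temperature.items():
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

With these made-up logits, the runner-up token goes from zero probability at T=0.0 to more than a third at T=0.2. Tasks with many near-tied continuations, such as open-ended RAG answers, are therefore more exposed than tasks with one dominant continuation.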
Visualizing Drift
Example: Consistency Across Model Tiers
Tier Classification (16 concurrent runs, temp=0.0)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Tier 1 (7-20B):
Qwen2.5-7B ████████████████████ 100% ✅
Granite-3-8B ████████████████████ 100% ✅
GPT-OSS-20B ████████████████████ 100% ✅
Tier 2 (40-70B):
Llama-3.3-70B ████████████████ 80% △
Mistral Medium ████████████████ 80% △
Tier 3 (120B+):
GPT-OSS-120B ██▌ 12.5% ❌
Example: Drift Heatmap (Temperature Sensitivity)
Task Type vs. Temperature
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
T=0.0 T=0.1 T=0.2 T=0.5
SQL 🟢100% 🟢100% 🟢100% 🟡95%
Summarize 🟢100% 🟢100% 🟢100% 🟡92%
RAG-SQL 🟢94% 🟡88% 🟡75% 🔴45%
RAG-General 🟡87% 🟡70% 🔴56% 🔴25%
🟢 Low drift 🟡 Moderate 🔴 High drift
Three Types of Drift
1. Syntactic Drift
Changes in formatting, whitespace, or presentation without semantic changes.
Run 1: "Total Assets: $2,400,000,000"
Run 2: "Total Assets: $2.4B"
Run 3: "Total Assets: 2.4 billion USD"
Impact: Parsing logic breaks and downstream automation fails
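One way to reduce the impact of syntactic drift is to normalize outputs before comparing or parsing them. The sketch below maps the three example runs to a single canonical value. The function name `parse_amount`, the regular expression, and the supported suffixes are illustrative assumptions, not part of the workshop's tooling.

```python
import re
from decimal import Decimal

# Hypothetical suffix table; extend it for the formats your model emits.
MULTIPLIERS = {
    "k": 10**3, "thousand": 10**3,
    "m": 10**6, "million": 10**6,
    "b": 10**9, "billion": 10**9,
}
AMOUNT_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(billion|million|thousand|[bmk])?\b",
    re.IGNORECASE,
)


def parse_amount(text):
    """Extract the first monetary amount in text as an exact Decimal."""
    match = AMOUNT_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no amount found in {text!r}")
    number, suffix = match.groups()
    value = Decimal(number.replace(",", ""))
    if suffix:
        value *= MULTIPLIERS[suffix.lower()]
    return value


runs = [
    "Total Assets: $2,400,000,000",
    "Total Assets: $2.4B",
    "Total Assets: 2.4 billion USD",
]
parsed = [parse_amount(run) for run in runs]
print(len(set(parsed)) == 1)  # True: all three runs carry the same value
```

Normalization only helps with syntactic drift. If the normalized values still differ, the drift is semantic or factual and needs a different response.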
2. Semantic Drift
Changes in meaning or interpretation.
Run 1: "High risk - recommend denial"
Run 2: "Moderate risk - manual review suggested"
Run 3: "Acceptable risk with conditions"
Impact: Different business outcomes, inconsistent decisions
3. Factual Drift
Contradictory or incorrect information across runs.
Run 1: "Company reported $500M revenue in Q4"
Run 2: "Q4 revenue was $550M according to the filing"
Run 3: "Revenue not disclosed in available documents"
Impact: Compliance violations, incorrect recommendations
Regulatory Context
Why Financial Services Care
- Model Risk Management (SR 11-7): Federal Reserve supervisory guidance expects models to be independently validated
- Fair Lending (ECOA): Consistent treatment of similar applicants
- Explainability (GDPR, FCRA): "Right to explanation" for automated decisions
- Audit Trail: Must reproduce past decisions for regulatory review
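The audit-trail requirement in the list above can be sketched as a fingerprint stored alongside each decision. The example below hashes the request and the output so that a later replay can be checked against the original. The record layout and the function names are hypothetical; a real audit log would also need timestamps, model version identifiers, and retention controls.

```python
import hashlib
import json


def sha256_hex(text):
    """Return the SHA-256 digest of a string as hex."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(model, prompt, params, output):
    """Build a hypothetical audit entry for one model call."""
    # sort_keys makes the serialized request independent of dict ordering.
    request = json.dumps(
        {"model": model, "prompt": prompt, "params": params}, sort_keys=True
    )
    return {
        "model": model,
        "params": params,
        "request_hash": sha256_hex(request),
        "output_hash": sha256_hex(output),
    }


def is_reproduced(original, replay):
    """True if a replay used the same request and produced the same output."""
    return (
        original["request_hash"] == replay["request_hash"]
        and original["output_hash"] == replay["output_hash"]
    )


params = {"temperature": 0.0, "seed": 42}
prompt = "Is this transaction reportable?"
first = audit_record("example-model", prompt, params, "Yes")
same = audit_record("example-model", prompt, params, "Yes")
drifted = audit_record("example-model", prompt, params, "No")
print(is_reproduced(first, same), is_reproduced(first, drifted))
```

A replay that matches the request hash but not the output hash is direct evidence of output drift for that decision.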
The Drift Challenge
"An AI system that produces different recommendations for identical inputs fails the fundamental requirement of consistency needed for regulatory compliance."
— Financial Services AI Governance Guidelines
Hands-On: Observe Drift in Action
Let's see drift firsthand with a simple example:
Step 1: Create a Test Script
Create a file called test_drift_simple.py:
from openai import OpenAI

# Point the OpenAI client at a local Ollama server (or change to your provider)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the client, not actually used by Ollama
)

prompt = "What is 2+2? Answer with just the number."

print("Testing drift with 5 identical runs:\n")
for i in range(1, 6):
    # Same model, prompt, temperature, and seed on every call
    response = client.chat.completions.create(
        model="qwen2.5:7b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=42,  # Explicit seed
    )
    answer = response.choices[0].message.content
    print(f"Run {i}: {answer}")
Step 2: Run the Test
python test_drift_simple.py
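Expected Output
You may see variation even for this simple task. The output below is illustrative; a Tier 1 model such as Qwen2.5-7B may instead return the same answer on every run: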
Testing drift with 5 identical runs:
Run 1: 4
Run 2: 4
Run 3: The answer is 4
Run 4: 4
Run 5: 2 + 2 = 4
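To turn a set of runs like this into a number, you can count how often the most common answer appears. The sketch below does that for the five sample answers above. It assumes consistency means the share of runs that match the most frequent output; the metric used in the research may be defined differently.

```python
from collections import Counter

# The five sample answers from the output above.
runs = ["4", "4", "The answer is 4", "4", "2 + 2 = 4"]


def consistency(outputs):
    """Return the most common output and the share of runs that match it."""
    counts = Counter(outputs)
    modal_output, modal_count = counts.most_common(1)[0]
    return modal_output, modal_count / len(outputs)


modal_output, rate = consistency(runs)
print(f"{len(set(runs))} distinct outputs, modal answer {modal_output!r}")
print(f"Consistency: {rate:.0%}")
```

In your own script, collect each `answer` into a list inside the loop and pass that list to `consistency` once the loop finishes.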
Discussion Point
Why do you think even a simple arithmetic question can show drift?
Key Takeaways
- Temperature=0.0 ≠ Determinism: Even "deterministic" settings show drift
- Task Matters: Structured tasks (SQL) are more stable than open-ended tasks (RAG)
- Regulatory Risk: Inconsistency threatens compliance in regulated industries
- Provider Variance: Different models/providers show different drift characteristics
- Measurement is Essential: You can't manage what you don't measure
Quiz: Test Your Understanding
Question 1: What is output drift?
Answer: Inconsistent outputs from an LLM given identical inputs and settings.
Question 2: Why is drift a problem for financial services?
Answer: It creates inconsistent decisions that violate regulatory requirements for fairness, explainability, and auditability.
Question 3: Which task type showed the highest drift in the research?
Answer: RAG (Retrieval-Augmented Generation) tasks, especially at temperature > 0.0.
Question 4: Does setting temperature=0.0 eliminate drift?
Answer: It depends on model size! Tier 1 models (7-20B) achieve 100% consistency at T=0.0, but Tier 3 models (120B+) show only 12.5% consistency even at T=0.0.
Next Steps
Now that you understand what output drift is and why it matters:
- Proceed to Lab 2: Setting Up Your Environment to configure API keys and providers
- Review the full research paper in docs/resources/paper.md
- Think about how drift might affect your own AI applications
Lab 1 Complete!
You now understand output drift and its implications. Ready to configure your environment? Move on to Lab 2: Setting Up Your Environment!