Lab 3: Running Your First Experiment¶
Overview¶
In this lab, you'll run a complete drift evaluation experiment, just like the ones from the paper. You'll test different concurrency levels, temperatures, and task types to understand how these factors affect determinism.
Duration: ~30 minutes
Learning Objectives¶
By the end of this lab, you will:
- Run experiments with varying concurrency (1, 4, 16 runs)
- Compare drift at temperature 0.0 vs 0.2
- Understand how task types affect consistency
- Analyze JSONL audit trails
- Reproduce key findings from the paper
Prerequisites¶
- Completed Lab 2: Setting Up Your Environment
- At least one provider configured (Ollama with qwen2.5:7b-instruct recommended)
- Synthetic database generated (data/toy_finance.sqlite)
Experimental Design (Paper Methodology)¶
Our paper evaluated 5 models across 480 runs with the following design:
| Parameter | Values |
|---|---|
| Models | Qwen2.5-7B, Granite-3-8B, Llama-3.3-70B, Mistral-Medium, GPT-OSS-120B |
| Temperatures | 0.0, 0.2 |
| Concurrency | n=16 per condition |
| Tasks | SQL generation, RAG (Text-to-SQL), JSON summarization |
In this lab, we'll run a subset to understand the methodology, then you can scale to full experiments.
Step 1: Single-Run Baseline (Concurrency = 1)¶
Let's start with a single run to establish a baseline:
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.0 \
--concurrency 1 \
--tasks sql \
--repeats 1
Expected output:
Output Drift Evaluation Framework
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Configuration:
Provider: ollama
Model: qwen2.5:7b-instruct
Temperature: 0.0
Concurrency: 1
Task: sql
Prompt: "Generate SQL to find all customers with account balance > $100,000"
Run 1/1...
Response: SELECT customer_name, account_balance FROM accounts
WHERE account_balance > 100000
Execution time: 1.2s
Results:
Runs completed: 1
Schema valid: ✅ Yes
Audit trail: traces/lab3_single.jsonl
✅ Single-run baseline complete!
Analysis: With n=1, we can't measure drift yet. We need multiple runs.
Step 2: Low Concurrency Test (n=4)¶
Now let's run 4 concurrent queries:
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.0 \
--concurrency 4 \
--tasks sql \
--repeats 4
Expected output:
Running 4 concurrent queries...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 [00:05]
Results:
Consistency: 100.0% (4/4 identical)
Mean Drift: 0.000
Jaccard Similarity: 1.000
Schema Violations: 0
Decision Flips: 0
Unique responses: 1
Response 1 (4 occurrences):
"SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000"
✅ Perfect consistency at n=4!
Tier 1 Performance
7-20B models maintain 100% consistency even with concurrent requests, which is critical for production workloads.
Step 3: Paper-Standard Test (n=16)¶
Now run the same configuration used in the paper (n=16):
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.0 \
--concurrency 16 \
--tasks sql \
--repeats 16
Expected output:
Running 16 concurrent queries...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 [00:12]
Results:
Consistency: 100.0% (16/16 identical)
Mean Drift: 0.000
Jaccard Similarity: 1.000
Schema Violations: 0
Decision Flips: 0
Unique responses: 1
Response 1 (16 occurrences):
"SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000"
✅ Perfect consistency at n=16!
Key Finding: Qwen2.5-7B achieves 100% consistency at n=16, confirming Tier 1 classification.
Step 4: Temperature Sensitivity Test¶
Now let's test what happens when we increase temperature to 0.2:
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.2 \
--concurrency 16 \
--tasks sql \
--repeats 16
Expected output (SQL task):
Running 16 concurrent queries...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 [00:12]
Results:
Consistency: 100.0% (16/16 identical)
Mean Drift: 0.000
Temperature: 0.2
✅ SQL generation remains deterministic at T=0.2!
Structured Task Resilience
SQL generation maintains 100% consistency even at T=0.2 because of its structured output format and deterministic syntax.
Step 5: RAG Task Comparison¶
Now let's test a RAG task, which our paper shows is more susceptible to drift:
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.0 \
--concurrency 16 \
--tasks rag \
--repeats 16
Expected output:
Running 16 concurrent queries...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16/16 [00:18]
Task: RAG (Retrieval-Augmented Generation)
Prompt: "What were Citigroup's net credit losses in 2023?"
Results:
Consistency: 93.75% (15/16 identical)
Mean Drift: 0.012
Factual Drift: 0.000
Citation Accuracy: 1.0
Unique responses: 2
Response 1 (15 occurrences):
"According to Citigroup's 2024 10-K (page 145), net credit losses were $2.4 billion in 2023."
Response 2 (1 occurrence):
"Citigroup reported net credit losses of $2.4B in 2023 (10-K filing, page 145)."
✅ Minor syntactic drift, but factual consistency maintained!
RAG vs SQL
RAG tasks show slightly lower consistency (93.75% vs 100%) due to:
- Broader output space (natural language)
- Retrieval context variations
- Formatting flexibility
Now test RAG at T=0.2:
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.2 \
--concurrency 16 \
--tasks rag \
--repeats 16
Expected output (from paper findings):
Results:
Consistency: 56.25% (9/16 identical)
Mean Drift: 0.081
Factual Drift Range: 0.000 - 0.375
⚠️ Substantial drift at T=0.2 for RAG tasks!
Paper Finding Confirmed: RAG tasks at T=0.2 show 56.25% consistency, making them unsuitable for compliance workflows without strict T=0.0.
Step 6: Multi-Task Evaluation¶
Run all three task types in sequence:
# Run all three tasks at once
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct \
--temperatures 0.0 \
--concurrency 16 \
--tasks rag,summary,sql \
--repeats 16
Summary script to compare results:
Create analyze_lab3.py:
import json
from collections import Counter

import pandas as pd

tasks = ["sql", "summary", "rag"]
results = []

for task in tasks:
    with open(f"traces/lab3_{task}.jsonl") as f:
        data = [json.loads(line) for line in f]
    # Consistency = share of runs matching the most common response hash
    hash_counts = Counter(d["response_hash"] for d in data)
    consistency_pct = hash_counts.most_common(1)[0][1] / len(data) * 100
    results.append({
        "Task": task.upper(),
        "Runs": len(data),
        "Consistency": f"{consistency_pct:.1f}%",
        "Mean Drift": f"{sum(d['compliance_metrics']['factual_drift'] for d in data) / len(data):.3f}",
    })

df = pd.DataFrame(results)
print("\nMulti-Task Evaluation Results (T=0.0, n=16)")
print("=" * 60)
print(df.to_string(index=False))
Run it:
python analyze_lab3.py
Expected output:
Multi-Task Evaluation Results (T=0.0, n=16)
============================================================
   Task  Runs Consistency Mean Drift
    SQL    16      100.0%      0.000
SUMMARY    16      100.0%      0.000
    RAG    16       93.8%      0.012
Understanding the Results¶
Consistency Metric¶
Formula: consistency = (identical_responses / total_runs) * 100
- 100%: All responses identical (byte-for-byte)
- 93.75%: 15/16 responses identical, 1 syntactic variant
- <90%: Significant drift, not compliance-safe
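In code, consistency is the share of runs whose output matches the most common response. A minimal sketch (the helper name `consistency_pct` is ours, not part of the framework):

```python
from collections import Counter

def consistency_pct(responses):
    """Percent of runs whose output matches the most common response, byte-for-byte."""
    if not responses:
        return 0.0
    modal_count = Counter(responses).most_common(1)[0][1]
    return modal_count / len(responses) * 100

# 15 of 16 runs identical, one syntactic variant -> 93.75%
runs = ["SELECT 1;"] * 15 + ["select 1;"]
print(f"{consistency_pct(runs):.2f}%")  # 93.75%
```

Using the modal response (rather than the first run) as the reference means a single outlier costs one run's worth of consistency, matching the 15/16 = 93.75% case above.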
Mean Drift Metric¶
Formula: Jaccard distance between token sets
- 0.000: Perfect determinism
- 0.012: Minor syntactic variation
- >0.05: Semantic drift
- >0.1: Factual inconsistencies
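A simple way to compute this distance is over whitespace-delimited token sets. The framework's exact tokenizer isn't shown here, so treat this as an illustrative sketch whose numbers won't match the reported drift exactly:

```python
def jaccard_distance(a: str, b: str) -> float:
    """Jaccard distance between two responses' token sets (0.0 = identical sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

# Identical token sets have zero drift; disjoint outputs have drift 1.0
assert jaccard_distance("SELECT 1", "select 1") == 0.0
assert jaccard_distance("alpha", "beta") == 1.0
```

Note that naive whitespace tokenization keeps punctuation attached to words, so syntactic variants score higher here than under the framework's normalization; only the shape of the metric carries over.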
Paper Findings Reproduced¶
| Task | Expected (Paper) | Your Results | Match? |
|---|---|---|---|
| SQL (T=0.0) | 100% | 100% | ✅ |
| Summarize (T=0.0) | 100% | 100% | ✅ |
| RAG (T=0.0) | 93.75% | ~94% | ✅ |
Analyzing Audit Trails¶
Audit trails are stored as JSONL (JSON Lines): one JSON object per line.
View a specific run:
# Pretty-print the 5th run
sed -n '5p' traces/lab3_concurrent_16.jsonl | python -m json.tool
Example entry:
{
"timestamp": "2025-11-07T14:23:45.123Z",
"run_id": "lab3_concurrent_16_005",
"model": "qwen2.5:7b-instruct",
"provider": "ollama",
"temperature": 0.0,
"seed": 42,
"concurrency_idx": 5,
"task_type": "sql",
"prompt": "Generate SQL to find all customers with account balance > $100,000",
"response": "SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000",
"prompt_hash": "sha256:a3d8f92b1c4e5f6789abcdef",
"response_hash": "sha256:b2c1e7d8a9f6543210fedcba",
"execution_time_ms": 1245,
"compliance_metrics": {
"schema_valid": true,
"citation_accuracy": 1.0,
"decision_flip": false,
"factual_drift": 0.0
},
"regulatory_mappings": {
"FSB": "consistent_decisions",
"CFTC": "document_ai_outcomes",
"SR_11_7": "model_validation"
}
}
Key fields for audits:
- prompt_hash: SHA-256 of input (for duplicate detection)
- response_hash: SHA-256 of output (for consistency checking)
- compliance_metrics: Drift measures
- regulatory_mappings: Compliance framework mappings
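Because response_hash is just a SHA-256 digest of the response text, an auditor can re-verify a trace independently. A sketch, assuming the hash is computed over the UTF-8 bytes of the response field (the framework's exact canonicalization may differ):

```python
import hashlib
import json

def verify_entry(entry: dict) -> bool:
    """Recompute SHA-256 of the response and compare it to the stored hash."""
    digest = hashlib.sha256(entry["response"].encode("utf-8")).hexdigest()
    return entry["response_hash"] == f"sha256:{digest}"

# Re-verify every line of a trace file (path from the steps above):
# with open("traces/lab3_concurrent_16.jsonl") as f:
#     for line in f:
#         entry = json.loads(line)
#         assert verify_entry(entry), f"tampered or mismatched: {entry['run_id']}"
```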
Comparing Audit Trails¶
Compare two runs to find differences:
import json

# Load two runs
with open("traces/lab3_concurrent_16.jsonl") as f:
    lines = f.readlines()

run1 = json.loads(lines[0])
run2 = json.loads(lines[1])

print("Run 1 response hash:", run1["response_hash"])
print("Run 2 response hash:", run2["response_hash"])
print("Identical?", run1["response_hash"] == run2["response_hash"])

if run1["response"] != run2["response"]:
    print("\nResponse Diff:")
    print("Run 1:", run1["response"])
    print("Run 2:", run2["response"])
else:
    print("\n✅ Responses are identical!")
Advanced: Full Paper Replication¶
To reproduce the paper's 480-run evaluation (the command below covers three of the five models; add the remaining providers for the full set):
# This will take ~30-45 minutes
python run_evaluation.py \
--providers ollama \
--models qwen2.5:7b-instruct,granite-3-8b,llama-3.3-70b \
--temperatures 0.0,0.2 \
--concurrency 1,4,16 \
--tasks rag,summary,sql \
--repeats 16
Resource Intensive
Full replication requires:
- All 5 models available (some may require API keys)
- ~45 minutes of runtime
- ~500 MB of trace data
Troubleshooting¶
Inconsistent Results¶
If you're seeing drift where you shouldn't (e.g., SQL at T=0.0):
# Check model version
ollama show qwen2.5:7b-instruct
# Ensure seed is set
# In run_evaluation.py, verify: seed=42
Rate Limiting¶
If using cloud providers (watsonx, OpenAI):
# Add rate limiting in configuration
--rate-limit 10 # requests per minute
--retry-delay 5 # seconds between retries
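If your build lacks these flags, or you are scripting provider calls yourself, the same idea can be approximated with an exponential-backoff wrapper. A hedged sketch (`call_with_retries` is our own helper, not a framework API):

```python
import random
import time

def call_with_retries(fn, max_attempts=5, base_delay=5.0):
    """Call fn(); on failure, retry with exponential backoff plus jitter.

    fn is any zero-argument callable that raises on a rate-limit (429) response.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # 5s, 10s, 20s, ... each with up to +100% random jitter
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Jitter spreads retries out so 16 concurrent workers don't all hit the rate limiter again at the same instant.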
Out of Memory¶
For large concurrency (n=16):
# Reduce batch size
--batch-size 4 # Process 4 at a time instead of 16
Key Takeaways¶
- 7-20B models (Tier 1) achieve 100% consistency at T=0.0 for all tasks
- Concurrency doesn't affect consistency for Tier 1 models (n=1, 4, or 16)
- Task structure matters: SQL/summarization > RAG for determinism
- Temperature sensitivity: RAG tasks degrade significantly at T=0.2
- Audit trails provide complete reproducibility for regulatory review
Quiz: Test Your Understanding¶
Why does SQL maintain 100% consistency even at T=0.2?
Answer: SQL has a structured output format with deterministic syntax, limiting the output space and making it more resistant to temperature-induced drift.
What consistency % did RAG tasks achieve at T=0.2 in the paper?
Answer: 56.25% (9/16 runs identical), showing substantial drift that makes them unsuitable for compliance workflows at elevated temperatures.
What is the purpose of the response_hash field in audit trails?
Answer: SHA-256 hashing enables fast consistency checking across runs without string comparison, which is critical for large-scale audits.
Next Steps¶
Now that you've run experiments and understand the methodology:
- Proceed to Lab 4: Analyzing Drift Metrics to visualize and interpret results
- Explore different prompts in prompts/templates.json
- Try modifying temperature and concurrency parameters
Lab 3 Complete!
You've successfully run drift evaluations and reproduced key paper findings! Ready to analyze the data? Move on to Lab 4: Analyzing Drift Metrics!