Lab 2: Setting Up Your Environment

Overview

In this lab, you'll configure API keys, test provider connectivity, and run your first deterministic evaluation to understand the framework's core components.

Duration: ~15 minutes

Learning Objectives

By the end of this lab, you will:

  • Configure API keys for at least one provider (Ollama recommended)
  • Understand the DeterministicRetriever and its role in compliance
  • Test framework components with a simple evaluation
  • Generate your first audit trail

Prerequisites

Before starting, you should have completed Lab 1, with the repository cloned and its dependencies installed in an active virtual environment.

Step 1: Verify Ollama Installation

If using Ollama (recommended for getting started):

# Check if Ollama is running
curl http://localhost:11434/api/tags

If not running, start Ollama:

ollama serve

Pull the recommended model (if not already done):

ollama pull qwen2.5:7b-instruct

Why Qwen2.5:7B?

According to our research, 7-20B models achieve 100% deterministic outputs at T=0.0, making them ideal for regulated financial applications. Qwen2.5:7B is a Tier 1 model—audit-ready and compliance-safe.

Step 2: Configure Environment Variables

Create or edit your .env file in the repository root:

# Navigate to repository root
cd /path/to/output-drift-financial-llms

# Create .env file
touch .env

Add your API configuration:

# Ollama (local, free)
OLLAMA_BASE_URL=http://localhost:11434

# IBM watsonx.ai (optional but recommended for cross-provider validation)
WATSONX_API_KEY=your_api_key_here
WATSONX_PROJECT_ID=your_project_id_here
WATSONX_URL=https://us-south.ml.cloud.ibm.com

# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Google Gemini (optional)
GEMINI_API_KEY=your_gemini_api_key_here

Sensitive Data

Never commit .env to Git! It's already in .gitignore.
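As a quick sanity check, you can confirm which providers are configured before moving on. This is a minimal sketch, assuming the python-dotenv package is available in your environment (it is not part of the framework itself):

# check_env.py -- hypothetical helper to see which providers are configured
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current directory

providers = {
    "Ollama": os.getenv("OLLAMA_BASE_URL"),
    "watsonx.ai": os.getenv("WATSONX_API_KEY"),
    "Anthropic": os.getenv("ANTHROPIC_API_KEY"),
    "Gemini": os.getenv("GEMINI_API_KEY"),
}
for name, value in providers.items():
    print(f"{name}: {'configured' if value else 'not set'}")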

Step 3: Generate Synthetic Financial Database

Our framework uses a synthetic financial database for SQL generation tasks:

python data/generate_toy_finance.py

Expected output:

🏦 Generating synthetic financial database...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Created tables:
  ✅ customers (100 records)
  ✅ accounts (150 records)
  ✅ transactions (500 records)
  ✅ loans (75 records)

Database: data/toy_finance.sqlite (45 KB)
✅ Generation complete!

This creates data/toy_finance.sqlite containing realistic financial data for testing.
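If you want to verify the database yourself, a quick check with Python's built-in sqlite3 module works (this is just an optional inspection step, not part of the framework):

# Optional: inspect the generated database with the sqlite3 stdlib module
import sqlite3

conn = sqlite3.connect("data/toy_finance.sqlite")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print([t[0] for t in tables])  # expect accounts, customers, loans, transactions
conn.close()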

Step 4: Test Framework Components

Let's test the core framework components to ensure everything is working.

Test 1: DeterministicRetriever

The DeterministicRetriever (harness/deterministic_retriever.py) is crucial for compliance—it ensures SEC 10-K retrieval order is deterministic and reproducible.

Create test_retriever.py:

from harness.deterministic_retriever import create_retriever_from_files

# Initialize retriever from SEC filings directory
retriever = create_retriever_from_files(
    corpus_path="data/sec",  # SEC 10-K filings
    chunk_size=200,
    overlap=50
)

# Test query
query = "What were net credit losses in 2023?"
results = retriever.retrieve(query, k=5)

print("Deterministic Retrieval Test")
print("=" * 50)
for i, (snippet_id, text, metadata) in enumerate(results, 1):
    print(f"\nChunk {i}:")
    print(f"  Snippet ID: {snippet_id}")
    print(f"  Text: {text[:100]}...")

print("\nRetrieval is deterministic with stable ordering!")

Run it:

python test_retriever.py

Why Multi-Key Ordering?

The retriever uses multi-key ordering (score ↓, then section_priority ↑, snippet_id ↑, chunk_idx ↑) because retrieval order is itself a compliance requirement, not a performance optimization. Even when relevance scores tie, the tie-breaking keys guarantee the same chunks are retrieved in the same order every time.
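Here's a small, self-contained illustration of that multi-key sort. The dictionary fields are invented for the example and are not the retriever's actual internals:

# Illustrative only: a multi-key sort like the one the retriever performs
chunks = [
    {"score": 0.91, "section_priority": 2, "snippet_id": "10K-07", "chunk_idx": 3},
    {"score": 0.91, "section_priority": 1, "snippet_id": "10K-02", "chunk_idx": 0},
    {"score": 0.95, "section_priority": 3, "snippet_id": "10K-11", "chunk_idx": 5},
]

# score descending, then section_priority / snippet_id / chunk_idx ascending
ordered = sorted(
    chunks,
    key=lambda c: (-c["score"], c["section_priority"], c["snippet_id"], c["chunk_idx"]),
)
for c in ordered:
    print(c["snippet_id"], c["score"])

Note how the two chunks tied at 0.91 always come back in the same order: the secondary keys break the tie deterministically.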

Test 2: Simple Drift Evaluation

Now let's run a minimal drift test: five identical queries at T=0.0 through Ollama's OpenAI-compatible endpoint, using the OpenAI Python client:

Create test_simple_drift.py:

#!/usr/bin/env python3
"""Simple drift evaluation using Ollama via OpenAI client."""
from collections import Counter

from openai import OpenAI

# Initialize Ollama client
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Not used by Ollama
)

# Simple prompt
prompt = "What is the sum of 2 + 2? Answer with just the number."

print("🧪 Running 5 identical queries at T=0.0")
print("=" * 50)

responses = []
for i in range(1, 6):
    response = client.chat.completions.create(
        model="qwen2.5:7b-instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        seed=42
    )
    answer = response.choices[0].message.content
    responses.append(answer)
    print(f"Run {i}: {answer}")

# Check consistency: the run passes only if every response is identical
counts = Counter(responses)
consistency_pct = 100 * max(counts.values()) / len(responses)

print("\n" + "=" * 50)
print(f"Unique responses: {len(counts)}")
print(f"Consistency: {'✅' if len(counts) == 1 else '❌'} {consistency_pct:.0f}%")

Run it:

python test_simple_drift.py

Expected output for Tier 1 models (Qwen2.5:7B, Granite-3-8B):

🧪 Running 5 identical queries at T=0.0
==================================================
Run 1: 4
Run 2: 4
Run 3: 4
Run 4: 4
Run 5: 4

==================================================
Unique responses: 1
Consistency: ✅ 100%

Tier 1 Determinism

7-20B models achieve 100% consistency at T=0.0—this is what makes them audit-ready!

Step 5: Understanding Task Definitions

The framework defines three core financial tasks in harness/task_definitions.py:

# View task definitions
cat harness/task_definitions.py

The three core tasks:

Task        File Reference                       Tier 1 Consistency   Purpose
SQL         harness/task_definitions.py:20-45    100%                 Text-to-SQL generation
Summarize   harness/task_definitions.py:47-72    100%                 JSON summarization with schema
RAG         harness/task_definitions.py:74-99    93.75%               Retrieval-augmented Q&A

Each task includes:

  • System prompts optimized for determinism
  • Temperature=0.0 and seed=42 defaults
  • Validation schemas (JSON Schema for summarization, a SQL syntax checker)
  • Citation requirements (for RAG tasks)
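The actual definitions live in harness/task_definitions.py. As a rough sketch of the shape such a definition might take (the field names below are assumptions for illustration, not the framework's real structure):

# Hypothetical shape of a task definition -- see harness/task_definitions.py
# for the framework's actual structure
SQL_TASK = {
    "name": "sql",
    "system_prompt": "You are a precise text-to-SQL assistant. Output only SQL.",
    "temperature": 0.0,          # determinism default
    "seed": 42,                  # determinism default
    "validator": "sql_syntax",   # output checked by a SQL syntax validator
}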

Step 6: Review Sample Audit Trail

The framework generates JSONL (JSON Lines) audit trails with regulatory mappings. Let's examine the sample provided:

# View sample audit trail entry
head -n 1 examples/sample_audit_trail.jsonl | python -m json.tool

Example audit trail entry:

{
  "timestamp": "2025-11-07T13:45:23Z",
  "run_id": "lab2_test_001",
  "model": "qwen2.5:7b-instruct",
  "provider": "ollama",
  "temperature": 0.0,
  "seed": 42,
  "prompt_hash": "a3d8f92b1c4e5f6789abcdef...",
  "response_hash": "b2c1e7d8a9f6543210fedcba...",
  "task_type": "sql",
  "response": "SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000",
  "compliance_metrics": {
    "citation_accuracy": 1.0,
    "schema_valid": true,
    "decision_flip": false,
    "factual_drift": 0.0
  },
  "regulatory_mappings": {
    "FSB_principle": "consistent_decisions",
    "CFTC_requirement": "document_ai_outcomes",
    "SR_11_7": "model_validation"
  }
}

Bi-Temporal Logging

The audit trail uses bi-temporal logging to enable regulatory review and attestation months after decisions were made—critical for financial audits.
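For intuition, here is a minimal sketch of how an entry like the one above could be appended to a JSONL file. The field names mirror the sample entry, but the hashing scheme (SHA-256) is an assumption, not necessarily what the framework uses:

# Minimal sketch: append one audit-trail entry as a JSON Line
import hashlib
import json
from datetime import datetime, timezone

prompt = "List accounts with balance over 100000."
response = "SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000"

entry = {
    "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "run_id": "lab2_test_001",
    "model": "qwen2.5:7b-instruct",
    "temperature": 0.0,
    "seed": 42,
    "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest(),    # assumed SHA-256
    "response_hash": hashlib.sha256(response.encode()).hexdigest(),
}

with open("audit_trail.jsonl", "a") as f:
    f.write(json.dumps(entry) + "\n")  # one JSON object per line

Hashing the prompt and response lets an auditor verify, months later, that the logged interaction is exactly what was sent and received.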

Understanding Framework Components

1. DeterministicRetriever

File: harness/deterministic_retriever.py

from harness.deterministic_retriever import create_retriever_from_files

retriever = create_retriever_from_files(
    corpus_path="data/sec",
    chunk_size=200,
    overlap=50
)

Purpose: Ensures SEC 10-K retrieval is deterministic and auditable.

Features:

  • Multi-key ordering (score, section priority, snippet ID, chunk index)
  • Stable chunk IDs for reproducibility
  • Section-aware retrieval (prioritizes financial statement sections)

2. Task Definitions

The framework includes 3 core task types:

Task      Description                                      Tier 1 Consistency
SQL       Text-to-SQL generation from natural language     100%
Summary   JSON summarization of financial data             100%
RAG       Retrieval-augmented Q&A over SEC 10-Ks           93.75%

Why SQL and Summary achieve perfect scores:

  • Structured output formats
  • Deterministic syntax
  • Narrow output space
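To see why structured output is easy to validate, here is a minimal, hypothetical sketch of the kind of schema check the summarization task performs. The schema below is invented for illustration (the real ones live in harness/task_definitions.py), and it assumes the jsonschema package is installed:

# Sketch: validating a summarization output against a JSON Schema
import json

from jsonschema import ValidationError, validate  # assumes jsonschema is installed

# Hypothetical schema -- not the framework's actual one
schema = {
    "type": "object",
    "properties": {
        "net_income": {"type": "number"},
        "period": {"type": "string"},
    },
    "required": ["net_income", "period"],
}

output = json.loads('{"net_income": 9228.0, "period": "FY2023"}')
try:
    validate(instance=output, schema=schema)
    print("schema_valid: true")
except ValidationError as e:
    print(f"schema_valid: false ({e.message})")

Because a response either conforms to the schema or it doesn't, there is no gray area for drift to hide in, which is part of why structured tasks score 100%.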

3. Cross-Provider Validation

File: harness/cross_provider_validation.py

from harness.cross_provider_validation import CrossProviderValidator

validator = CrossProviderValidator(
    providers=["ollama", "watsonx"],
    tolerance_pct=5.0  # GAAP materiality threshold
)

# Validate pre-collected outputs from different providers
outputs = {"ollama": ollama_result, "watsonx": watsonx_result}
results = validator.validate(outputs, task_type="sql")
print(f"Consistent: {results['consistent']}")
print(f"Similarity: {results['similarity_scores']}")

Purpose: Validate consistency between local (Ollama) and cloud (watsonx.ai) deployments using pre-collected outputs.

GAAP Materiality: Uses ±5% threshold from GAAP auditing standards for financial statement materiality.
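As a rough illustration of what a ±5% materiality comparison looks like in practice (this is just the arithmetic, not the validator's actual internals):

# Illustrative materiality check: are two numeric outputs within ±5%?
def within_materiality(a: float, b: float, tolerance_pct: float = 5.0) -> bool:
    """Return True if b differs from a by no more than tolerance_pct percent."""
    if a == 0:
        return b == 0
    return abs(a - b) / abs(a) * 100 <= tolerance_pct

print(within_materiality(1_000_000, 1_030_000))  # True: 3% difference
print(within_materiality(1_000_000, 1_080_000))  # False: 8% difference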

Troubleshooting

Ollama Connection Failed

# Check if Ollama is running
curl http://localhost:11434/api/tags

# If not, start it:
ollama serve

Model Not Found

# List available models
ollama list

# Pull the model if missing
ollama pull qwen2.5:7b-instruct

Database Not Found

# Regenerate the database
python data/generate_toy_finance.py

Import Errors

# Ensure virtual environment is activated
source venv/bin/activate  # macOS/Linux
# or
venv\Scripts\activate  # Windows

# Reinstall dependencies
pip install -r requirements.txt

Key Takeaways

  1. Tier 1 Models: 7-20B models (Qwen2.5, Granite-3-8B, GPT-OSS-20B) achieve 100% determinism
  2. DeterministicRetriever: Ensures reproducible SEC 10-K retrieval
  3. Audit Trails: Bi-temporal JSONL logging enables regulatory review
  4. Task Types: SQL and summarization are perfectly deterministic; RAG requires careful configuration
  5. Cross-Provider: Can validate consistency between local and cloud deployments

Quiz: Test Your Understanding

Why use multi-key ordering in DeterministicRetriever?

Answer: To ensure retrieval order is deterministic and reproducible for compliance. Even if chunks have the same relevance score, they must be returned in a consistent order for audit trails.

What makes 7-20B models Tier 1 (audit-ready)?

Answer: They achieve 100% consistency at T=0.0 across all task types, meeting regulatory requirements for reproducibility.

What is the GAAP materiality threshold used in cross-provider validation?

Answer: ±5%, based on GAAP auditing standards for financial statement materiality.

Next Steps

Now that your environment is configured and you understand the framework components:

  1. Proceed to Lab 3: Running Your First Experiment to run drift evaluations
  2. Review task definitions in harness/task_definitions.py
  3. Examine the DeterministicRetriever implementation in harness/deterministic_retriever.py
  4. Study the CrossProviderValidator code in harness/cross_provider_validation.py

Lab 2 Complete!

Your environment is configured and tested. Ready to run experiments? Move on to Lab 3: Running Your First Experiment!