Lab 6: Extending the Framework¶

Overview¶

In this lab, you'll learn how to customize the framework for your own use cases by adding new tasks, modifying prompt templates, and integrating with your workflows.

Duration: ~30 minutes

Learning Objectives¶

By the end of this lab, you will:

Add custom tasks to run_evaluation.py
Modify existing prompts for your domain
Integrate the framework into CI/CD pipelines
Create custom compliance validators
Export results for regulatory reporting

Prerequisites¶

Completed Lab 5: Cross-Provider Testing
Understanding of JSON structure
Familiarity with your organization's compliance requirements

Framework Architecture¶

The framework is designed for extensibility:

output-drift-financial-llms/
├── prompts/
│   └── templates.json          # ← Add your custom tasks here
├── harness/
│   ├── task_definitions.py     # Task execution logic
│   ├── deterministic_retriever.py
│   └── cross_provider_validation.py
├── data/
│   ├── sec_filings/            # ← Add your own documents
│   └── toy_finance.sqlite      # ← Use your database
└── examples/                   # ← Reference implementations

Step 1: Understanding Task Structure¶

Task prompts are defined in run_evaluation.py (see the build_prompts() function) and task formatting/validation lives in harness/task_definitions.py.

# View the prompt builder
grep -A 30 "def build_prompts" run_evaluation.py

Current tasks: - rag: RAG Q&A over SEC 10-K filings - summary: JSON summarization with schema validation - sql: Text-to-SQL generation

Each task uses: - Task formatting functions in harness/task_definitions.py - System prompts built into the formatting functions - temperature: 0.0 for determinism - seed: 42 for reproducibility

Step 2: Add a Custom Task - Credit Risk Analysis¶

Let's add a new task for credit risk assessment:

To add a custom task, define a new task configuration. Here's an example credit risk assessment task structure:

{
  "credit_risk": {
    "description": "Credit risk classification with explainability requirements",
    "prompts": [
      {
        "id": "cr1",
        "profile": {
          "credit_score": 680,
          "income": 75000,
          "debt_to_income": 0.20,
          "employment_years": 5
        },
        "question": "Classify credit risk (LOW/MEDIUM/HIGH) and explain in one sentence.",
        "expected_risk": "LOW",
        "compliance_requirements": ["ECOA", "FCRA"]
      },
      {
        "id": "cr2",
        "profile": {
          "credit_score": 620,
          "income": 50000,
          "debt_to_income": 0.45,
          "employment_years": 1
        },
        "question": "Classify credit risk (LOW/MEDIUM/HIGH) and explain in one sentence.",
        "expected_risk": "MEDIUM",
        "compliance_requirements": ["ECOA", "FCRA"]
      }
    ],
    "system_prompt": "You are a fair and consistent credit risk analyst. Classify risk as LOW, MEDIUM, or HIGH. Provide a brief explanation in one sentence. Be consistent: identical inputs must always produce identical outputs for regulatory compliance.",
    "output_schema": {
      "type": "object",
      "properties": {
        "risk_level": {"type": "string", "enum": ["LOW", "MEDIUM", "HIGH"]},
        "explanation": {"type": "string"}
      },
      "required": ["risk_level", "explanation"]
    },
    "temperature": 0.0,
    "seed": 42
  }
}

Step 3: Create Task Executor for Custom Task¶

Create custom_credit_risk.py:

#!/usr/bin/env python3
"""
Custom credit risk classification task with drift testing.
"""
import json
from openai import OpenAI

# Define custom task configuration inline
credit_risk_task = {
    "system_prompt": "You are a fair and consistent credit risk analyst. Classify risk as LOW, MEDIUM, or HIGH. Provide a brief explanation in one sentence.",
    "temperature": 0.0,
    "seed": 42
}

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def run_credit_risk_assessment(profile: dict, model: str = "qwen2.5:7b-instruct", n_runs: int = 5):
    """Run credit risk assessment n times to test consistency."""
    prompt = f"""Profile:
- Credit Score: {profile['credit_score']}
- Annual Income: ${profile['income']:,}
- Debt-to-Income Ratio: {profile['debt_to_income']:.0%}
- Employment Years: {profile['employment_years']}

{credit_risk_task['prompts'][0]['question']}"""

    results = []
    for i in range(1, n_runs + 1):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": credit_risk_task['system_prompt']},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            seed=42
        )
        output = response.choices[0].message.content
        results.append(output)
        print(f"Run {i}: {output}")

    # Check consistency
    unique = len(set(results))
    consistency = (1 / unique) * 100 if unique > 0 else 100.0

    print(f"\n📊 Results:")
    print(f"  Total runs: {n_runs}")
    print(f"  Unique outputs: {unique}")
    print(f"  Consistency: {consistency:.0f}%")
    print(f"  Status: {'✅ Audit-ready' if consistency == 100 else '⚠️ Drift detected'}")

    return results

# Test with the first profile
profile1 = credit_risk_task['prompts'][0]['profile']
print("🧪 Testing Credit Risk Assessment\n")
results = run_credit_risk_assessment(profile1, n_runs=5)

Run it:

python custom_credit_risk.py

Expected output (Tier 1 model):

🧪 Testing Credit Risk Assessment

Run 1: {"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio and stable employment history."}
Run 2: {"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio and stable employment history."}
Run 3: {"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio and stable employment history."}
Run 4: {"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio and stable employment history."}
Run 5: {"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio and stable employment history."}

📊 Results:
  Total runs: 5
  Unique outputs: 1
  Consistency: 100%
  Status: ✅ Audit-ready

Step 4: Add Domain-Specific Documents¶

To use RAG with your own documents:

Add documents to data/sec/ (or create a new folder):

mkdir -p data/custom_docs

Update deterministic_retriever.py to point to your folder:

from harness.deterministic_retriever import DeterministicRetriever

retriever = DeterministicRetriever(
    corpus_path="data/custom_docs/",  # Your documents here
    chunk_size=512,
    overlap=50
)

Test retrieval:

query = "What is our company's annual revenue?"
results = retriever.retrieve(query, top_k=5)
for i, chunk in enumerate(results, 1):
    print(f"Chunk {i}: {chunk['text'][:100]}...")

Step 5: CI/CD Integration¶

Integrate drift testing into your CI/CD pipeline:

Create .github/workflows/drift-test.yml:

name: LLM Output Drift Testing

on:
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 0 * * *'  # Daily at midnight

jobs:
  drift-test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run drift evaluation
        env:
          WATSONX_API_KEY: ${{ secrets.WATSONX_API_KEY }}
          WATSONX_PROJECT_ID: ${{ secrets.WATSONX_PROJECT_ID }}
        run: |
          python run_evaluation.py \
            --model ibm/granite-3-8b-instruct \
            --temperature 0.0 \
            --concurrency 16 \
            --task sql \
            --output traces/ci_test.jsonl

      - name: Validate consistency
        run: |
          python -c "
          import json
          with open('traces/ci_test.jsonl') as f:
              data = [json.loads(line) for line in f]
          unique = len(set(d['response_hash'] for d in data))
          assert unique == 1, f'Drift detected: {unique} unique outputs'
          print('✅ Consistency check passed')
          "

      - name: Upload audit trail
        uses: actions/upload-artifact@v3
        with:
          name: drift-test-results
          path: traces/ci_test.jsonl

This pipeline: - Runs on every PR and daily - Tests for drift with n=16 - Fails CI if drift detected - Uploads audit trails as artifacts

Step 6: Custom Compliance Validator¶

Create a validator for your specific regulations:

Create custom_compliance_validator.py:

#!/usr/bin/env python3
"""
Custom compliance validator for specific regulatory frameworks.
"""
import json
from typing import Dict, List

class CustomComplianceValidator:
    """
    Validate LLM outputs against custom compliance requirements.
    """

    def __init__(self, frameworks: List[str]):
        """
        Initialize validator.

        Args:
            frameworks: List of compliance frameworks (e.g., ["ECOA", "FCRA", "GDPR"])
        """
        self.frameworks = frameworks
        self.rules = self._load_rules()

    def _load_rules(self) -> Dict[str, callable]:
        """Load validation rules for each framework."""
        rules = {}

        # ECOA (Equal Credit Opportunity Act)
        if "ECOA" in self.frameworks:
            rules["ecoa_consistency"] = self._check_consistency
            rules["ecoa_no_discrimination"] = self._check_no_discrimination

        # FCRA (Fair Credit Reporting Act)
        if "FCRA" in self.frameworks:
            rules["fcra_explainability"] = self._check_explainability

        # GDPR
        if "GDPR" in self.frameworks:
            rules["gdpr_right_to_explanation"] = self._check_explainability
            rules["gdpr_data_minimization"] = self._check_data_minimization

        return rules

    def _check_consistency(self, outputs: List[str]) -> bool:
        """ECOA: Similar applicants must receive similar treatment."""
        unique_outputs = len(set(outputs))
        return unique_outputs == 1  # 100% consistency required

    def _check_no_discrimination(self, output: str) -> bool:
        """ECOA: No references to protected classes."""
        protected_terms = ["race", "gender", "age", "religion", "nationality"]
        return not any(term in output.lower() for term in protected_terms)

    def _check_explainability(self, output: str) -> bool:
        """FCRA/GDPR: Must include explanation."""
        return "explanation" in output.lower() or "because" in output.lower()

    def _check_data_minimization(self, output: str) -> bool:
        """GDPR: Don't expose unnecessary personal data."""
        pii_indicators = ["ssn", "social security", "passport", "driver license"]
        return not any(indicator in output.lower() for indicator in pii_indicators)

    def validate(self, outputs: List[str]) -> Dict[str, any]:
        """
        Run all validation rules.

        Args:
            outputs: List of LLM outputs to validate

        Returns:
            {
                "compliant": bool,
                "passed_rules": List[str],
                "failed_rules": List[str],
                "details": Dict[str, bool]
            }
        """
        results = {}
        for rule_name, rule_func in self.rules.items():
            if rule_name.endswith("_consistency"):
                results[rule_name] = rule_func(outputs)
            else:
                # Check all outputs
                results[rule_name] = all(rule_func(output) for output in outputs)

        passed = [k for k, v in results.items() if v]
        failed = [k for k, v in results.items() if not v]

        return {
            "compliant": len(failed) == 0,
            "passed_rules": passed,
            "failed_rules": failed,
            "details": results
        }

# Example usage
validator = CustomComplianceValidator(frameworks=["ECOA", "FCRA"])

# Test outputs from credit risk assessment
test_outputs = [
    '{"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio."}',
    '{"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio."}',
    '{"risk_level": "LOW", "explanation": "Strong credit profile with good income-to-debt ratio."}'
]

result = validator.validate(test_outputs)

print("\n📋 Compliance Validation Report")
print("=" * 60)
print(f"Compliant: {'✅ YES' if result['compliant'] else '❌ NO'}")
print(f"\nPassed rules ({len(result['passed_rules'])}):")
for rule in result['passed_rules']:
    print(f"  ✅ {rule}")
if result['failed_rules']:
    print(f"\nFailed rules ({len(result['failed_rules'])}):")
    for rule in result['failed_rules']:
        print(f"  ❌ {rule}")

Run it:

python custom_compliance_validator.py

Step 7: Export for Regulatory Reporting¶

Generate compliance reports from audit trails:

Create generate_compliance_report.py:

#!/usr/bin/env python3
"""
Generate compliance report from audit trails.
"""
import json
import pandas as pd
from datetime import datetime

def generate_report(trace_file: str, output_format: str = "html"):
    """Generate compliance report from JSONL audit trail."""

    # Load audit trail
    with open(trace_file) as f:
        traces = [json.loads(line) for line in f]

    # Calculate metrics
    total_runs = len(traces)
    unique_outputs = len(set(t['response_hash'] for t in traces))
    consistency = (unique_outputs == 1)
    consistency_pct = (1 / unique_outputs * 100) if unique_outputs > 0 else 100.0

    # Compliance metrics
    schema_violations = sum(not t['compliance_metrics']['schema_valid'] for t in traces)
    decision_flips = sum(t['compliance_metrics']['decision_flip'] for t in traces)
    mean_drift = sum(t['compliance_metrics']['factual_drift'] for t in traces) / total_runs

    # Generate HTML report
    html = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>LLM Compliance Report</title>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 40px; }}
            h1 {{ color: #0f62fe; }}
            .metric {{ padding: 10px; margin: 10px 0; border-left: 4px solid #0f62fe; background: #f4f4f4; }}
            .pass {{ border-left-color: #24a148; }}
            .fail {{ border-left-color: #da1e28; }}
            table {{ border-collapse: collapse; width: 100%; margin: 20px 0; }}
            th, td {{ border: 1px solid #ddd; padding: 12px; text-align: left; }}
            th {{ background-color: #0f62fe; color: white; }}
        </style>
    </head>
    <body>
        <h1>LLM Output Drift Compliance Report</h1>
        <p><strong>Generated:</strong> {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
        <p><strong>Audit Trail:</strong> {trace_file}</p>

        <h2>Executive Summary</h2>
        <div class="metric {'pass' if consistency else 'fail'}">
            <strong>Consistency:</strong> {consistency_pct:.1f}% ({unique_outputs} unique output{'s' if unique_outputs != 1 else ''})
        </div>
        <div class="metric {'pass' if schema_violations == 0 else 'fail'}">
            <strong>Schema Violations:</strong> {schema_violations}
        </div>
        <div class="metric {'pass' if decision_flips == 0 else 'fail'}">
            <strong>Decision Flips:</strong> {decision_flips}
        </div>
        <div class="metric {'pass' if mean_drift < 0.05 else 'fail'}">
            <strong>Mean Drift:</strong> {mean_drift:.3f}
        </div>

        <h2>Regulatory Compliance Status</h2>
        <table>
            <tr>
                <th>Requirement</th>
                <th>Status</th>
                <th>Evidence</th>
            </tr>
            <tr>
                <td>SR 11-7 (Model Validation)</td>
                <td>{'✅ PASS' if consistency else '❌ FAIL'}</td>
                <td>Deterministic behavior: {consistency_pct:.1f}%</td>
            </tr>
            <tr>
                <td>ECOA (Consistent Decisions)</td>
                <td>{'✅ PASS' if decision_flips == 0 else '❌ FAIL'}</td>
                <td>Decision flips: {decision_flips}</td>
            </tr>
            <tr>
                <td>FSB (Output Consistency)</td>
                <td>{'✅ PASS' if mean_drift < 0.05 else '❌ FAIL'}</td>
                <td>Mean drift: {mean_drift:.3f}</td>
            </tr>
        </table>

        <h2>Model Configuration</h2>
        <pre>{json.dumps(traces[0], indent=2)[:500]}...</pre>
    </body>
    </html>
    """

    # Save report
    output_file = trace_file.replace('.jsonl', '_compliance_report.html')
    with open(output_file, 'w') as f:
        f.write(html)

    print(f"✅ Compliance report generated: {output_file}")
    return output_file

# Generate report
generate_report("traces/lab3_sql.jsonl")

Run it:

python generate_compliance_report.py

Open the HTML report in your browser to see a formatted compliance report.

Key Takeaways¶

Templates are JSON - Easy to add custom tasks
Modular design - Extend components independently
CI/CD ready - Integrate into deployment pipelines
Custom validators - Implement your regulatory requirements
Exportable reports - Generate audit documentation

Best Practices for Extensions¶

Always test with n≥16 to detect drift
Use T=0.0 and explicit seeds for determinism
Document compliance mappings in audit trails
Version your prompts (metadata section)
Validate cross-provider before production deployment

Quiz: Test Your Understanding¶

Where do you add custom tasks?

Answer: run_evaluation.py - add a new top-level key with task configuration.

What's the minimum number of runs recommended for drift testing?

Answer: 16 (n=16), as used in the paper's methodology.

How do you ensure determinism in custom tasks?

Answer: Set temperature: 0.0 and seed: 42 in the template, and test consistency with multiple runs.

Next Steps¶

You've completed all workshop labs! Now you can:

Review API Reference for detailed documentation
Check Troubleshooting Guide for common issues
Read the full research paper
Contribute improvements via GitHub

Lab 6 Complete! 🎉

You've completed the entire workshop! You can now measure drift, classify models, validate cross-provider consistency, and extend the framework for your use cases. Thank you for participating!