Lab 5: Cross-Provider Testing¶
Overview¶
In this lab, you'll validate output consistency between local (Ollama) and cloud (IBM watsonx.ai) deployments using the framework's CrossProviderValidator. This ensures your models produce reliable results regardless of deployment environment.
Duration: ~30 minutes
Learning Objectives¶
By the end of this lab, you will:
- Use the `CrossProviderValidator` from your framework
- Compare outputs between Ollama and watsonx.ai
- Understand GAAP materiality thresholds (±5%)
- Validate cross-provider consistency for compliance
- Make deployment decisions based on provider reliability
Prerequisites¶
- Completed Lab 4: Analyzing Drift Metrics
- At least two providers configured: Ollama + one cloud provider (watsonx.ai recommended)
- API keys in your `.env` file
Why Cross-Provider Validation Matters¶
Financial institutions often need:
- Migration between providers without changing behavior
- Redundancy with failover to backup providers
- Vendor independence to avoid lock-in
- Regulatory compliance requiring reproducibility across environments
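The failover requirement above can be sketched in a few lines. Everything here, the function, its parameters, and the validated-pair check, is hypothetical and not part of the framework; it only illustrates why failover should be gated on prior cross-provider validation:

```python
# Hypothetical failover sketch (not part of the framework): fall back to a
# backup provider only if cross-provider validation has shown the pair to be
# output-consistent.
def generate_with_failover(prompt, providers, validated_pairs):
    """providers: ordered list of (name, callable) pairs, primary first.
    validated_pairs: set of frozensets of provider names that passed
    cross-provider validation."""
    primary = providers[0][0]
    for name, call in providers:
        if name != primary and frozenset({primary, name}) not in validated_pairs:
            continue  # backup never validated against the primary; skip it
        try:
            return name, call(prompt)
        except Exception:
            continue  # provider unavailable; try the next one
    raise RuntimeError("All providers failed or are unvalidated")
```

The key design point: an unvalidated backup is treated as unavailable, so a failover can never silently route traffic to a provider whose outputs were never shown to match.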
The Risk
A model that works locally but behaves differently in production (cloud) creates audit trail inconsistencies and compliance violations.
Step 1: Review CrossProviderValidator Code¶
Open harness/cross_provider_validation.py to see how it works:
head -50 harness/cross_provider_validation.py
Key features (from the code):
- Normalized edit distance for text comparison
- ±5% tolerance (GAAP materiality threshold)
- Task-specific validation rules
- Audit trail generation
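To make the first feature concrete, here is a minimal sketch of normalized edit distance, a plain Levenshtein DP scaled by the longer string's length. The framework's actual implementation (and the `rapidfuzz` one used later in this lab) may differ in details:

```python
# Minimal sketch of normalized edit distance; not the framework's code.
def normalized_edit_distance(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

print(normalized_edit_distance("SELECT *", "SELECT *"))          # 0.0 (identical)
print(f"{normalized_edit_distance('balance', 'balanse'):.3f}")   # 0.143 (one substitution)
```

Similarity is then `1.0 - distance`, which is the quantity compared against the 95% threshold below.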
Step 2: Test Ollama vs watsonx.ai¶
Create test_cross_provider.py:
#!/usr/bin/env python3
"""
Cross-provider validation test: Ollama (local) vs watsonx.ai (cloud)
"""
import os
from openai import OpenAI
from ibm_watsonx_ai.foundation_models import ModelInference
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from dotenv import load_dotenv
load_dotenv()
# Test prompt (SQL generation)
prompt = "Generate SQL to find all customers with account balance > $100,000"
print("Cross-Provider Validation Test")
print("=" * 60)
print(f"Prompt: {prompt}\n")
# Provider 1: Ollama (local)
print("Provider 1: Ollama (qwen2.5:7b-instruct)")
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # placeholder; Ollama ignores the key
)
ollama_response = ollama_client.chat.completions.create(
    model="qwen2.5:7b-instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    seed=42
)
ollama_output = ollama_response.choices[0].message.content
print(f"Output: {ollama_output}\n")
# Provider 2: IBM watsonx.ai (cloud)
print("Provider 2: watsonx.ai (granite-3-8b-instruct)")
from ibm_watsonx_ai import Credentials

watsonx_model = ModelInference(
    model_id="ibm/granite-3-8b-instruct",
    credentials=Credentials(
        api_key=os.getenv("WATSONX_API_KEY"),
        url=os.getenv("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
    ),
    project_id=os.getenv("WATSONX_PROJECT_ID"),
)
watsonx_params = {
    GenParams.TEMPERATURE: 0.0,
    GenParams.MAX_NEW_TOKENS: 200,
    GenParams.RANDOM_SEED: 42,
}
watsonx_output = watsonx_model.generate_text(prompt=prompt, params=watsonx_params)
print(f"Output: {watsonx_output}\n")
# Compare outputs
print("=" * 60)
print("Comparison:")
print(f" Ollama length: {len(ollama_output)} chars")
print(f" watsonx length: {len(watsonx_output)} chars")
print(f" Exact match: {ollama_output == watsonx_output}")
# Calculate similarity (Levenshtein distance)
from rapidfuzz.distance import Levenshtein
distance = Levenshtein.normalized_distance(ollama_output, watsonx_output)
similarity = 1.0 - distance
print(f" Similarity: {similarity:.1%}")
if similarity >= 0.95:
    print("\n✅ Cross-provider validation PASSED (≥95% similarity)")
else:
    print(f"\n⚠️ Cross-provider drift detected: {similarity:.1%}")
Run it:
python test_cross_provider.py
Expected output (both Tier 1 models):
Cross-Provider Validation Test
============================================================
Prompt: Generate SQL to find all customers with account balance > $100,000

Provider 1: Ollama (qwen2.5:7b-instruct)
Output: SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000

Provider 2: watsonx.ai (granite-3-8b-instruct)
Output: SELECT customer_name, account_balance FROM accounts WHERE account_balance > 100000

============================================================
Comparison:
 Ollama length: 82 chars
 watsonx length: 82 chars
 Exact match: True
 Similarity: 100.0%

✅ Cross-provider validation PASSED (≥95% similarity)
Tier 1 Cross-Provider Consistency
Both Granite-3-8B (watsonx) and Qwen2.5-7B (Ollama) produce identical outputs, enabling seamless migration between local and cloud deployments.
Step 3: Use the Framework's CrossProviderValidator¶
Now use the built-in validator from harness/cross_provider_validation.py:
Create run_cross_provider_validation.py:
#!/usr/bin/env python3
"""
Use framework's CrossProviderValidator for automated testing.
Note: The validator works with pre-collected outputs. You first run
experiments on each provider, then validate the outputs for consistency.
"""
from harness.cross_provider_validation import CrossProviderValidator
# Initialize validator with GAAP materiality threshold
validator = CrossProviderValidator(
    providers=["ollama", "watsonx"],
    tolerance_pct=5.0  # ±5% from GAAP auditing standards
)
# Assume you've already collected outputs from each provider
# (e.g., from running run_evaluation.py with --providers ollama and --providers watsonx)
sql_outputs = {
    "ollama": "SELECT customer_id, name, balance FROM accounts WHERE balance > 100000;",
    "watsonx": "SELECT customer_id, name, balance FROM accounts WHERE balance > 100000;"
}
# Validate SQL outputs
result_sql = validator.validate(sql_outputs, task_type="sql")
print("\nSQL Generation Task")
print("=" * 60)
print(f"Consistent: {result_sql['consistent']}")
print(f"Similarity: {result_sql['similarity_scores']}")
print(f"Validation: {'PASS' if result_sql['consistent'] else 'FAIL'}")
# Validate RAG outputs
rag_outputs = {
    "ollama": "Citigroup reported net credit losses of $1.2B in 2023. [citi_2024_10k]",
    "watsonx": "Citigroup reported net credit losses of $1.2 billion in 2023. [citi_2024_10k]"
}
rag_citations = {
    "ollama": ["citi_2024_10k"],
    "watsonx": ["citi_2024_10k"]
}
result_rag = validator.validate(
    rag_outputs, task_type="rag", citations=rag_citations
)
print("\nRAG Task")
print("=" * 60)
print(f"Consistent: {result_rag['consistent']}")
print(f"Similarity: {result_rag['similarity_scores']}")
print(f"Validation: {'PASS' if result_rag['consistent'] else 'MINOR DRIFT'}")
# Audit trail
print("\nCross-Provider Audit Report")
print("=" * 60)
for record in result_sql['audit_trail']:
    print(f"Providers: {record['providers']}")
    print(f"Output hashes: {record['output_hashes']}")
Run it:
python run_cross_provider_validation.py
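The audit trail records an output hash per provider. One way such a hash could be produced, shown as a sketch (the framework's exact scheme lives in `harness/cross_provider_validation.py` and may differ):

```python
import hashlib

def output_hash(text: str) -> str:
    """Short, deterministic content hash of a provider output (sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

sql = "SELECT customer_id, name, balance FROM accounts WHERE balance > 100000;"
print(output_hash(sql))  # identical text always yields the identical hash
```

Matching hashes across providers prove byte-for-byte identical outputs in the audit record without storing the full outputs themselves.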
Step 4: GAAP Materiality Threshold (±5%)¶
The framework uses a ±5% tolerance based on GAAP auditing standards for financial statement materiality.
Example: Numeric comparison
def validate_numeric_tolerance(value1: float, value2: float, tolerance_pct: float = 5.0) -> bool:
    """Check if two values are within the GAAP materiality threshold."""
    if value1 == 0 and value2 == 0:
        return True
    if value1 == 0 or value2 == 0:
        return False
    diff_pct = abs(value1 - value2) / max(value1, value2) * 100
    return diff_pct <= tolerance_pct

# Test cases
print(validate_numeric_tolerance(2.4, 2.5, tolerance_pct=5.0))    # True (4.0% diff)
print(validate_numeric_tolerance(100, 110, tolerance_pct=5.0))    # False (9.1% diff)
print(validate_numeric_tolerance(1000, 1040, tolerance_pct=5.0))  # True (3.8% diff)
Why 5%?
- GAAP materiality standard for financial reporting
- Industry-accepted threshold for immaterial differences
- Balances strictness with practical variance
Step 5: Multi-Run Cross-Provider Test¶
Test consistency across multiple runs (n=5):
#!/usr/bin/env python3
"""
Multi-run cross-provider consistency test.

As noted in Step 3, the validator compares pre-collected outputs, so each
run first queries both providers, then validates the resulting pair.
"""
from harness.cross_provider_validation import CrossProviderValidator

validator = CrossProviderValidator(providers=["ollama", "watsonx"], tolerance_pct=5.0)

prompt = "Generate SQL to find all customers with account balance > $100,000"

results = []
for i in range(1, 6):
    # query_ollama/query_watsonx are placeholders for the provider calls
    # shown in Step 2 (temperature=0.0, seed=42 on both)
    outputs = {
        "ollama": query_ollama(prompt, model="qwen2.5:7b-instruct"),
        "watsonx": query_watsonx(prompt, model="ibm/granite-3-8b-instruct"),
    }
    result = validator.validate(outputs, task_type="sql")
    results.append(result['consistent'])
    print(f"Run {i}: {'✅ Consistent' if result['consistent'] else '❌ Inconsistent'}")

consistency_rate = sum(results) / len(results) * 100
print(f"\nOverall consistency: {consistency_rate:.0f}%")
Expected output:
Run 1: ✅ Consistent
Run 2: ✅ Consistent
Run 3: ✅ Consistent
Run 4: ✅ Consistent
Run 5: ✅ Consistent

Overall consistency: 100%
Step 6: Migration Decision Matrix¶
Based on cross-provider validation, decide whether migration is safe:
| Scenario | Ollama → watsonx | Validation | Safe to Migrate? |
|---|---|---|---|
| SQL (Tier 1 → Tier 1) | Qwen2.5-7B → Granite-3-8B | 100% match | ✅ Yes |
| RAG (Tier 1 → Tier 1) | Qwen2.5-7B → Granite-3-8B | ≥95% match | ✅ Yes |
| SQL (Tier 1 → Tier 2) | Qwen2.5-7B → Llama-3.3-70B | 100% match | ✅ Yes |
| RAG (Tier 1 → Tier 2) | Qwen2.5-7B → Llama-3.3-70B | <95% match | ⚠️ Monitor |
| Any (Tier 1 → Tier 3) | Qwen2.5-7B → GPT-OSS-120B | <50% match | ❌ No |
Migration safety check:
def is_migration_safe(source_tier: int, target_tier: int, task_type: str) -> bool:
    """Check if migration between providers is compliance-safe."""
    if source_tier == 1 and target_tier == 1:
        return True  # Always safe: Tier 1 → Tier 1
    if target_tier == 3:
        return False  # Never safe: Any → Tier 3
    if target_tier == 2 and task_type in ["sql", "summarize"]:
        return True  # Safe for structured tasks
    return False  # Requires validation

# Examples
print(is_migration_safe(1, 1, "rag"))  # True
print(is_migration_safe(1, 2, "sql"))  # True
print(is_migration_safe(1, 2, "rag"))  # False (requires validation)
print(is_migration_safe(1, 3, "sql"))  # False
Understanding Provider Differences¶
Even with identical model versions, providers may differ in:
- Infrastructure: GPU hardware, CUDA versions
- Quantization: Different precision (FP16, FP32, INT8)
- Batching: Request handling and parallelization
- Load balancing: Multiple model replicas
Tier 1 Advantage
Tier 1 models (7-20B) are small enough to fit on a single GPU consistently, reducing infrastructure-induced variance.
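A toy, stdlib-only illustration of the quantization point (not an actual model): accumulating the same values at reduced precision produces a measurably different result, which is the mechanism behind logit, and hence output, differences between differently-quantized deployments. FP32 vs FP64 stands in here for the INT8/FP16/FP32 splits listed above:

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (64-bit) to 32-bit precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Accumulate the same sum at full and reduced precision
values = [0.1] * 10_000
full = sum(values)  # 64-bit accumulation
reduced = 0.0
for v in values:
    reduced = to_fp32(reduced + to_fp32(v))  # 32-bit rounding at every step

print(f"64-bit sum: {full!r}")
print(f"32-bit sum: {reduced!r}")
print(f"Difference: {abs(full - reduced):.2e}")
```

When two token logits are close, an accumulation difference of this kind can flip the argmax and change the generated text, even with temperature 0.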
Troubleshooting¶
watsonx.ai Connection Issues¶
# Test watsonx.ai connectivity
import os
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

try:
    model = ModelInference(
        model_id="ibm/granite-3-8b-instruct",
        credentials=Credentials(
            api_key=os.getenv("WATSONX_API_KEY"),
            url=os.getenv("WATSONX_URL"),
        ),
        project_id=os.getenv("WATSONX_PROJECT_ID"),
    )
    print("✅ watsonx.ai connection successful")
except Exception as e:
    print(f"❌ watsonx.ai connection failed: {e}")
Similarity Below 95%¶
If cross-provider similarity is unexpectedly low:
- Check model versions: Ensure the same base model on both providers
- Verify temperature: Must be exactly 0.0
- Use explicit seeds: Set `seed=42` for both providers
- Inspect raw outputs: Look for formatting differences
# Debug output differences
print("Ollama output:", repr(ollama_output))
print("watsonx output:", repr(watsonx_output))
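When similarity is high but below 100%, a unified diff pinpoints the exact formatting difference faster than eyeballing `repr`. A stdlib sketch with made-up near-identical outputs (here, a trailing semicolon):

```python
import difflib

# Made-up example outputs: watsonx adds a trailing semicolon
ollama_output = "SELECT customer_name, account_balance\nFROM accounts\nWHERE account_balance > 100000"
watsonx_output = "SELECT customer_name, account_balance\nFROM accounts\nWHERE account_balance > 100000;"

diff = difflib.unified_diff(
    ollama_output.splitlines(),
    watsonx_output.splitlines(),
    fromfile="ollama",
    tofile="watsonx",
    lineterm="",
)
print("\n".join(diff))  # the -/+ lines isolate the semicolon difference
```

Cosmetic differences like this are exactly what the ±5% tolerance is designed to absorb; a diff tells you whether the drift is cosmetic or semantic.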
Key Takeaways¶
- Cross-provider validation ensures migration safety
- Tier 1 models (7-20B) achieve perfect cross-provider consistency
- GAAP materiality (±5%) provides finance-calibrated tolerance
- The framework's `CrossProviderValidator` automates testing
- Audit trails document cross-provider equivalence
Quiz: Test Your Understanding¶
What is the GAAP materiality threshold used in cross-provider validation?
Answer: ±5%, based on GAAP auditing standards for financial statement materiality.
Why do Tier 1 models show better cross-provider consistency?
Answer: They're small enough (7-20B params) to fit on a single GPU, reducing infrastructure-induced variance from distributed processing.
When is migration from Tier 1 to Tier 2 safe?
Answer: Only for structured tasks (SQL, summarization). RAG tasks require explicit validation due to Tier 2's lower RAG consistency.
Next Steps¶
Now that you understand cross-provider validation:
- Proceed to Lab 6: Extending the Framework to add custom tasks
- Test your own provider combinations
- Review `harness/cross_provider_validation.py` for implementation details
Lab 5 Complete!
You can now validate cross-provider consistency and make migration decisions with confidence! Ready to customize the framework? Move on to Lab 6: Extending the Framework!