Journey Log | IBM Client Engineering

Log 35 🛫

June 6, 2024 · One min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Further configuration of watsonx Assistant

Summary

Configuring NeuralSeek
- Error occurring during testing: "Error, please refresh the page and try again"
  - Issue found: an old API key was being used in the configuration due to unexpected API key retention in the configuration output for the customer
Issue has been resolved, continuing configuration of NeuralSeek

Decisions and Action Items (DAI)

Investigation of cluster proxy configuration (post POC)
Provide a demonstration of ServiceNow/Outlook integration via Orchestrate and Assistant

Lessons Learned

In NeuralSeek, the API key is retained in configuration exports. The latest configuration had an old key
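
One way to guard against stale keys travelling with an export is to redact key-like fields before a configuration is shared or re-imported. The following is a minimal sketch, assuming the export can be loaded as nested dictionaries and lists; the field names and the `exported` structure are hypothetical and do not reflect the actual NeuralSeek export format.

```python
# Assumed naming patterns for sensitive fields; adjust to the real export.
SENSITIVE_MARKERS = ("apikey", "api_key", "token", "secret")


def redact_secrets(config, placeholder="<REDACTED>"):
    """Return a copy of config with values of sensitive-looking keys replaced."""
    if isinstance(config, dict):
        return {
            key: placeholder
            if any(marker in str(key).lower() for marker in SENSITIVE_MARKERS)
            else redact_secrets(value, placeholder)
            for key, value in config.items()
        }
    if isinstance(config, list):
        return [redact_secrets(item, placeholder) for item in config]
    return config


# Hypothetical export structure, for illustration only.
exported = {
    "instance": {"name": "demo", "apiKey": "old-key-123"},
    "models": [{"api_key": "abc"}],
}
cleaned = redact_secrets(exported)
print(cleaned)
```

After re-importing a redacted configuration, the current key would have to be entered by hand, which makes a stale key harder to miss.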

Next Steps

Application verification

Log 34 🛫

June 5, 2024 · One min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Further configuration of watsonx Assistant

Summary

Pivoting from external integrations
- Unable to configure proxy to reach out to external sites from OCP
Continuing configuration of assistant actions and non-external connections
- Assistant configured
Continuing configuration of NeuralSeek

Decisions and Action Items (DAI)

Investigation of cluster proxy configuration (post POC)
Provide a demonstration of ServiceNow/Outlook integration via Orchestrate and Assistant

Lessons Learned

Proxy configuration for cluster required for application access to external sources - To be investigated post-POC

Next Steps

Application verification

Log 33 🛫

May 31, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Summary

Brief watsonx.ai Prompt Lab demonstration
Reconfiguring watsonx Assistant front end (per customer request, to be worked on)
Continuing proxy investigation for the cluster
- Applying a patch to set environment variables for the pods via an RSI patch (per IBM documentation)

Decisions and Action Items (DAI)

ServiceNow connectivity being investigated
- Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com
watsonx Assistant front-end enhancements with customer
- Chat greeting enhancements

Lessons Learned

PDF files needed for watsonx Assistant extensions must be available to the cluster internally (no external access or configurable access to S3 buckets)
- Workaround: Hosted PDF files on the bastion httpd server (originally used for the OCP ignition files)

Next Steps

Application configuration
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 32 🛫

May 30, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Attempting proxy reconfiguration for the cluster

# Customer data redacted
kubectl get pods -n cpd -o json | jq -r '.items[].metadata.labels["app.kubernetes.io/instance"]' | sort | uniq
cpd-cli manage enable-rsi --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
cat << EOF >> cpfs-proxy-envs.json
[ {
    "name":  "HTTP_PROXY",
    "value": "http://http.proxy.customer.com:8000/"
}, {
    "name":  "http_proxy",
    "value": "http://http.proxy.customer.com:8000/"
}, {
    "name":  "HTTPS_PROXY",
    "value": "http://http.proxy.customer.com:8000/"
}, {
    "name":  "https_proxy",
    "value": "http://http.proxy.customer.com:8000/"
}, {
    "name":  "NO_PROXY",
    "value": ".aws-nonprod.xxx.com,.ibm-wxai.aws-nonprod.customer.com.apps.ibm-wxai.aws-nonprod.xxx.com,172.30.0.0/16,10.128.0.0/14,10.19.170.0/25"
}, {
    "name":  "no_proxy",
    "value": ".aws-nonprod.xxx.com,.ibm-wxai.aws-nonprod.customer.com.apps.ibm-wxai.aws-nonprod.xxx.com,172.30.0.0/16,10.128.0.0/14,10.19.170.0/25"
} ]
EOF
cpd-cli manage create-rsi-patch \
    --cpd_instance_ns=cpd \
    --patch_name="cpfs-proxy" \
    --description="add proxy settings to CPD" \
    --patch_type=rsi_pod_env_var \
    --patch_spec=/tmp/work/rsi/cpfs-proxy-envs.json \
    --spec_format=set-env \
    --select_all_pods=true \
    --include_labels=wa \
    --exclude_labels=app:rsi \
    --skip_apply=false \
    --state=active
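
The patch spec above repeats each proxy value in upper and lower case, which is easy to get out of sync when editing by hand. As a minimal sketch, the same list could be generated from a single proxy URL and a list of no-proxy entries; the function name and the example hosts below are placeholders, not values from the customer environment.

```python
import json


def build_proxy_env(proxy_url, no_proxy_entries):
    """Build the name/value list in upper- and lower-case variants."""
    no_proxy = ",".join(no_proxy_entries)
    values = {
        "HTTP_PROXY": proxy_url,
        "HTTPS_PROXY": proxy_url,
        "NO_PROXY": no_proxy,
    }
    entries = []
    for name, value in values.items():
        entries.append({"name": name, "value": value})
        entries.append({"name": name.lower(), "value": value})
    return entries


# Placeholder values for illustration only.
env_spec = build_proxy_env(
    "http://proxy.example.com:8000/",
    [".example.com", "172.30.0.0/16", "10.128.0.0/14"],
)
spec_json = json.dumps(env_spec, indent=4)
print(spec_json)
```

Writing `spec_json` to the file passed as `--patch_spec` would replace the heredoc step; generating it this way also avoids the smart-quote problem that copy and paste from a document can introduce.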

New configuration requires watsonx Assistant pod reboot, waiting until after further application configurations
Continuing to configure NeuralSeek assistant

Decisions and Action Items (DAI)

ServiceNow connectivity being investigated
- Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com

Lessons Learned

PDF files needed for watsonx Assistant extensions must be available to the cluster internally (no external access or configurable access to S3 buckets)
- Workaround: Hosted PDF files on the bastion httpd server (originally used for the OCP ignition files)

Next Steps

Application configuration
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 31 🛫

May 29, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Successful configuration of watsonx Assistant with PDF lookups
Successful validation of NeuralSeek source attribution

Summary

TS016344977 opened with IBM support to investigate Watson Discovery EDB cluster issue
- Watson Discovery has an EDB cluster with only 1/2 pods running; one pod is stuck in CrashLoopBackOff
- Troubleshooting with support
- Support identified an issue with the Postgres database
  - Postgres database not running, putting the cluster in an unhealthy state
    {"level":"info","ts":"2024-05-29T19:39:06Z","logger":"pg_rewind","msg":"pg_rewind: connected to server","pipe":"stderr","logging_pod":"wd-discovery-cn-postgres-1"}
    {"level":"info","ts":"2024-05-29T19:39:06Z","logger":"pg_rewind","msg":"pg_rewind: fatal: target server must be shut down cleanly","pipe":"stderr","logging_pod":"wd-discovery-cn-postgres-1"}
  - Engaging Postgres support
Continuing watsonx Assistant configuration
- Hosting response PDFs on the bastion httpd server
- watsonx Assistant with PDF lookup configured
Continuing NeuralSeek configuration
- Test questions verified
Retrying ServiceNow Skills
- Configuring proxy in the skill yaml/json
- Same timeout error when trying skill, continuing investigation
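
The Postgres pod logs quoted above are one JSON object per line, and the fatal `pg_rewind` message is logged at level "info", so filtering on the level alone would miss it. The following is a minimal sketch that scans the message text instead; the sample lines are a cleaned-up copy of the two entries above, and the function name is illustrative.

```python
import json


def find_fatal_messages(log_lines):
    """Return (timestamp, logger, message) for entries whose message mentions 'fatal'."""
    fatal = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue
        message = str(record.get("msg", ""))
        if "fatal" in message.lower():
            fatal.append((record.get("ts"), record.get("logger"), message))
    return fatal


sample = [
    '{"level":"info","ts":"2024-05-29T19:39:06Z","logger":"pg_rewind",'
    '"msg":"pg_rewind: connected to server","pipe":"stderr"}',
    '{"level":"info","ts":"2024-05-29T19:39:06Z","logger":"pg_rewind",'
    '"msg":"pg_rewind: fatal: target server must be shut down cleanly","pipe":"stderr"}',
]
fatal_entries = find_fatal_messages(sample)
for ts, logger, message in fatal_entries:
    print(ts, logger, message)
```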

Decisions and Action Items (DAI)

ServiceNow connectivity being investigated
- Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com

Lessons Learned

PDF files needed for watsonx Assistant extensions must be available to the cluster internally (no external access or configurable access to S3 buckets)
- Workaround: Hosted PDF files on the bastion httpd server (originally used for the OCP ignition files)

Next Steps

Application configuration
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 30 🛫

May 28, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Testing NeuralSeek functionality
NeuralSeek verified operational
Configuring watsonx Assistant ServiceNow extension
- Unable to access ServiceNow, getting blocked outbound
- Investigating AWS and Cluster network connectivity
  - Curl requests working from bastion, not from application or cluster
- Issue found: Customer proxy needs to be used in all API requests (even though cluster is configured to use proxy)
  - ~~Proposed workaround: Add proxy to OpenAPI spec for watsonx Assistant and Orchestrate~~
- Fix: Configure cluster environments to use customer proxy
```
oc set env deployment --all http_proxy=http://my_proxy:my_port
oc set env deployment --all https_proxy=http://my_proxy:my_port
```
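
After setting the variables, it can help to confirm which deployments actually carry them. The following is a minimal sketch, assuming the input is the parsed output of `oc get deployment -o json` for the namespace; the function name and the sample data are illustrative only.

```python
def deployments_missing_proxy(deployment_list, required=("http_proxy", "https_proxy")):
    """Map deployment name -> {container name: [missing variable names]}."""
    missing = {}
    for item in deployment_list.get("items", []):
        name = item["metadata"]["name"]
        containers = item["spec"]["template"]["spec"].get("containers", [])
        for container in containers:
            env_names = {entry["name"] for entry in container.get("env", [])}
            absent = [var for var in required if var not in env_names]
            if absent:
                missing.setdefault(name, {})[container["name"]] = absent
    return missing


# Hypothetical deployments, for illustration only.
sample = {
    "items": [
        {
            "metadata": {"name": "app-a"},
            "spec": {"template": {"spec": {"containers": [
                {"name": "main", "env": [
                    {"name": "http_proxy", "value": "http://my_proxy:my_port"},
                    {"name": "https_proxy", "value": "http://my_proxy:my_port"},
                ]},
            ]}}},
        },
        {
            "metadata": {"name": "app-b"},
            "spec": {"template": {"spec": {"containers": [
                {"name": "main", "env": [{"name": "http_proxy", "value": "x"}]},
            ]}}},
        },
    ]
}
report = deployments_missing_proxy(sample)
print(report)
```

The check only looks at variables set directly on the container spec; variables injected through `envFrom` or by an admission webhook would not be seen by this sketch.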

Decisions and Action Items (DAI)

ServiceNow connectivity being investigated
- Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com

Lessons Learned

If a proxy is required, the proxy configuration needs to be applied to the deployment environments for cluster applications to use the proxy. Initially the proxy was only applied to the cluster itself, not the deployment environments.

Next Steps

Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 29 🛫

May 22, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Continuing to investigate NeuralSeek configuration / CP4D
- Support team reviewing CP4D install and connectivity
- Potential issues found on customer network blocking API connectivity
  Run a port-forward to the service on port 8888:
```
oc port-forward service/ibm-nginx-svc 8888:443
```
  From a new terminal on the same node:
```
ss -tlnp
```
  Verified 8888 listening on localhost
```
# Custom support js
curl -k -LO https://127.0.0.1:8888/common-nav/api/nls/login-nls.js
```
  Output shows the pod receiving the required files. Something in between the pods is blocking authenticated traffic
- Customer to engage network team to assess / log network traffic via AWS
Issue resolved
- There were erroneous EC2 instances in the AWS load balancer from previous configurations that interfered with cluster communication
- Issue was intermittent and only appeared when the load balancer sent network traffic to an incorrect (non-existent) destination
Ready to continue application verification
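
The `ss -tlnp` check from the port-forward step can also be scripted, which is convenient when it has to be repeated while the port-forward is restarted. The following is a minimal sketch using only the standard library; the function name is illustrative, and 8888 is the local port used in the port-forward above.

```python
import socket


def is_port_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


print(is_port_listening(8888))
```

A successful connection only shows that something accepts TCP connections on the port; it does not show that the service behind it answers correctly, so the `curl` check is still needed.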

Decisions and Action Items (DAI)

None pending

Lessons Learned

Clean up previous networking configurations (for example, stale EC2 instances registered in the AWS load balancer) before redeploying
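
A comparison like the following could surface leftover load balancer targets before they cause intermittent failures. This is a minimal sketch, assuming the two lists of instance IDs have already been collected (for example from the AWS console or CLI, and from the cluster's node list); the IDs shown are made up.

```python
def find_stale_targets(registered_targets, cluster_instance_ids):
    """Return instance IDs registered in the load balancer that are not cluster nodes."""
    return sorted(set(registered_targets) - set(cluster_instance_ids))


# Hypothetical instance IDs, for illustration only.
registered = ["i-0aaa", "i-0bbb", "i-0old1", "i-0old2"]
current_nodes = ["i-0aaa", "i-0bbb", "i-0ccc"]
stale = find_stale_targets(registered, current_nodes)
print(stale)
```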

Next Steps

Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 28 🛫

May 21, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Continuing to investigate NeuralSeek configuration / CP4D
- Reset all pods that are used in the CP4D login flow using the following commands:
```
oc delete pod -n cpd -l component=ibm-nginx
```
```
oc delete pod -n cpd -l component=usermgmt
```
```
oc delete pod -n cpd -l component=zen-core
```
```
oc delete pod -n cpd -l component=zen-core-api
```
- Confirmed metastore-db is also healthy using this command:
```
oc get cluster.postgresql.k8s.enterprisedb.io zen-metastore-edb -n cpd
```
- Still unable to log in and browse CP4D consistently
- Errors found in usermgmt pods that appear after failed page loads in CP4D
- Successful use of APIs to query CP4D for a Bearer Token, to list project IDs, and even to generate text from watsonx
- NeuralSeek was able to query watsonx; however, it couldn't interact with Watson Discovery.
- Support ticket was opened previously, updated this morning to include screenshots of the error logs found in usermgmt pods. Ticket has been upgraded to Sev1
- Pods verified
- Engaging support
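
The API check mentioned above (requesting a Bearer Token) can be sketched as follows. This assumes the Cloud Pak for Data platform endpoint `/icp4d-api/v1/authorize` with a username and password body and a `token` field in the response; the host name and credentials are placeholders, and the exact request used during the session is not recorded in this log. The sketch only builds the request and parses a response body; it does not send anything.

```python
import json
import urllib.request


def build_authorize_request(base_url, username, password):
    """Build (but do not send) a POST request for a CP4D bearer token."""
    url = base_url.rstrip("/") + "/icp4d-api/v1/authorize"
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_body):
    """Pull the bearer token out of the JSON response body."""
    return json.loads(response_body)["token"]


# Placeholder host and credentials, for illustration only.
request = build_authorize_request("https://cpd.example.com/", "admin", "not-a-real-password")
print(request.get_method(), request.full_url)
```

Sending the request would be done with `urllib.request.urlopen(request)`; a cluster with a self-signed certificate would additionally need an appropriate SSL context.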

Decisions and Action Items (DAI)

None pending

Lessons Learned

None today

Next Steps

Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 27 🛫

May 20, 2024 · One min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Investigating NeuralSeek configuration
- Curl request from application and bastion failing
- Nodes missing machineconfigs, waiting for machineconfigs to propagate through all nodes
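
Waiting for machineconfigs to roll out can be tracked from the machine config pool status. The following is a minimal sketch, assuming the input is the parsed output of `oc get machineconfigpool -o json`; it compares `machineCount` with `updatedMachineCount` for each pool. The sample data is made up.

```python
def pools_still_updating(mcp_list):
    """Return (pool name, updated, total) for pools that are not fully updated."""
    pending = []
    for pool in mcp_list.get("items", []):
        status = pool.get("status", {})
        total = status.get("machineCount", 0)
        updated = status.get("updatedMachineCount", 0)
        if updated < total:
            pending.append((pool["metadata"]["name"], updated, total))
    return pending


# Hypothetical pool status, for illustration only.
sample = {
    "items": [
        {"metadata": {"name": "master"}, "status": {"machineCount": 3, "updatedMachineCount": 3}},
        {"metadata": {"name": "worker"}, "status": {"machineCount": 6, "updatedMachineCount": 4}},
    ]
}
pending_pools = pools_still_updating(sample)
print(pending_pools)
```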

Decisions and Action Items (DAI)

None pending

Lessons Learned

None today

Next Steps

Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 26 🛫

May 16, 2024 · One min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Installed NeuralSeek
Installed watsonx Orchestrate

Summary

Installing NeuralSeek
S3 Bucket Configured

Installing watsonx Orchestrate

Removing/recreating MongoDB pod

Applied required patch from 'known issues' list (IBM internal documentation)

INSTANCE_NAME=$(oc -n <cpd-namespace> get wa --output jsonpath='{.items[0].metadata.name}')

oc patch wa ${INSTANCE_NAME} --type='merge' -p='{"configOverrides":{"store":{"extra_vars": {"store": {"MAX_NEW_IA_ASSISTANTS":"1000","MAX_NEW_IA_SKILLS":"300000","ASSISTANT_MAX_PAGE_LIMIT":"1000"}}}}}'
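
The nested JSON in the merge patch above is easy to break with a misplaced brace or quote. As a minimal sketch, the same payload can be generated from a flat dictionary of overrides; the function name is illustrative and the instance name in the printed command is a placeholder.

```python
import json
import shlex


def build_store_patch(overrides):
    """Wrap store overrides in the nested structure used by the patch above."""
    patch = {"configOverrides": {"store": {"extra_vars": {"store": overrides}}}}
    return json.dumps(patch, separators=(",", ":"))


patch_json = build_store_patch({
    "MAX_NEW_IA_ASSISTANTS": "1000",
    "MAX_NEW_IA_SKILLS": "300000",
    "ASSISTANT_MAX_PAGE_LIMIT": "1000",
})
# INSTANCE_NAME is a placeholder; substitute the value looked up with `oc get wa`.
command = ["oc", "patch", "wa", "INSTANCE_NAME", "--type=merge", "-p", patch_json]
print(shlex.join(command))
```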

Decisions and Action Items (DAI)

None pending

Lessons Learned

None today

Next Steps

Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 25 🛫

May 15, 2024 · One min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

AppConnect installed

Summary

Installing AppConnect
Installing watsonx Orchestrate (includes Assistant instance)
- Customer continuing install after session

Decisions and Action Items (DAI)

None pending

Lessons Learned

None today

Next Steps

Deploy watsonx.ai
Install NeuralSeek
Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 24 🛫

May 14, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Successful installation of Llama model

Summary

Installing and troubleshooting remaining LLMs for watsonx.ai
- Granite model installed (previous session)
- Issues with installation of Mistral and Llama models
  - Investigating Mistral model compatibility
    - Determined incompatible with p4d.24xlarge a100-40gb
  - Investigating workaround for instance type compatibility for Llama model p4d.24xlarge a100-40gb in US-East-2

Llama installed and running using workaround (Workaround source https://github.ibm.com/NGP-TWC/ml-planning/issues/37189)

#Add the following to watsonxaiifm-cr:
meta_llama_llama_2_70b_chat:
  deployment_yaml_name: llama-2-70b-chat.yaml.j2
  pvc_name: meta-llama-llama-2-70b-chat-pvc
  svc_name: llama-2-70b-chat
  pvc_size: 150Gi
  dir_name: models--meta-llama--llama-2-70b-chat
  model_name: /watsonxaiifm-models/models--meta-llama--llama-2-70b-chat
  cuda_visible_devices: 0,1,2,3,4,5,6,7
  model_root_dir: /watsonxaiifm-models
  flash_attention: true
  deployment_framework: hf_custom_tp
  dtype_str: float16
  max_batch_size: 128
  max_concurrent_requests: 160
  max_batch_weight: 200000
  max_sequence_length: 4096
  max_prefill_weight: 60000
  max_new_tokens: 4096
  num_shards: 8
  hf_modules_cache: /tmp/huggingface/modules
  force_apply: no
meta_llama_llama_2_70b_chat_resources:
  limits:
    cpu: "8"
    memory: 246Gi
    nvidia.com/gpu: "8"
    ephemeral-storage: 1Gi
  requests:
    cpu: "1"
    memory: 240Gi
    nvidia.com/gpu: "8"
    ephemeral-storage: 10Mi

#Add the following to watsonxai-cr:
tuning_disabled: true
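
A rough back-of-the-envelope check shows why sharding across all eight GPUs makes the 70B model fit on 40 GB cards. The sketch below assumes roughly 70 billion parameters at 2 bytes each (float16, as set in `dtype_str`) and counts weights only; it ignores KV cache, activations, and framework overhead, so real usage is higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70e9       # assumption: roughly 70 billion parameters
BYTES_PER_PARAM = 2     # float16
NUM_SHARDS = 8          # num_shards from the configuration above

total_gb = weight_memory_gb(NUM_PARAMS, BYTES_PER_PARAM)
per_gpu_gb = total_gb / NUM_SHARDS
total_gib = NUM_PARAMS * BYTES_PER_PARAM / 2**30

print(f"weights total: {total_gb:.1f} GB ({total_gib:.1f} GiB)")
print(f"weights per GPU with {NUM_SHARDS} shards: {per_gpu_gb:.1f} GB")
```

Under these assumptions the weights come to about 140 GB in total, or about 17.5 GB per GPU, which leaves headroom on a 40 GB A100. The total also stays below the 150Gi `pvc_size`.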

Decisions and Action Items (DAI)

Mistral installation
- Incompatible with a100-40gb

Lessons Learned

LLM compatibility with provisioned resources
- Most IBM LLMs require a100-80gb

Next Steps

Deploy watsonx.ai
Install NeuralSeek
Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 23 🛫

May 13, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

GPU node provisioned
Deployment of Granite model

Summary

Provisioning GPU node
- Installing GPU operator
Installing watsonx.ai operator
- Waiting on the installer
Facing issues with Mistral IBM models for watsonx.ai - support case open

Decisions and Action Items (DAI)

MCG Secrets created for Cloud Pak components
Authorized Instance Topology
Installed Cloud Pak shared components
Installed Knative
GPU Node Activity and Billing

Lessons Learned

GPU Node Activity
- AWS was charging for a GPU Node while powered-down
- Provisioned the GPU Node using a reserve instance (30 days starting May 6)

Next Steps

Deploy watsonx.ai
Install NeuralSeek
Application configuration
- NeuralSeek
- watsonx.ai Prompt Lab
- watsonx Assistant
- watsonx Orchestrate
  - ServiceNow skills
  - Microsoft Outlook skills

Log 22 🛫

May 10, 2024 · One min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- Complete
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Successful installation of watsonx Discovery
Successful installation of Watson Studio
Successful installation of Watson Machine Learning
Successful installation of IBM Knowledge Catalog

Summary

Troubleshooting watsonx Assistant installation pods
- Starting required pods/containers

Decisions and Action Items (DAI)

None today

Lessons Learned

None today

Next Steps

Deploy watsonx.ai
Application configurations
Application validations

Log 21 🛫

May 9, 2024 · 3 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Today's Accomplishments

Successful deployment of CP4D
Installed Knative

Summary

Customer has approved required contracts and procedures have been followed to attain an entitlement key
Updated bootnode IP reference in configuration
Re-ran install scripts
MCG Secrets created for Cloud Pak components
Verified cluster is up (v4.12.8)
- No errors
- Checked storage size on nodes: they have 5 TB disks instead of the intended 500 GB
  - This was set incorrectly in the config; reconfiguring worker nodes with the proper size (worker-template.yaml)
- Don't need secondary disks on the nodes, NFS will be used instead
Adding GPU node
- Updating config.sh, gpu_subnet accurate, security groups set properly
- Logging into AWS via aws_sso
- Running add_node.sh to add gpu node (runs for about 10 min)
- Verifying node draining and uncordon node
Installing nfs provisioner
- Operator install unsuccessful
  - Fallback to using helm install
- Downloading helm repo and install
- Tested and verified with test pod that attached to nfs-client storageclass, applying clean up

Decisions and Action Items (DAI)

Authorized Instance Topology

Lessons Learned

Had an issue with "apply-cluster-components", which requires connecting to GitHub to download CASE files. Found a solution in the cpd-cli documentation: use two additional flags on the command, "--case_download=true" and "--from_oci=true", which tell cpd-cli to download the CASE files from the IBM Open Container Initiative (OCI) registry instead of GitHub.
While running "setup-instance-topology" for Knative, received an error regarding storage. Added "--block_storage_class=${STG_CLASS_BLOCK}" to the command and it completed successfully.

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry that might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 20 🛫

May 8, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Customer has approved required contracts and procedures have been followed to attain an entitlement key.

Decisions and Action Items (DAI)

Customer has worked with us to spin up a new cluster
- Previous cluster had been deleted to save AWS credits
- IBM to provide tighter instruction for the deployment of CP4D
Customer received a GPU reservation
- GPU node has been ordered and deployed
- Costs associated with GPU resources are discounted, but the meter is running once the reservation is accepted.

Lessons Learned

watsonx.ai service requires larger local disks on worker nodes (500 GB)
The GPU node required for watsonx.ai seems to be a limited resource
Had to replace the nodes in the cluster as the attached disks were incorrect
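
A check like the following could flag wrongly sized node disks before the Cloud Pak install starts. This is a minimal sketch, assuming the input is the parsed output of `oc get nodes -o json` and that the disk in question backs the node's `ephemeral-storage` capacity; the quantity parser only handles plain integers and the binary suffixes Ki, Mi, Gi, and Ti. The node names and sizes are made up.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity):
    """Convert a quantity such as '524275692Ki' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def nodes_with_unexpected_disk(node_list, expected_gib, tolerance=0.1):
    """Map node name -> size in GiB for nodes outside the expected size range."""
    flagged = {}
    for node in node_list.get("items", []):
        capacity = node["status"]["capacity"].get("ephemeral-storage")
        if capacity is None:
            continue
        size_gib = quantity_to_bytes(capacity) / 2**30
        if abs(size_gib - expected_gib) > expected_gib * tolerance:
            flagged[node["metadata"]["name"]] = round(size_gib)
    return flagged


# Hypothetical nodes, for illustration only.
sample = {
    "items": [
        {"metadata": {"name": "worker-0"},
         "status": {"capacity": {"ephemeral-storage": "524275692Ki"}}},
        {"metadata": {"name": "worker-1"},
         "status": {"capacity": {"ephemeral-storage": "5368709120Ki"}}},
    ]
}
flagged_nodes = nodes_with_unexpected_disk(sample, expected_gib=500)
print(flagged_nodes)
```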

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry that might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 19 🛫

April 12, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
- In Progress

Summary

Awaiting entitlement key approval on customer side

Decisions and Action Items (DAI)

Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
- Customer to provide by EOD Monday
Worker nodes shutdown until approval comes through
Drafted and sent instructions for the customer to resize the worker node disks for when the cluster is brought back online
Drafted and sent instructions for the customer to order a GPU Node
- GPU node to be added to the cluster and then cordoned, drained, and shutdown

Lessons Learned

Preparation for Cloud Pak for Data on OpenShift sizing needed to be adjusted to reflect an under-provisioning of CPU resources
watsonx.ai service requires larger local disks on worker nodes (500 GB)
The GPU node required for watsonx.ai seems to be a limited resource

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry that might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
- Resize local disks for worker nodes
- Customer to order a GPU node and attach it to the cluster
Deploy watsonx.ai
Application configurations
Application validations

Log 18 🛫

April 5, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

CP4D Final Preparations
- Added options to the CPD VARS file
- Recreation of work dir

Summary

Troubleshooting the CP4D CLI
Awaiting entitlement key approval

Decisions and Action Items (DAI)

Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
- Customer to provide by EOD Monday

Lessons Learned

Preparation for Cloud Pak for Data on OpenShift sizing needed to be adjusted to reflect an under-provisioning of CPU resources

Objective

Start building guided workflows
Attempt to improve parsing of unstructured tables with WDU (Watson Document Understanding)

Milestones

Designed some guided workflow concept ideas
Coded Flask app to expose an API to send emails to users from watsonx Assistant
Table parsing with WDU successfully configured

Next Steps

Integrate agent based workflows into guided workflows (langchain agents)
Investigate whether it's possible to improve table parsing
License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry that might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 17 🛫

April 4, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Kubelet configuration update applied
Deployed instance of Multicloud Object Gateway

Summary

Reinstalled NFS provisioner (Helm)
Installed OpenShift Data Foundation Operator
Deployed MultiCloud Object Gateway
OpenShift portal is active and cluster appears healthy
Configuring CP4D CLI
Waiting for final cluster nodes to update through the machine config pool

Decisions and Action Items (DAI)

Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
- Customer to escalate internally

Lessons Learned

Preparation for Cloud Pak for Data on OpenShift sizing needed to be adjusted to reflect an under-provisioning of CPU resources
- This was resolved by..

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 16 🛫

April 3, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Summary

Master nodes attempting to upgrade and are in a stuck state preventing rollback of ingress changes
Attempting to deploy a new cluster
New cluster successfully deployed via latest script
Deploying EFS
Installing OpenShift Data Foundation & Standalone Multicloud Object Gateway

Decisions and Action Items (DAI)

Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
- Customer to escalate internally

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 15 🛫

April 2, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Verifying cluster health in preparation for Cloud Pak for Data install
Verifying network connectivity to application pods

Summary

Attempting to resolve domain name of the OpenShift portal
- Added required cluster-wide settings to proxy
  - Added wxai domain information to the noProxy configuration via oc edit proxy/cluster
  - Investigating OpenShift error: certificate is valid for oauth-openshift.openshift-authentication.svc
  - Cluster pieces updating, validating health of cluster on later call (may be related to current certificate errors)
- More nodes in "ready" state
- 1 master node continuing to update causing cluster connectivity issues
- Attempting to drain and restart pending nodes

Decisions and Action Items (DAI)

Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
- Customer to escalate internally

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 14 🛫

April 1, 2024 · 2 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- Complete
Install Cloud Pak for Data
- In Progress
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Installed Cloud Pak for Data CLI on bastion host

Summary

Installed Cloud Pak for Data CLI on bastion host
Attempting to resolve console URL from customer host (laptop), external to cluster and bastion
- Customer is unable to add entries to Windows host file due to local administrator requirements
- Adding hostname resolution for CP4D
Customer doing offline
- Adding resolvable URL for CP4D, allows for proper CP4D application communication (see Action Item)
  - Customer process - sending 'get' requests requires customer security approval process
- Customer to follow Multicloud Object Gateway instructions
- Customer to follow internal process for software trial evaluation, supporting documentation sent

Decisions and Action Items (DAI)

Software evaluation licenses for CP4D and watsonx.ai
Customer decision is required to determine cluster console access

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai
Application configurations
Application validations

Log 13 🛫

March 29, 2024 · One min read

Objective

Evaluate Watsonx Discovery vs Chromadb for vector store
Start ideating & testing sample complete workflows

Milestones

Successfully configured working Wx Discovery Vector Store with chunking, embeddings and ELSER semantic search
Connected this KB in WXD to Watsonx Assistant
Developed method to parse & detect RAG responses that give step by step instructions, which can be used to create guided workflows
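
A minimal sketch of how step-by-step RAG responses might be detected, assuming the steps appear as numbered lines such as "1.", "2)" or "Step 3:". The function name extract_steps, the pattern, and the two-step threshold are illustrative assumptions, not the method used in this engagement.

```python
import re

# Assumed step formats: "1. text", "2) text", "Step 3: text", "Step 4 - text".
STEP_PATTERN = re.compile(r"^\s*(?:step\s+)?(\d+)\s*[.):\-]\s+(.+)$", re.IGNORECASE)


def extract_steps(response: str) -> list[str]:
    """Return the ordered step texts if the response looks like numbered instructions."""
    steps = []
    expected = 1
    for line in response.splitlines():
        match = STEP_PATTERN.match(line)
        if not match:
            continue
        number, text = int(match.group(1)), match.group(2).strip()
        # Only accept steps that continue the sequence 1, 2, 3, ...
        if number == expected:
            steps.append(text)
            expected += 1
    # Fewer than two consecutive steps is treated as ordinary prose (assumed threshold).
    return steps if len(steps) >= 2 else []
```

A non-empty result could then seed a guided workflow, with one conversation turn per extracted step.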

Lessons learned

Wx Discovery seems to have more accurate chunk retrieval than Chromadb
Wx Discovery out-of-the box chatbot answers are also better
Config is simple and can be done in an hour with prewritten code by just passing in credentials

Next Steps

Continue building guided conversations & enable them to conduct operations autonomously

Log 12 🛫

March 28, 2024 · 3 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- In progress
Install Cloud Pak for Data
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Successful deployment of OpenShift
Successful setup of storage class

Summary

Nodes were shut down after-hours by customer compliance automated scan
- All nodes must be whitelisted by customer security
Validating health of OCP installation
- All nodes started and responding
- Investigating pods
  - Some pods appear to be stuck due to node shutdowns
  - Deleting non-responsive pods
  - Replacing ICMP range 0 with all on 10.0.0.0/8
Issue - ConnectivityCheckController is waiting for transition to desired version (4.12.8) to be completed.
- Investigating proxy configuration
- Adding cluster domain to proxy configuration - configuring local nodes to not use proxy
Fix: Adding noproxy spec to proxy configuration allowed for traffic locally (not through proxy) for nodes
Waiting for configurations to apply (automatically)
OCP cluster verified working
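
The noProxy fix above can be sketched as a small helper that builds a merge patch for the cluster proxy object. This is an illustration only: the function name no_proxy_patch and the example domain are hypothetical, and the actual entries used in the customer environment are not recorded in this log.

```python
import json


def no_proxy_patch(existing: str, additions: list[str]) -> str:
    """Build a JSON merge patch that appends entries to a comma-separated noProxy value."""
    entries = [e.strip() for e in existing.split(",") if e.strip()]
    for entry in additions:
        # Keep the original order and skip entries that are already present.
        if entry not in entries:
            entries.append(entry)
    return json.dumps({"spec": {"noProxy": ",".join(entries)}})


if __name__ == "__main__":
    # ".wxai.example.com" is a placeholder for the cluster domain.
    print(no_proxy_patch("localhost,.svc", [".wxai.example.com", "10.0.0.0/8"]))
```

The printed patch could then be passed to oc patch proxy/cluster --type=merge -p '<patch>' as an alternative to editing the object by hand.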

Adding storage to cluster for CP4D support

Creating storage class

# Requires kubeconfig
oc new-project nfs-provisioner
oc config set-context --current --namespace=nfs-provisioner

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --namespace nfs-provisioner \
    --set nfs.server=<EFS URL> \
    --set nfs.path=/ \
    --set storageClass.defaultClass=true

Initial install - Local helm install needed
Retrying by retrieving the external provisioner and copying locally
Default storage class operational
Deleted test pod
Deleted PVC

Lessons learned

OCP Deployment

Customer environment heavily affected configuration of the original deployment script and process
- Security considerations
- Proxy configurations in setup

Decisions and Action Items (DAI)

Software evaluation licenses for CP4D and watsonx.ai
Customer decision is required to determine cluster console access
Add documentation for the CP4D deployment

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai

Log 11 🛫

March 27, 2024 · 4 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- In progress
Install Cloud Pak for Data
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Summary

Significant progress made in applying the required configurations according to the customer's environment policies
Master and Worker nodes responding

Script Attempts

Cleanup Process

Delete metadata file from "wxai" directory
Delete stacks created by create_cluster_step_2.sh
Remove install state
Ignore first "FATAL" error logged when running create_cluster_step_2.sh

Attempt 1

Communication Issues
1. httpd not running on bootnode/bastion due to previous reboot
  - Fix: Enable httpd service on OS. Script change also made to force
2. Egress rules added to bootnode and master
  - All AWS default egress connections needed to be manually configured to 10.0.0.0/8 instead of the AWS default value 0.0.0.0/0 for "all" traffic

Attempt 2

Running OCP process manually (outside of script)
Unable to pull images
- Incorrectly pulling images from the bastion host, should be local registry
- Temporary fix: Run ./start_registy.sh /ibm 5000
- Ignition configuration issue causing error
- Fix: Add deleting state data to the cleanup process

Attempt 3

Starting from scratch
Running cleanup process
Issue found in authentication configuration. The script is not configured to handle more than one authentication set
- This customer deployment requires multiple authentication sets: quay.io, Red Hat registry, Artifactory. Only one was tested
Using workaround by manually adding the pullsecret to the create_install_config.sh
Running create_cluster_step_1.sh
- Successful
Updated LB DNS configurations manually (to be included in code changes, see Attempt 1)
Running create_cluster_step_2.sh
Stalled - ignitions not firing
- bootnode and master security group needed IP range (additional egress configuration issues, added code changes, see Attempt 1)
Error
- Issue with OpenShift installer extraction. For this customer case, we are not using local registry
- Fix: Use the general-use OpenShift installer from Red Hat, which does not assume local registry
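
The multiple-authentication-set issue in this attempt can be illustrated with a sketch that merges several registry credentials into a single pull secret. It assumes each input uses the standard {"auths": {...}} pull-secret layout; the function name merge_pull_secrets, the Artifactory hostname, and the credential strings are placeholders, not values from the deployment.

```python
import json


def merge_pull_secrets(*secrets_json: str) -> str:
    """Combine several pull-secret JSON documents into one with all registries under "auths"."""
    merged = {"auths": {}}
    for raw in secrets_json:
        # Later inputs override earlier ones if the same registry appears twice.
        merged["auths"].update(json.loads(raw).get("auths", {}))
    return json.dumps(merged)


if __name__ == "__main__":
    # Placeholder credentials; real values come from each registry's own secret.
    quay = '{"auths": {"quay.io": {"auth": "PLACEHOLDER"}}}'
    redhat = '{"auths": {"registry.redhat.io": {"auth": "PLACEHOLDER"}}}'
    artifactory = '{"auths": {"artifactory.example.com": {"auth": "PLACEHOLDER"}}}'
    print(merge_pull_secrets(quay, redhat, artifactory))
```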

Attempt 4

Running cleanup process
Running create_cluster_step_2.sh
Updated LB DNS configurations
- Worker security group needed valid IP range (addition to Attempt 1 and Attempt 2 egress issues, added code changes)
  - Replaced port ranges 0-0 with 0-65535
Unable to use OpenShift API (oc command) to view pods due to use of untrusted certificates
- Testing workaround: use an "insecure" connection by adding the flag --insecure-skip-tls-verify when using oc
- Example: oc get pods -A --insecure-skip-tls-verify
Removed worker node external volumes (1 TB) from script configuration

Attempt 5

Retrying using default OpenShift Certificates (bypassing/not creating or using the CA certificate from documentation steps)
- Updated config and removed certificate configuration
Running cleanup process
Running create_cluster_step_2.sh
Witnessing certificate failures in script output, but continuing install
Error: Worker nodes not communicating
Fix/Root Causes
- Removed IPI artifact from script
- Removed {registry_url} image content sources from imageContentSources (Airgap) from install config.sh
- Removed FIPS mode from install config.sh

Attempt 6

Running cleanup process
Running additional cleanup steps:
- Removed .kube
- unset KUBECONFIG
- Delete ignition file
Running create_cluster_step_2.sh
Worker nodes responding
Certificate error resolved
- Potential root cause(s) (see fixes from Attempt 5)
  - IPI artifacts in script
  - {registry_url} image content sources from imageContentSources (Airgap) from install config.sh
  - fips mode from install config.sh
Errors generated from OpenShift to be tracked in next flight log

Decisions and Action Items (DAI)

Validate cluster installation state
Software evaluation licenses for CP4D and watsonx.ai
- Pending approval process

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai

Log 10 🛫

March 26, 2024 · 5 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- In progress
Install Cloud Pak for Data
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Validation of current deployment status

Verify ‘quay.io’ is the registry in config.sh
- Verified in the registry
Add /usr/local/bin to .bashrc and .bash_profile for root
Create a small instance on a different subnet, same VPC, and confirm that IP can be curled. May require adjusting the Security Group rules for the bootnode. If 8080 fails, then the HTTPD config will need to be changed to port 80 and the service restarted
- Initial connectivity over port 8080 failed
- Fixed by opening port via security group
Changed certificate organization (O) to match the domain
Cert validated - current certificate using output of openssl x509 -in /ibm/security/certs/ca.crt -text -noout
```
Issuer: C = US, O = ec2.internal, CN = CA
Subject: C = US, O = ec2.internal, CN = CA
```

Changed to

Issuer: C = US, O = `customer domain name`, CN = CA
Subject: C = US, O =  `customer domain name`, CN = CA

Script Attempts

Cleanup Steps

Remove metadata file from "wxai" directory
Delete stacks created
Ignore initial "FATAL" error logged

Attempt 1

Running create_cluster_step_1.sh
- Applied required security tagging, as customer's security scans "remediate" (delete) improperly tagged items
- Deployment script changes
  - Changed resource types and sizes. Example: gp2 -> gp3
- Renaming "bootnode" nomenclature to bastion host bastion.'basedomain'
- Renamed certificate organization to match customer domain
Reran create registry script
Proceeded with DNS steps for new Elastic Load Balancer from previous script

Running create_cluster_step_2.sh

Bootstrap Error

Every Parameters object must contain a Type member
An error occurred (ValidationError) when calling the DescribeStacks operation: Stack with id ibmwxai-6wvkv-bootstrap-stack does not exist

Solution - Needed to add Type string for parameter

BootstrapIgnitionLocation:
  Default: s3://my-s3-bucket/bootstrap.ign
  Description: Ignition config file location.
  Type: String # This line was missing
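
A check like the following could catch this class of error before a stack is created. It is a sketch only: it assumes the template's Parameters section has already been loaded into a dictionary (for example with a YAML parser, which is not shown here), and the function name parameters_missing_type is hypothetical.

```python
def parameters_missing_type(parameters: dict) -> list[str]:
    """Return the names of CloudFormation parameters that have no Type member."""
    return sorted(
        name for name, spec in parameters.items() if "Type" not in (spec or {})
    )


if __name__ == "__main__":
    # Mirrors the parameter from the bootstrap error above, before the fix.
    params = {
        "BootstrapIgnitionLocation": {
            "Default": "s3://my-s3-bucket/bootstrap.ign",
            "Description": "Ignition config file location.",
        }
    }
    print(parameters_missing_type(params))
```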

Attempt 2

AMI Error (repeat)
- Cause: Customer security team denies all public AMI access
- Fixed: Customer approved public AMI usage (for our specific AMI ID for the CoreOS)

Attempt 3

YAML validation error (new): Parameter validation failed: parameter value for parameter name Master1Subnet does not exist. Rollback requested by user
- Investigated why script is not generating parameter for Master1Subnet
- Fix: Typo found in script create_control_plane_param.sh - masters1ubnet -> master1subnet

Attempt 4

Notified of non-compliance during attempt via email from customer security to customer host
1. Rule Formatting
  - Summary: Automated customer security scan "remediation" removed non-compliant security groups on bootstrap and master (Ingress and Egress)
```
Security Event: Security Group with Unapproved Egress. The security group non-compliant egress rules have been deleted. Please check your application to ensure the functionality has not been negatively impacted.
```
```
Security Event: Security Group with Unapproved Ingress. The security group non-compliant ingress rules have been deleted. Please check your application to ensure the functionality has not been negatively impacted.
```
  - LB template currently assuming public in sg-lb-template.yaml, bootstrap-template.yaml
```
CidrIp: 0.0.0.0/0
```
  - Fix: CidrIp values need to be replaced with the range approved by customer security
```
# Replace all instances of 0.0.0.0/0 with
CidrIp: 10.0.0.0/8
```
2. Encryption Requirements
  - All volumes must be encrypted
  - Fix: CloudFormation template must be updated to create encrypted resources
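
The CidrIp replacement from the rule-formatting item can be sketched as a small text rewrite over a template. This assumes the open range always appears as "CidrIp: 0.0.0.0/0" on one line; the function name restrict_cidrs is hypothetical and the real change was made by hand in the templates.

```python
import re

# Matches only the open range, not other CIDRs that happen to end in ".0.0.0/0".
OPEN_CIDR = re.compile(r"(CidrIp:\s*)0\.0\.0\.0/0\b")


def restrict_cidrs(template_text: str, allowed_cidr: str = "10.0.0.0/8") -> tuple[str, int]:
    """Replace every open CidrIp with allowed_cidr; return the new text and the number of changes."""
    return OPEN_CIDR.subn(lambda m: m.group(1) + allowed_cidr, template_text)


if __name__ == "__main__":
    sample = "SecurityGroupIngress:\n  - CidrIp: 0.0.0.0/0\n    FromPort: 22\n"
    updated, count = restrict_cidrs(sample)
    print(count)
    print(updated)
```

Returning the change count makes it easy to confirm that no open ranges remain before the template is redeployed.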

Attempt 5

Redeployment cleanup steps
Running create_cluster_step_1.sh
- Successful
Capture DNS output
Add DNS output to config
Running create_cluster_step_1.sh
Running create_cluster_step_2.sh
- Customer had a hard stop for the day. Awaiting feedback for next session

Decisions and Action Items (DAI)

Adding creation of the ssh key as root user on the bastion node
CoreOS AMI approval from customer (Public AMIs are blocked)
- AMI approved, step 2 script succeeded AMI portion
Customer security policies
- Ongoing: Port rule formatting (Example: Using 10.0.0.0/8 instead of 0.0.0.0/0)
- Cleared: Role authorizations
Software evaluation licenses for CP4D and watsonx.ai
- Pending approval process
Potential Proxy Configuration Error
- Prepared code changes for create_cluster_step_2.sh next session

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai

Log 9 🛫

March 25, 2024 · 3 min read

Objective

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones

Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
- Complete
Deploy OCP using the documented UPI installation steps
- In progress
Install Cloud Pak for Data
Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

Configuration of the boot node
- Installation of prerequisite software onto the boot node
- Created and started local registry
- Generated CA certificate for PKI architecture
Completion of step 1 of 2 of the deployment script

Lessons Learned

Storage insufficient on the bootnode for downloaded images, 400 GB minimum required
- Mitigation: We increased the boot disk size to 500 GB via the AWS console for the EC2 instance. We then grew the disk and grew the filesystem
- This needs to be added as a prereq
There was a constraint in the sg-lb-template.yaml requiring subnets sized from /16 to /24. We removed that constraint
Edited bootstrap-template.yaml line 91 to remove the wrong key name. (artifact from testing)
Software Evaluation process - define and build internal documentation - TBD
Documentation updates
- Parameter definitions - making them more descriptive
Validation checks
- Creating a validation process that checks for prerequisites before running any scripts/installs
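
A validation pass of this kind might look like the sketch below. It only covers two of the prerequisites noted in this log (tools on the PATH and free disk space); the function name preflight and its parameters are hypothetical, and the lookups are injectable so the check can be exercised without touching the host.

```python
import shutil


def preflight(required_tools, min_free_gb, path="/",
              which=shutil.which, disk_usage=shutil.disk_usage):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    for tool in required_tools:
        # shutil.which returns None when the executable is not on the PATH.
        if which(tool) is None:
            problems.append(f"missing tool on PATH: {tool}")
    free_gb = disk_usage(path).free / 1024**3
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.0f} GB free on {path}, need {min_free_gb} GB")
    return problems


if __name__ == "__main__":
    # The tool list and the 400 GB threshold follow the lessons above.
    for problem in preflight(["aws", "oc", "helm"], min_free_gb=400):
        print(problem)
```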

Decisions and Action Items (DAI)

AWS CLI had a previous installation. Had to manually remove that installation and re-run the aws cli install command.
We decided to run the installation as root user. Root user needed to have the /usr/local/bin added to the PATH. Did this manually on the fly with an export command.
Customer security to approve selected AMI for CoreOS
- The AMI for CoreOS is a public AMI. The customer security team needs to copy it into the dev account as public AMIs are blocked in this environment

Next Steps

License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer has no OpenShift experience
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported OOTB
  - AWS-specific: IAM users required for install/deploy and are not allowed
  - OpenShift specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak, this can interrupt engagements (solution is to always remove update polling from operators)
Deploy watsonx.ai

Log 8 🛫

March 22, 2024 · 2 min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Fixes to cluster-sts.yaml and other deployment resources.
- Fixed error in cluster-sts.yml by commenting out lines 590-599.
- Changed IamInstanceProfile: !Ref BootnodeInstanceProfile to IamInstanceProfile: <InstanceProfileName>
- Changed SubnetId: !Ref PublicSubnet1ID to SubnetId: <PrivateSubnetID> to account for private deployments
- Updated LambdaExecutionRole.json line 14: from ec2.aws.com to lambda.aws.com and added cloudformation.aws.com to the list of allowed services.
- Fixed LambdaExecutionRole ARN to proper role name.
- Commented out /bin/bash ./cp-deploy.sh env apply -e env_id=${ClusterName} [--accept-all-licenses]
- Added VPC and Subnet IDs to the “CleanupLambda” lambda function in cluster-sts, which then required adding “ec2:CreateNetworkInterface” permission to LambdaExecutionRole
- Adding tags to CleanupLambda with Application IDs.
Successful deployment of BootNode instance.

RAG

Creation of cronjob to capture logs from Python app.
Enabled metadata insertion into chunks in vector store -> (hopefully) increases retrieval accuracy
Return context to user (shows sources used to generate responses)
Added mixtral model support
Enable functionality for user to give custom rag parameters
Migrated vector DB from FAISS to chromaDB to enable the metadata functionality
Script written to easily test rag implementation and save results in csv
Implemented cache logic to make sure it considers the combination of parameters as well before choosing to send a cached response
Added better logic for caching to improve performance
Remove unwanted parameters from request body
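
The parameter-aware caching described above can be sketched as follows. The names cache_key and ResponseCache, the question normalization, and the in-memory dictionary are assumptions for illustration; the actual cache used by the RAG app is not described in this log.

```python
import hashlib
import json


def cache_key(question: str, params: dict) -> str:
    """Derive a key from the question and the full set of RAG parameters."""
    # sort_keys makes the key independent of the order the parameters were supplied in.
    payload = json.dumps(
        {"question": question.strip().lower(), "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache that only returns a hit when question and parameters both match."""

    def __init__(self):
        self._store = {}

    def get(self, question: str, params: dict):
        return self._store.get(cache_key(question, params))

    def put(self, question: str, params: dict, response: str) -> None:
        self._store[cache_key(question, params)] = response
```

With this shape, the same question asked with a different model or retrieval setting misses the cache and is answered fresh.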

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.
Updating solution docs with better asset linking.
Exploring WatsonX Discovery

Next Steps

Continue over the shoulder working sessions
- Kick off CloudFormation template install with updated STS templates.
Compilation of required endpoints
Deploy latest RAG version on AWS
Build out actions & flow in Watsonx Assistant after properly defining personas & objectives.
Kick off Cloud Pak for Deployment entitlement key.
Build RAG application using WatsonX Discovery.
Compare WatsonX Discovery RAG with existing RAG results.

Tracking (Issues)

Require sign-off on final CloudFormation template.
Red Hat CoreOS AMI pending approval.
LambdaCleanup error from not being able to assume role.
Double checking role names in CloudFormation template.

Log 7 🛫

March 20, 2024 · 2 min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Shift from CP Deployer to OpenShift UPI deployment.
Artifactory proxy details procured.
Discussion of on-site logistics
RHEL 8 AMI changed for BootNode.

RAG

Creation of cronjob to capture logs from Python app.
Enabled metadata insertion into chunks in vector store -> (hopefully) increases retrieval accuracy
Return context to user (shows sources used to generate responses)
Added mixtral model support
Enable functionality for user to give custom rag parameters
Migrated vector DB from FAISS to chromaDB to enable the metadata functionality
Script written to easily test rag implementation and save results in csv
Implemented cache logic to make sure it considers the combination of parameters as well before choosing to send a cached response

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.
Updating solution docs with better asset linking.

Next Steps

Setup bootnode with necessary downloads and resources.
Creation of CloudFormation templates for the IAM Role request.
Kick off on-site over the shoulder working sessions.
Collating information and resources to be created via OpenShift UPI deployment.
Setup Artifactory proxy.
Kick off Cloud Pak for Deployment entitlement key.
Deploy latest RAG version on AWS
Build out actions & flow in Watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)

Require sign-off on final CloudFormation template.
Red Hat CoreOS AMI still pending approval.

Log 6 🛫

March 18, 2024 · 2 min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Fixes to cluster-sts.yaml and other deployment resources.
- Fixed error in cluster-sts.yml by commenting out lines 590-599.
- Changed IamInstanceProfile: !Ref BootnodeInstanceProfile to IamInstanceProfile: <InstanceProfileName>
- Changed SubnetId: !Ref PublicSubnet1ID to SubnetId: <PrivateSubnetID> to account for private deployments
- Updated LambdaExecutionRole.json line 14: from ec2.aws.com to lambda.aws.com and added cloudformation.aws.com to the list of allowed services.
- Fixed LambdaExecutionRole ARN to proper role name.
- Commented out /bin/bash ./cp-deploy.sh env apply -e env_id=${ClusterName} [--accept-all-licenses]
- Added VPC and Subnet IDs to the “CleanupLambda” lambda function in cluster-sts, which then required adding “ec2:CreateNetworkInterface” permission to LambdaExecutionRole
- Adding tags to CleanupLambda with Application IDs.
Successful deployment of BootNode instance.

RAG

Creation of cronjob to capture logs from Python app.
Enabled metadata insertion into chunks in vector store -> (hopefully) increases retrieval accuracy
Return context to user (shows sources used to generate responses)
Added mixtral model support
Enable functionality for user to give custom rag parameters
Migrated vector DB from FAISS to chromaDB to enable the metadata functionality
Script written to easily test rag implementation and save results in csv
Implemented cache logic to make sure it considers the combination of parameters as well before choosing to send a cached response

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.
Updating solution docs with better asset linking.

Next Steps

Continue over the shoulder working sessions
- Kick off CloudFormation template install with updated STS templates.
Compilation of required endpoints
Deploy latest RAG version on AWS
Build out actions & flow in Watsonx Assistant after properly defining personas & objectives.
Kick off Cloud Pak for Deployment entitlement key.

Tracking (Issues)

Require sign-off on final CloudFormation template.
Red Hat CoreOS AMI pending approval.
LambdaCleanup error from not being able to assume role.
Double checking role names in CloudFormation template.

Log 5 🛫

March 15, 2024 · 2 min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Populated parameter overrides JSON.
Created RH Trial account and uploaded pull secret to S3 bucket.
Updated CloudFormation STS template with permissions to create and assume Role with respective JSON versions.

RAG

Creation of cronjob to capture logs from Python app.
Enabled metadata insertion into chunks in vector store -> (hopefully) increases retrieval accuracy
Return context to user (shows sources used to generate responses)
Added mixtral model support
Enabled functionality for users to supply custom RAG parameters
Migrated vector DB from FAISS to chromaDB to enable the metadata functionality
Wrote a script to easily test the RAG implementation and save results to CSV
Implemented cache logic that also considers the combination of parameters before choosing to send a cached response
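
As an illustration of the metadata insertion mentioned above, the sketch below splits a document into overlapping chunks and attaches a metadata record to each one. It assumes simple character-based chunking; the function name, field names, and default sizes are hypothetical and are not taken from the app. Records of this shape (text, id, metadata) are what a metadata-capable vector store would then ingest, which is also what lets the app return sources alongside a response.

```python
def chunk_with_metadata(text: str, source: str, chunk_size: int = 200, overlap: int = 50):
    """Split text into overlapping chunks, each carrying its own metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(
            {
                "id": f"{source}-{index}",
                "text": text[start:start + chunk_size],
                # Metadata travels with the chunk so retrieval can filter on it
                # and the response can cite where the text came from.
                "metadata": {"source": source, "chunk_index": index, "start_char": start},
            }
        )
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```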

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.
Updating solution docs with better asset linking.

Next Steps

Continue over the shoulder working sessions
- Kick off CloudFormation template install with updated STS templates.
Compilation of required endpoints
Deploy latest RAG version on AWS
Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)

Require sign-off on final CloudFormation template.
CoreOS AMI pending approval.

Log 4 🛫

March 13, 2024 · 2 min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Reviewed list of missing values able to be added to role via Policy
Sent parameter overrides list to be populated for CloudFormation template installation.
Creation of three separate CloudFormation template Roles.
Updated CloudFormation templates to use STS.

RAG

Creation of cronjob to capture logs from Python app.
Enabled metadata insertion into chunks in vector store -> (hopefully) increases retrieval accuracy
Return context to user (shows sources used to generate responses)
Added mixtral model support
Enabled functionality for users to supply custom RAG parameters
Migrated vector DB from FAISS to chromaDB to enable the metadata functionality
Wrote a script to easily test the RAG implementation and save results to CSV
Implemented cache logic that also considers the combination of parameters before choosing to send a cached response
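
A test script of the kind described above might be structured like this sketch. The `ask` callable is a hypothetical stand-in for the call into the RAG app, and the column layout is an assumption; the sketch only shows one way to run every question against every parameter combination and write one CSV row per run.

```python
import csv
import itertools


def run_rag_tests(questions, param_grid, ask, out_file):
    """Run each question with each parameter combination and write CSV rows.

    `ask(question, params)` is a placeholder for the real RAG call.
    Returns the number of result rows written.
    """
    names = sorted(param_grid)
    writer = csv.DictWriter(out_file, fieldnames=["question", *names, "answer"])
    writer.writeheader()
    rows = 0
    for question in questions:
        # itertools.product yields every combination of the parameter values
        for values in itertools.product(*(param_grid[name] for name in names)):
            params = dict(zip(names, values))
            writer.writerow({"question": question, **params, "answer": ask(question, params)})
            rows += 1
    return rows
```

Passing an open file (for example `open("results.csv", "w", newline="")`) as `out_file` saves the results to disk; passing an `io.StringIO` keeps them in memory for inspection.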

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.

Next Steps

Continue over the shoulder working sessions
- Kick off CloudFormation template install
Compilation of required endpoints
Fill out network values required for OCP deployment.
Deploy latest RAG version on AWS
Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)

Require sign-off on final CloudFormation template.
Getting access to CoreOS AMI.

Log 3 🛫

March 11, 2024 · One min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Discovery of AWS DevOps role to be used and augmented with permissions.
Adjusted check-permissions.sh script to accept a profile.
- Added --profile and $PROFILE_NAME
Created CloudFormation templates for roles with the permissions needed for install.
Adjusted CloudFormation templates to account for roles instead of a user.
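
The profile change above was made in a shell script that is not reproduced in this log. As a rough illustration of the same pattern, the Python sketch below builds an AWS CLI argument list and appends `--profile` when a profile is passed in or set in a `PROFILE_NAME` environment variable. The helper name and the environment-variable fallback are assumptions; `--profile` itself is a standard global AWS CLI option.

```python
import os


def build_aws_command(args, profile=None):
    """Return an AWS CLI argument list, adding --profile when one is available."""
    # Fall back to the PROFILE_NAME environment variable (assumed convention)
    profile = profile or os.environ.get("PROFILE_NAME")
    command = ["aws", *args]
    if profile:
        command += ["--profile", profile]
    return command
```

For example, `build_aws_command(["sts", "get-caller-identity"], profile="devops")` yields a command that reports which role the profile resolves to, which is a quick way to confirm that a role rather than a user is in effect before running permission checks.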

RAG

App deployed on Fyre VM
Added support for granitev2 and llama2 70b chat models.
watsonx Assistant configured to interact with app via API for easier testing.

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.
Test out new RAG chunking method.

Next Steps

Continue over the shoulder working sessions
Compilation of required endpoints
Fill out network values required for OCP deployment.
Add Mixtral model to RAG.
Deploy latest RAG version on AWS
Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)

Require sign-off on final CloudFormation template.

Log 2 🛫

March 8, 2024 · 2 min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

AWS

Established cocreation working cadence and cocreation point of contact.
Provided list of required Role permissions.
Successfully deployed OCP and CP4D from the CloudFormation template via the Console with the following:
- Creation of 3 Public and 3 Private Subnets and NAT gateways via CloudFormation template
- 3x m5.2xlarge master nodes
- 6x m6i.8xlarge worker nodes
Successfully deployed the CloudFormation template via CLI using:
- A parameter overrides JSON file
- Only the permissions required for deployment (tested)
Created CloudFormation template to create a role with the exact permissions needed to run the CloudFormation deployment.
Tagging of resources created by CloudFormation template.
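
The parameter overrides file and the resource tags are not shown in this log, so the sketch below is only an assumption about their shape. It converts plain dictionaries into the `ParameterKey`/`ParameterValue` and `Key`/`Value` lists that CloudFormation stack creation accepts for parameters and tags. The helper names and every parameter and tag name in the usage note are hypothetical.

```python
import json


def to_cfn_parameters(overrides: dict) -> list:
    """Convert {"Name": value} into CloudFormation's parameter list format."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in sorted(overrides.items())
    ]


def to_cfn_tags(tags: dict) -> list:
    """Convert {"Name": value} into CloudFormation's stack tag list format."""
    return [{"Key": key, "Value": str(value)} for key, value in sorted(tags.items())]


def write_json(path: str, data: list) -> None:
    """Write the converted list to a JSON file for use with the CLI."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)
```

For instance, `to_cfn_parameters({"MasterInstanceType": "m5.2xlarge", "WorkerInstanceType": "m6i.8xlarge", "WorkerCount": 6})` would capture the node sizing listed above under hypothetical parameter names, and `to_cfn_tags({"project": "watsonx-eval"})` would give a tag list to apply at the stack level.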

RAG

App deployed on Fyre VM
Added support for granitev2 and llama2 70b chat models.
watsonx Assistant configured to interact with app via API for easier testing.

In Progress

End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
Tagging cp-deployer.sh generated resources.
Test out new RAG chunking method.

Next Steps

Begin over the shoulder working sessions
Compilation of required endpoints
Add Mixtral model to RAG.
Deploy latest RAG version on AWS.
Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)

Require sign-off on final CloudFormation template.

Log 1 🛫

February 29, 2024 · One min read

Objectives

Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments

Established this GitHub repository as single source of truth for architecture, IaC, and documentation to collaborate with stakeholders.
Developed draft CloudFormation template to provision AWS resources.
Started incorporating STS into CloudFormation.

In Progress

Awaiting approval for AWS credits to cover infrastructure costs. Following up to expedite.
Finalizing deployment plan and cadence.

Next Steps

Review deployment details in working session with stakeholders.
Incorporate additional feedback into documentation and IaC templates.
Upon AWS credit approval and stakeholder sign-off, begin provisioning.

Tracking (Issues)

Need confirmation of AWS credit approval.
Require sign-off on final CloudFormation template.
Align on deployment cadence with customer.
