Log 35 🛫

· One min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Further configuration of watsonx Assistant

Summary​

  • Configuring NeuralSeek
    • Error occurring during testing: "Error, please refresh the page and try again"
      • Issue found: an old API key was being used because the configuration export for the customer unexpectedly retained the API key
  • Issue has been resolved; continuing configuration of NeuralSeek

Decisions and Action Items (DAI)​

  • Investigation of cluster proxy configuration (post POC)
  • Provide a demonstration of ServiceNow/Outlook integration via Orchestrate and Assistant

Lessons Learned​

  • In NeuralSeek, the API key is retained in configuration exports. The latest configuration export contained an old key.
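One way to catch this before an export is reused is a quick scan of the exported file. The sketch below is illustrative only: it assumes the export is a text file (for example JSON) and that the key sits in a field whose name contains "apikey" or "api_key". The function names contains_api_key and check_export are hypothetical and not part of NeuralSeek.

```shell
# Hypothetical pre-reuse check for an exported configuration file.
# Assumption: the key is stored in a field named like "apiKey" or "api_key";
# adjust the pattern to match the real export format.
contains_api_key() {
    # -i: case-insensitive, -E: extended regex, -q: exit status only
    grep -iEq 'api[_-]?key' "$1"
}

check_export() {
    if contains_api_key "$1"; then
        echo "WARNING: $1 appears to contain an API key; confirm it is current before importing"
        return 1
    fi
    echo "OK: no API key field found in $1"
}
```

Running check_export against the export before importing it would flag a retained key so it can be compared with the current one.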

Next Steps​

  • Application verification

Log 34 🛫

· One min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Further configuration of watsonx Assistant

Summary​

  • Pivoting from external integrations
    • Unable to configure proxy to reach out to external sites from OCP
  • Continuing configuration of assistant actions and non-external connections
    • Assistant configured
  • Continuing configuration of NeuralSeek

Decisions and Action Items (DAI)​

  • Investigation of cluster proxy configuration (post POC)
  • Provide a demonstration of ServiceNow/Outlook integration via Orchestrate and Assistant

Lessons Learned​

  • Cluster proxy configuration is required for applications to reach external sources; to be investigated post-POC

Next Steps​

  • Application verification

Log 33 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

Summary​

  • Brief watsonx.ai Prompt Lab demonstration
  • Reconfiguring watsonx Assistant front end (per customer request, to be worked on)
  • Continuing proxy investigation for the cluster

Decisions and Action Items (DAI)​

  • ServiceNow connectivity being investigated
    • Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com
  • watsonx Assistant front-end enhancements with customer
    • Chat greeting enhancements

Lessons Learned​

  • PDF files needed for watsonx Assistant extensions must be available to the cluster internally (no external access or configurable access to S3 buckets)
    • Workaround: hosted the PDF files on the bastion httpd server (originally used for the OCP ignition files)

Next Steps​

  • Application configuration
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 32 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Attempting proxy reconfiguration for the cluster
    # Customer data redacted
    kubectl get pods -n cpd -o json | jq -r '.items[].metadata.labels["app.kubernetes.io/instance"]' | sort | uniq
    cpd-cli manage enable-rsi --cpd_instance_ns=${PROJECT_CPD_INST_OPERANDS}
    cat << EOF >> cpfs-proxy-envs.json
    [ {
    "name": "HTTP_PROXY",
    "value": "http://http.proxy.customer.com:8000/"
    }, {
    "name": "http_proxy",
    "value": "http://http.proxy.customer.com:8000/"
    }, {
    "name": "HTTPS_PROXY",
    "value": "http://http.proxy.customer.com:8000/"
    }, {
    "name": "https_proxy",
    "value": "http://http.proxy.customer.com:8000/"
    }, {
    "name": "NO_PROXY",
    "value": ".aws-nonprod.xxx.com,.ibm-wxai.aws-nonprod.customer.com.apps.ibm-wxai.aws-nonprod.xxx.com,172.30.0.0/16,10.128.0.0/14,10.19.170.0/25"
    }, {
    "name": "no_proxy",
    "value": ".aws-nonprod.xxx.com,.ibm-wxai.aws-nonprod.customer.com.apps.ibm-wxai.aws-nonprod.xxx.com,172.30.0.0/16,10.128.0.0/14,10.19.170.0/25"
    } ]
    EOF
    cpd-cli manage create-rsi-patch \
    --cpd_instance_ns=cpd \
    --patch_name="cpfs-proxy" \
    --description="add proxy settings to CPD" \
    --patch_type=rsi_pod_env_var \
    --patch_spec=/tmp/work/rsi/cpfs-proxy-envs.json \
    --spec_format=set-env \
    --select_all_pods=true \
    --include_labels=wa \
    --exclude_labels=app:rsi \
    --skip_apply=false \
    --state=active
    • The new configuration requires a watsonx Assistant pod restart; deferring it until further application configuration is complete
    • Continuing to configure NeuralSeek assistant
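Because the patch spec repeats the same proxy URL and no-proxy list for the upper-case and lower-case spelling of each variable, it can be less error-prone to generate the file than to type it by hand. The function below is a minimal sketch, not part of cpd-cli: proxy_env_json is a hypothetical helper, and it assumes the values contain no double quotes or backslashes that would need JSON escaping.

```shell
# Hypothetical helper: print the set-env JSON array for a proxy URL and a
# no-proxy list, covering both spellings of each variable.
# Assumption: neither value contains characters that need JSON escaping.
proxy_env_json() {
    proxy_url=$1
    no_proxy_list=$2
    printf '[\n'
    for name in HTTP_PROXY http_proxy HTTPS_PROXY https_proxy; do
        printf '  {"name": "%s", "value": "%s"},\n' "$name" "$proxy_url"
    done
    printf '  {"name": "NO_PROXY", "value": "%s"},\n' "$no_proxy_list"
    printf '  {"name": "no_proxy", "value": "%s"}\n' "$no_proxy_list"
    printf ']\n'
}
```

For example, proxy_env_json "http://my_proxy:my_port/" "$NO_PROXY_LIST" > cpfs-proxy-envs.json writes the file in one step. Redirecting with > rather than >> keeps a rerun from appending a second array to the same file.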

Decisions and Action Items (DAI)​

  • ServiceNow connectivity being investigated
    • Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com

Lessons Learned​

  • PDF files needed for watsonx Assistant extensions must be available to the cluster internally (no external access or configurable access to S3 buckets)
    • Workaround: hosted the PDF files on the bastion httpd server (originally used for the OCP ignition files)

Next Steps​

  • Application configuration
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 31 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Successful configuration of watsonx Assistant with PDF lookups
  • Successful validation of NeuralSeek source attribution

Summary​

  • TS016344977 opened with IBM support to investigate a Watson Discovery EDB cluster issue
    • Watson Discovery has an EDB cluster with only 1 of 2 pods running; the other pod is stuck in CrashLoopBackOff
    • Troubleshooting with support
    • Support identified an issue with the PostgreSQL database
      • The PostgreSQL database is not running, which puts the cluster in an unhealthy state
        {"level": "info", "ts": "2024-05-29T19:39:06Z", "logger": "pg_rewind", "msg": "pg_rewind: connected to server", "pipe": "stderr", "logging_pod": "wd-discovery-cn-postgres-1"}
        {"level": "info", "ts": "2024-05-29T19:39:06Z", "logger": "pg_rewind", "msg": "pg_rewind: fatal: target server must be shut down cleanly", "pipe": "stderr", "logging_pod": "wd-discovery-cn-postgres-1"}
      • Engaging PostgreSQL support
  • Continuing watsonx Assistant configuration
    • Hosting response PDFs on the bastion httpd server
    • watsonx Assistant with PDF lookup configured
  • Continuing NeuralSeek configuration
    • Test questions verified
  • Retrying ServiceNow Skills
    • Configuring the proxy in the skill YAML/JSON
    • Same timeout error when trying the skill; continuing investigation

Decisions and Action Items (DAI)​

  • ServiceNow connectivity being investigated
    • Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com

Lessons Learned​

  • PDF files needed for watsonx Assistant extensions must be available to the cluster internally (no external access or configurable access to S3 buckets)
    • Workaround: hosted the PDF files on the bastion httpd server (originally used for the OCP ignition files)

Next Steps​

  • Application configuration
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 30 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Testing NeuralSeek functionality
  • NeuralSeek verified operational
  • Configuring watsonx Assistant ServiceNow extension
    • Unable to access ServiceNow, getting blocked outbound
    • Investigating AWS and Cluster network connectivity
      • Curl requests working from bastion, not from application or cluster
    • Issue found: Customer proxy needs to be used in all API requests (even though cluster is configured to use proxy)
      • Proposed workaround: Add proxy to OpenAPI spec for watsonx Assistant and Orchestrate
    • Fix: Configure cluster environments to use customer proxy
      oc set env deployment --all http_proxy=http://my_proxy:my_port
      oc set env deployment --all https_proxy=http://my_proxy:my_port

Decisions and Action Items (DAI)​

  • ServiceNow connectivity being investigated
    • Issue with proxy configuration not allowing watsonx Assistant/Orchestrate communication with ServiceNow.com

Lessons Learned​

  • If a proxy is required, the proxy configuration needs to be applied to the deployment environments for cluster applications to use the proxy. Initially the proxy was only applied to the cluster itself, not the deployment environments.
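A quick way to confirm that a given deployment actually carries the proxy settings is to inspect its environment. The sketch below assumes the environment is listed as NAME=value lines, such as the output of oc set env deployment/<name> --list, and reports which of the two variables set in the fix are absent. The function name missing_proxy_vars is hypothetical.

```shell
# Hypothetical check: read NAME=value lines on stdin and print each proxy
# variable that is not set. Returns 0 only when both are present.
# Comment lines (starting with '#') in the listing are ignored naturally
# because they never match the NAME= pattern.
missing_proxy_vars() {
    env_lines=$(cat)
    missing=0
    for name in http_proxy https_proxy; do
        if ! printf '%s\n' "$env_lines" | grep -q "^${name}="; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be used as oc set env deployment/<name> --list | missing_proxy_vars, with any printed name indicating a deployment that still lacks that setting.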

Next Steps​

  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 29 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Continuing to investigate NeuralSeek configuration / CP4D
    • Support team reviewing CP4D install and connectivity
    • Potential issues found on customer network blocking API connectivity
      Run a port-forward to the service on port 8888:
      oc port-forward service/ibm-nginx-svc 8888:443
      From a new terminal on the same node:
      ss -tlnp
      Verified port 8888 listening on localhost
      # Custom support js
      curl -k -LO https://127.0.0.1:8888/common-nav/api/nls/login-nls.js
      Output shows the pod receiving the required files. Something between the pods is disallowing authenticated traffic
    • Customer to engage network team to assess / log network traffic via AWS
  • Issue resolved
    • There were erroneous EC2 instances in the AWS load balancer from previous configurations that interfered with cluster communication
    • The issue was intermittent and only appeared when the load balancer sent network traffic to an incorrect (nonexistent) destination
  • Ready to continue application verification
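Leftover targets of this kind can be spotted by comparing the instance IDs registered with the load balancer against the IDs of the current cluster nodes. The sketch below is illustrative: stale_targets is a hypothetical helper that takes two files with one instance ID per line. How those lists are gathered is an assumption; for a load balancer that uses target groups, aws elbv2 describe-target-health is one possible source for the registered IDs.

```shell
# Hypothetical helper: print IDs present in the load balancer's registered
# list ($1) but absent from the current cluster instance list ($2).
stale_targets() {
    registered_sorted=$(mktemp)
    cluster_sorted=$(mktemp)
    sort -u "$1" > "$registered_sorted"
    sort -u "$2" > "$cluster_sorted"
    # comm -23 keeps only the lines unique to the first (registered) file
    comm -23 "$registered_sorted" "$cluster_sorted"
    rm -f "$registered_sorted" "$cluster_sorted"
}
```

Any ID printed is a candidate for deregistration, after confirming it does not belong to a node that is still in use.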

Decisions and Action Items (DAI)​

  • None pending

Lessons Learned​

  • Clean up previous networking configurations; stale load balancer targets can interfere with cluster communication

Next Steps​

  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 28 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Continuing to investigate NeuralSeek configuration / CP4D
    • Reset all pods that are used in the CP4D login flow using the following commands:
      oc delete pod -n cpd -l component=ibm-nginx
      oc delete pod -n cpd -l component=usermgmt
      oc delete pod -n cpd -l component=zen-core
      oc delete pod -n cpd -l component=zen-core-api
    • Confirmed metastore-db is also healthy using this command:
      oc get cluster.postgresql.k8s.enterprisedb.io zen-metastore-edb -n cpd
    • Still unable to log in and browse CP4D consistently
    • Errors found in usermgmt pods that appear after failed page loads in CP4D
    • Successful use of APIs to query CP4D for a Bearer Token, to list project IDs, and even to generate text from watsonx
    • NeuralSeek was able to query watsonx; however, it couldn't interact with Watson Discovery
    • A support ticket was opened previously and updated this morning to include screenshots of the error logs found in the usermgmt pods. The ticket has been upgraded to Sev1
    • Pods verified
    • Engaging support

Decisions and Action Items (DAI)​

  • None pending

Lessons Learned​

  • None today

Next Steps​

  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 27 🛫

· One min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Investigating NeuralSeek configuration
    • Curl request from application and bastion failing
    • Nodes missing machineconfigs; waiting for machineconfigs to propagate through all nodes

Decisions and Action Items (DAI)​

  • None pending

Lessons Learned​

  • None today

Next Steps​

  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 26 🛫

· One min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Installed NeuralSeek
  • Installed watsonx Orchestrate

Summary​

  • Installing NeuralSeek
  • S3 Bucket Configured
  • Installing watsonx Orchestrate
    • Removing/recreating MongoDB pod
    • Applied required patch from 'known issues' list (IBM internal documentation)
      INSTANCE_NAME=$(oc -n <cpd-namespace> get wa --output jsonpath='{.items[0].metadata.name}')
      oc patch wa ${INSTANCE_NAME} --type='merge' -p='{"configOverrides":{"store":{"extra_vars": {"store": {"MAX_NEW_IA_ASSISTANTS":"1000","MAX_NEW_IA_SKILLS":"300000","ASSISTANT_MAX_PAGE_LIMIT":"1000"}}}}}'

Decisions and Action Items (DAI)​

  • None pending

Lessons Learned​

  • None today

Next Steps​

  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 25 🛫

· One min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • AppConnect installed

Summary​

  • Installing AppConnect
  • Installing watsonx Orchestrate (includes an Assistant instance)
    • Customer continuing install after session

Decisions and Action Items (DAI)​

  • None pending

Lessons Learned​

  • None today

Next Steps​

  • Deploy watsonx.ai
  • Install NeuralSeek
  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 24 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Successful installation of Llama model

Summary​

  • Installing and troubleshooting remaining LLMs for watsonx.ai
    • Granite model installed (previous session)
    • Issues with installation of Mistral and Llama models
      • Investigating Mistral model compatibility
        • Determined to be incompatible with p4d.24xlarge (a100-40gb)
      • Investigating a workaround for Llama model compatibility with the p4d.24xlarge (a100-40gb) instance type in us-east-2
  • Llama installed and running using workaround (Workaround source https://github.ibm.com/NGP-TWC/ml-planning/issues/37189)
    # Add the following to watsonxaiifm-cr:
    meta_llama_llama_2_70b_chat:
      deployment_yaml_name: llama-2-70b-chat.yaml.j2
      pvc_name: meta-llama-llama-2-70b-chat-pvc
      svc_name: llama-2-70b-chat
      pvc_size: 150Gi
      dir_name: models--meta-llama--llama-2-70b-chat
      model_name: /watsonxaiifm-models/models--meta-llama--llama-2-70b-chat
      cuda_visible_devices: 0,1,2,3,4,5,6,7
      model_root_dir: /watsonxaiifm-models
      flash_attention: true
      deployment_framework: hf_custom_tp
      dtype_str: float16
      max_batch_size: 128
      max_concurrent_requests: 160
      max_batch_weight: 200000
      max_sequence_length: 4096
      max_prefill_weight: 60000
      max_new_tokens: 4096
      num_shards: 8
      hf_modules_cache: /tmp/huggingface/modules
      force_apply: no
    meta_llama_llama_2_70b_chat_resources:
      limits:
        cpu: "8"
        memory: 246Gi
        nvidia.com/gpu: "8"
        ephemeral-storage: 1Gi
      requests:
        cpu: "1"
        memory: 240Gi
        nvidia.com/gpu: "8"
        ephemeral-storage: 10Mi
    # Add the following to watsonxai-cr:
    tuning_disabled: true

Decisions and Action Items (DAI)​

  • Mistral installation
    • Incompatible with a100-40gb

Lessons Learned​

  • LLM compatibility with provisioned resources
    • Most IBM LLMs require a100-80gb
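A rough, weights-only estimate helps show why sharding across all eight GPUs makes the 70B Llama model workable on 40 GB cards. The arithmetic below is a sketch under stated assumptions: 2 bytes per parameter for float16, weights split evenly across shards, and no allowance for KV cache, activations, or framework overhead, so it is a lower bound rather than a sizing rule. It does not explain the Mistral incompatibility. The function names are hypothetical.

```shell
# Weights-only memory estimate in GB (integer arithmetic).
# $1: parameter count in billions, $2: bytes per parameter (2 for float16)
weights_gb() {
    echo $(( $1 * $2 ))
}

# Per-GPU share of the weights, in tenths of a GB, for an even split.
# $1: total weights in GB, $2: number of shards
per_gpu_tenths_gb() {
    echo $(( $1 * 10 / $2 ))
}

# Succeeds when the per-GPU share is below the GPU memory size.
# $1: total weights in GB, $2: number of shards, $3: GPU memory in GB
fits_per_gpu() {
    [ "$(per_gpu_tenths_gb "$1" "$2")" -lt $(( $3 * 10 )) ]
}
```

With 70 billion parameters at float16, weights_gb 70 2 gives 140 GB, and per_gpu_tenths_gb 140 8 gives 175, that is 17.5 GB per GPU with num_shards set to 8. That leaves headroom on a 40 GB card, whereas the same weights would not fit on any single 40 GB or 80 GB GPU.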

Next Steps​

  • Deploy watsonx.ai
  • Install NeuralSeek
  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 23 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • GPU node provisioned
  • Deployment of Granite model

Summary​

  • Provisioning GPU node
    • Installing GPU operator
  • Installing watsonx.ai operator
    • Waiting on the installer
  • Facing issues with Mistral IBM models for watsonx.ai - support case open

Decisions and Action Items (DAI)​

  • MCG Secrets created for Cloud Pak components
  • Authorized Instance Topology
  • Installed Cloud Pak shared components
  • Installed Knative
  • GPU Node Activity and Billing

Lessons Learned​

  • GPU Node Activity
    • AWS was charging for a GPU node while it was powered down
    • Provisioned the GPU node using a reserved instance (30 days starting May 6)

Next Steps​

  • Deploy watsonx.ai
  • Install NeuralSeek
  • Application configuration
    • NeuralSeek
    • watsonx.ai Prompt Lab
    • watsonx Assistant
    • watsonx Orchestrate
      • ServiceNow skills
      • Microsoft Outlook skills

Log 22 🛫

· One min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • Complete
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Successful installation of watsonx Discovery
  • Successful installation of Watson Studio
  • Successful installation of Watson Machine Learning
  • Successful installation of IBM Knowledge Catalog

Summary​

  • Troubleshooting watsonx Assistant installation pods
    • Starting required pods/containers

Decisions and Action Items (DAI)​

  • None today

Lessons Learned​

  • None today

Next Steps​

  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 21 🛫

· 3 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Today's Accomplishments​

  • Successful deployment of CP4D
  • Installed Knative

Summary​

  • Customer has approved the required contracts, and procedures have been followed to obtain an entitlement key
  • Updated bootnode IP reference in configuration
  • Re-ran install scripts
  • MCG Secrets created for Cloud Pak components
  • Verified cluster is up (v4.12.8)
    • No errors
    • Checked storage size on nodes: they have 5 TB disks instead of the intended 500 GB
      • This was set incorrectly in the config; reconfiguring worker nodes with the proper value (worker-template.yaml)
    • Secondary disks are not needed on the nodes; NFS will be used instead
  • Adding GPU node
    • Updating config.sh, gpu_subnet accurate, security groups set properly
    • Logging into AWS via aws_sso
    • Running add_node.sh to add gpu node (runs for about 10 min)
    • Verifying node drain and uncordoning the node
  • Installing NFS provisioner
    • Operator install unsuccessful
      • Fell back to a Helm install
    • Downloading the Helm repo and installing
    • Tested and verified with a test pod attached to the nfs-client storageclass, then cleaned up

Decisions and Action Items (DAI)​

  • Authorized Instance Topology

Lessons Learned​

  • Had an issue with "apply-cluster-components", which requires connecting to GitHub to download CASE files. Found a solution in the cpd-cli documentation: use two additional flags on the command, "--case_download=true" and "--from_oci=true", which tell cpd-cli to download the CASE files from the IBM Open Container Initiative (OCI) registry instead of GitHub.
  • While running "setup-instance-topology" for Knative, received an error regarding storage. Added "--block_storage_class=${STG_CLASS_BLOCK}" to the command and it completed successfully.

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry that might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 20 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Customer has approved the required contracts, and procedures have been followed to obtain an entitlement key.

Decisions and Action Items (DAI)​

  • Customer has worked with us to spin up a new cluster
    • Previous cluster had been deleted to save AWS credits
    • IBM to provide tighter instruction for the deployment of CP4D
  • Customer received a GPU reservation
    • GPU node has been ordered and deployed
    • Costs associated with GPU resources are discounted, but the meter is running once the reservation is accepted.

Lessons Learned​

  • watsonx.ai service requires larger local disks on worker nodes (500 GB)
  • The GPU node required for watsonx.ai seems to be a limited resource
  • Had to replace the nodes in the cluster as the attached disks were incorrect

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry that might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 19 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure on ref environment and document
    • In Progress

Summary​

  • Awaiting entitlement key approval on customer side

Decisions and Action Items (DAI)​

  • Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
    • Customer to provide by EOD Monday
  • Worker nodes shut down until approval comes through
  • Drafted and sent instructions for the customer to resize the worker node disks for when the cluster is brought back online
  • Drafted and sent instructions for the customer to order a GPU node
    • GPU node to be added to the cluster and then cordoned, drained, and shut down

Lessons Learned​

  • Sizing for Cloud Pak for Data on OpenShift needed to be adjusted to correct an under-provisioning of CPU resources
  • watsonx.ai service requires larger local disks on worker nodes (500 GB)
  • The GPU node required for watsonx.ai seems to be a limited resource

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry that might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
    • Resize local disks for worker nodes
    • Customer to order a GPU node and attach it to the cluster
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 18 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configure the boot node to establish a beachhead in the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

  • CP4D Final Preparations
    • Added options to the CPD VARS file
    • Recreation of work dir

Summary​

  • Troubleshooting the CP4D CLI
  • Awaiting entitlement key approval

Decisions and Action Items (DAI)​

  • Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
    • Customer to provide by EOD Monday

Lessons Learned

  • Preparation for Cloud Pak for Data on OpenShift sizing needed to be adjusted to reflect an under-provisioning of CPU resources

Objective​

  • Start building guided workflows
  • Attempt to improve parsing of unstructured tables with WDU (Watson Document Understanding)

Milestones​

  1. Designed some guided workflow concept ideas
  2. Coded a Flask app to expose an API that sends emails to users from watsonx Assistant
  3. Table parsing with WDU successfully configured

Next Steps​

  • Integrate agent-based workflows into guided workflows (LangChain agents)
  • Investigate whether it's possible to improve table parsing
  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 17 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments

  • Kubelet configuration update applied
  • Deployed instance of Multicloud Object Gateway

Summary​

  • Reinstalled NFS provisioner (Helm)
  • Installed OpenShift Data Foundation Operator
  • Deployed Multicloud Object Gateway
  • OpenShift portal is active and cluster appears healthy
  • Configuring CP4D CLI
  • Waiting for the final cluster nodes to update through the machine config pool

Decisions and Action Items (DAI)​

  • Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
    • Customer to escalate internally

Lessons Learned

  • Preparation for Cloud Pak for Data on OpenShift sizing needed to be adjusted to reflect an under-provisioning of CPU resources
    • This was resolved by..

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 16 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Summary​

  • Master nodes are attempting to upgrade and are stuck, which prevents rollback of the ingress changes
  • Attempting to deploy a new cluster
  • New cluster successfully deployed via latest script
  • Deploying EFS
  • Installing OpenShift Data Foundation & Standalone Multicloud Object Gateway

Decisions and Action Items (DAI)​

  • Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
    • Customer to escalate internally

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 15 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments​

  • Verifying cluster health in preparation for Cloud Pak for Data install
  • Verifying network connectivity to application pods

Summary​

  • Attempting to resolve domain name of the OpenShift portal
    • Added required cluster-wide settings to proxy
      • Added wxai domain information to the noProxy configuration using oc edit proxy/cluster
      • Investigating OpenShift error "certificate is valid for oauth-openshift.openshift-authentication.svc"
      • Cluster pieces updating, validating health of cluster on later call (may be related to current certificate errors)
    • More nodes in "ready" state
    • 1 master node continuing to update causing cluster connectivity issues
    • Attempting to drain and restart pending nodes
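
The noProxy change above amounts to appending the cluster domain to a comma-separated list without duplicating entries. A minimal sketch of that merge, assuming the value is later pasted into spec.noProxy of proxy/cluster by hand; the domain and existing entries shown are hypothetical, since the log does not record them:

```python
def merge_no_proxy(existing: str, additions: list[str]) -> str:
    """Return a comma-separated noProxy value with each addition appended once."""
    entries = [entry.strip() for entry in existing.split(",") if entry.strip()]
    seen = set(entries)
    for item in additions:
        item = item.strip()
        if item and item not in seen:
            entries.append(item)
            seen.add(item)
    return ",".join(entries)


if __name__ == "__main__":
    # Hypothetical values: the real cluster domain is not given in the log.
    current = ".cluster.local,.svc,10.0.0.0/8"
    print(merge_no_proxy(current, [".wxai.example.internal", "10.0.0.0/8"]))
```

Keeping the existing order and only appending makes the change easy to review against the current proxy object before saving it.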

Decisions and Action Items (DAI)​

  • Software evaluation awaiting customer's approval process. This blocks our ability to download software from cp.icr.io
    • Customer to escalate internally

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 14 🛫

· 2 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • Complete
  3. Install Cloud Pak for Data
    • In Progress
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments​

  • Installed Cloud Pak for Data CLI on bastion host

Summary​

  • Installed Cloud Pak for Data CLI on bastion host
  • Attempting to resolve console URL from customer host (laptop), external to cluster and bastion
    • Customer is unable to add entries to the Windows hosts file because it requires local administrator rights
    • Adding hostname resolution for CP4D
  • Customer doing offline
    • Adding resolvable URL for CP4D, allows for proper CP4D application communication (see Action Item)
      • Customer process: sending GET requests requires customer security approval
    • Customer to follow the Multicloud Object Gateway instructions
    • Customer to follow internal process for software trial evaluation, supporting documentation sent

Decisions and Action Items (DAI)​

  • Software evaluation licenses for CP4D and watsonx.ai
  • Customer decision is required to determine cluster console access

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai
  • Application configurations
  • Application validations

Log 13 🛫

· One min read

Objective​

  • Evaluate Watsonx Discovery vs Chromadb for vector store
  • Start ideating & testing sample complete workflows

Milestones​

  1. Successfully configured working Wx Discovery Vector Store with chunking, embeddings and ELSER semantic search
  2. Connected this KB in WXD to Watsonx Assistant
  3. Developed method to parse & detect RAG responses that give step by step instructions, which can be used to create guided workflows
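
The third milestone, detecting answers that read as step-by-step instructions, could look roughly like the sketch below. This is not the method used in the engagement, which the log does not describe; it is one plausible approach that assumes steps appear on their own lines as "1.", "2)" or "Step 3:". The function names and the two-step threshold are hypothetical.

```python
import re

# Matches lines such as "1. Open the console", "2) Click Save" or "Step 3: Restart".
STEP_PATTERN = re.compile(r"^\s*(?:step\s+)?(\d+)\s*[.):]\s+(.+)$", re.IGNORECASE)


def extract_steps(answer: str) -> list[str]:
    """Return the text of every numbered step found in a RAG answer."""
    steps = []
    for line in answer.splitlines():
        match = STEP_PATTERN.match(line)
        if match:
            steps.append(match.group(2).strip())
    return steps


def is_guided_workflow(answer: str, min_steps: int = 2) -> bool:
    """Treat an answer as a candidate guided workflow if it has enough steps."""
    return len(extract_steps(answer)) >= min_steps
```

Each extracted step can then become one turn in a guided conversation, with the assistant waiting for confirmation before moving to the next.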

Lessons learned​

  • Wx Discovery seems to have more accurate chunk retrieval than Chromadb
  • Wx Discovery out-of-the box chatbot answers are also better
  • Config is simple and can be done in an hour with prewritten code by just passing in credentials

Next Steps​

  • Continue building guided conversations & enable them to conduct operations autonomously

Log 12 🛫

· 3 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • In progress
  3. Install Cloud Pak for Data
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments​

  • Successful deployment of OpenShift
  • Successful setup of storage class

Summary​

  • Nodes were shut down after-hours by customer compliance automated scan
    • All nodes must be whitelisted by customer security
  • Validating health of OCP installation
    • All nodes started and responding
    • Investigating pods
      • Some pods appear to be stuck due to node shutdowns
      • Deleting non-responsive pods
      • Replacing ICMP range 0 with all on 10.0.0.0/8
  • Issue - ConnectivityCheckController is waiting for transition to desired version (4.12.8) to be completed.
    • Investigating proxy configuration
    • Adding cluster domain to proxy configuration - configuring local nodes to not use proxy
  • Fix: Adding a noProxy spec to the proxy configuration allowed nodes to send local traffic directly instead of through the proxy
  • Waiting for configurations to apply (automatically)
  • OCP cluster verified working
  • Adding storage to cluster for CP4D support
    • Creating storage class
    # Requires kubeconfig
    oc new-project nfs-provisioner
    oc config set-context --current --namespace=nfs-provisioner
    helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
    helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --namespace nfs-provisioner \
    --set nfs.server=<EFS URL> \
    --set nfs.path=/ \
    --set storageClass.defaultClass=true
    • Initial install - Local helm install needed
    • Retrying by retrieving the external provisioner and copying locally
    • Default storage class operational
    • Deleted test pod
    • Deleted PVC

Lessons learned​

OCP Deployment​

  • Customer environment heavily affected configuration of the original deployment script and process
    • Security considerations
    • Proxy configurations in setup

Decisions and Action Items (DAI)​

  • Software evaluation licenses for CP4D and watsonx.ai
  • Customer decision is required to determine cluster console access
  • Add documentation for the CP4D deployment

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai

Log 11 🛫

· 4 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • In progress
  3. Install Cloud Pak for Data
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments​

Summary​

  • Significant progress made in applying the required configurations according to the customer's environment policies
  • Master and Worker nodes responding

Script Attempts​

Cleanup Process​

  • Delete metadata file from "wxai" directory
  • Delete stacks created by create_cluster_step_2.sh
  • Remove install state
  • Ignore first "FATAL" error logged when running create_cluster_step_2.sh
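
The local part of the cleanup process above can be scripted so that no stale state is left between attempts. A minimal sketch, assuming the "metadata file" and "install state" are the metadata.json and .openshift_install_state.json files that openshift-install normally writes; the log does not name them, and deleting the CloudFormation stacks is not covered here:

```python
from pathlib import Path

# Assumed names: the files openshift-install normally writes. The log only
# says "metadata file" and "install state".
STATE_FILES = ("metadata.json", ".openshift_install_state.json")


def clean_install_dir(install_dir: Path, remove_ignition: bool = False) -> list[str]:
    """Delete installer state files and return the names that were removed."""
    removed = []
    candidates = [install_dir / name for name in STATE_FILES]
    if remove_ignition:
        # Optionally remove generated ignition configs (*.ign) as well.
        candidates.extend(sorted(install_dir.glob("*.ign")))
    for path in candidates:
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling `clean_install_dir(Path("wxai"), remove_ignition=True)` would cover the ignition file deletion that was later added to the cleanup steps.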

Attempt 1​

  • Communication Issues
    1. httpd not running on bootnode/bastion due to previous reboot
      • Fix: Enable httpd service on OS. Script change also made to force
    2. Egress rules added to bootnode and master
      • All AWS default egress rules had to be manually changed to 10.0.0.0/8 instead of the AWS default of 0.0.0.0/0 ("all" traffic)

Attempt 2​

  • Running OCP process manually (outside of script)
  • Unable to pull images
    • Incorrectly pulling images from the bastion host, should be local registry
    • Temporary fix: Run ./start_registy.sh /ibm 5000
    • Ignition configuration issue causing error
    • Fix: Add deleting state data to the cleanup process

Attempt 3​

  • Starting from scratch
  • Running cleanup process
  • Issue found in authentication configuration. The script is not configured to handle more than one authentication set
    • This customer deployment requires multiple authentication sets: quay.io, Red Hat registry, Artifactory. Only one was tested
  • Using workaround by manually adding the pullsecret to the create_install_config.sh
  • Running create_cluster_step_1.sh
    • Successful
  • Updated LB DNS configurations manually (to be included in code changes, see Attempt 1)
  • Running create_cluster_step_2.sh
  • Stalled - ignitions not firing
    • bootnode and master security groups needed an IP range (additional egress configuration issues, added code changes, see Attempt 1)
  • Error
    • Issue with OpenShift installer extraction. For this customer case, we are not using a local registry
    • Fix: Use the general-use OpenShift installer from Red Hat, which does not assume a local registry
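
The authentication issue in this attempt comes down to combining several registry credentials into the single pull secret that install-config.yaml expects. A minimal sketch of that merge, assuming each credential arrives as a standard `{"auths": {...}}` JSON document; the function name is hypothetical and the sample values in the usage note are placeholders, not real credentials:

```python
import json


def merge_pull_secrets(*secrets: str) -> str:
    """Combine several pull-secret JSON documents into one single-line secret."""
    merged = {"auths": {}}
    for raw in secrets:
        # Later documents override earlier ones for the same registry host.
        merged["auths"].update(json.loads(raw).get("auths", {}))
    return json.dumps(merged, separators=(",", ":"))
```

For example, `merge_pull_secrets('{"auths":{"quay.io":{"auth":"QQ=="}}}', '{"auths":{"registry.redhat.io":{"auth":"Ug=="}}}')` returns one JSON string with both registries, which could replace the manual edit of create_install_config.sh.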

Attempt 4​

  • Running cleanup process
  • Running create_cluster_step_2.sh
  • Updated LB DNS configurations
    • Worker security group needed a valid IP range (in addition to the Attempt 1 and Attempt 2 egress issues, added code changes)
      • Replaced IP ranges 0-0 with 0-65535
  • Unable to use OpenShift API (oc command) to view pods due to use of untrusted certificates
    • Testing workaround: use an "insecure" connection by adding the flag --insecure-skip-tls-verify when using oc
    • Example: oc get pods -A --insecure-skip-tls-verify
  • Removed worker node external volumes (1 TB) from script configuration

Attempt 5​

  • Retrying using default OpenShift Certificates (bypassing/not creating or using the CA certificate from documentation steps)
    • Updated config and removed certificate configuration
  • Running cleanup process
  • Running create_cluster_step_2.sh
  • Witnessing certificate failures in script output, but continuing install
  • Error: Worker nodes not communicating
  • Fix/Root Causes
    • Removed IPI artifact from script
    • Removed {registry_url} image content sources from imageContentSources (Airgap) from install config.sh
    • Removed fips mode from install config.sh

Attempt 6​

  • Running cleanup process
  • Running additional cleanup steps:
    • Removed .kube
    • unset KUBECONFIG
    • Delete ignition file
  • Running create_cluster_step_2.sh
  • Worker nodes responding
  • Certificate error resolved
    • Potential root cause(s) (see fixes from Attempt 5)
      • IPI artifacts in script
      • {registry_url} image content sources from imageContentSources (Airgap) from install config.sh
      • fips mode from install config.sh
  • Errors generated from OpenShift to be tracked in next flight log

Decisions and Action Items (DAI)​

  • Validate cluster installation state
  • Software evaluation licenses for CP4D and watsonx.ai
    • Pending approval process

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai

Log 10 🛫

· 5 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • In progress
  3. Install Cloud Pak for Data
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments​

Validation of current deployment status​

  • Verify ‘quay.io’ is the registry in config.sh

    • Verified in the registry
  • Add /usr/local/bin to .bashrc and .bash_profile for root

  • Create a small instance on a different subnet, same VPC, and confirm that the IP can be curled. May require adjusting the Security Group rules for the bootnode. If 8080 fails, then the HTTPD config will need to be changed to port 80 and the service restarted

    • Initial connectivity over port 8080 failed
    • Fixed by opening port via security group
  • Changed certificate organization (O) to match the domain

  • Cert validated - current certificate using output of openssl x509 -in /ibm/security/certs/ca.crt -text -noout

    Issuer: C = US, O = ec2.internal, CN = CA
    Subject: C = US, O = ec2.internal, CN = CA
  • Changed to

    Issuer: C = US, O = `customer domain name`, CN = CA
    Subject: C = US, O = `customer domain name`, CN = CA
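
The certificate check above can be repeated after each regeneration by parsing the Issuer and Subject lines of the openssl output. A minimal sketch, assuming the simple "KEY = value, KEY = value" layout shown above; values that themselves contain commas are not handled, and the function names are hypothetical:

```python
def parse_dn(line: str) -> dict[str, str]:
    """Parse 'Issuer: C = US, O = example, CN = CA' into a field dictionary."""
    _, _, dn = line.partition(":")
    fields = {}
    for part in dn.split(","):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def organization_matches(line: str, domain: str) -> bool:
    """True when the O field of an Issuer or Subject line equals the domain."""
    return parse_dn(line).get("O") == domain
```

With the original certificate, `organization_matches("Issuer: C = US, O = ec2.internal, CN = CA", "ec2.internal")` is true, while the same call with the customer domain is false until the certificate is regenerated.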

Script Attempts​

Cleanup Steps​

  • Remove metadata file from "wxai" directory
  • Delete stacks created
  • Ignore initial "FATAL" error logged

Attempt 1​

  • Running create_cluster_step_1.sh
    • Applied required security tagging, as customer's security scans "remediate" (delete) improperly tagged items
    • Deployment script changes
      • Changed resource types and sizes. Example: gp2 -> gp3
    • Renaming "bootnode" nomenclature to bastion host bastion.'basedomain'
    • Renamed certificate organization to match customer domain
  • Reran create registry script
  • Proceeded with DNS steps for new Elastic Load Balancer from previous script
  • Running create_cluster_step_2.sh
    • Bootstrap Error
      Every Parameters object must contain a Type member
      An error occurred (ValidationError) when calling the DescribeStacks operation: Stack with id ibmwxai-6wvkv-bootstrap-stack does not exist
    • Solution - Needed to add Type string for parameter
      BootstrapIgnitionLocation:
        Default: s3://my-s3-bucket/bootstrap.ign
        Description: Ignition config file location.
        Type: String  # This line was missing
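
A check like the following could catch the missing Type member before a stack is submitted. It is a sketch only: it assumes the template has already been loaded into a Python dictionary by whatever loader the team uses, and the function name is hypothetical.

```python
def parameters_missing_type(template: dict) -> list[str]:
    """Return the names of template parameters that have no Type member."""
    params = template.get("Parameters") or {}
    return sorted(
        name for name, body in params.items() if "Type" not in (body or {})
    )


if __name__ == "__main__":
    # Mirrors the broken parameter from this attempt.
    template = {
        "Parameters": {
            "BootstrapIgnitionLocation": {
                "Default": "s3://my-s3-bucket/bootstrap.ign",
                "Description": "Ignition config file location.",
            }
        }
    }
    print(parameters_missing_type(template))
```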

Attempt 2​

  • AMI Error (repeat)
    • Cause: Customer security team denies all public AMI access
    • Fixed: Customer approved public AMI usage (for our specific AMI ID for the CoreOS)

Attempt 3​

  • yaml validation error (new) Parameter validation failed: parameter value for parameter name Master1Subnet does not exist. Rollback requested by user
    • Investigated why script is not generating parameter for Master1Subnet
    • Fix: Typo found in script create_control_plane_param.sh - masters1ubnet -> master1subnet
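
A typo of this kind can be found by comparing the parameter names the template declares with the keys the script generates. A minimal sketch, assuming the generated file uses the usual CloudFormation list of ParameterKey/ParameterValue entries and treating every declared parameter as required (parameters with a Default could be excluded); the names in the usage note are illustrative:

```python
def compare_parameters(
    declared: set[str], supplied: list[dict]
) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) parameter names for a generated parameter file."""
    supplied_keys = {entry["ParameterKey"] for entry in supplied}
    missing = sorted(declared - supplied_keys)
    unexpected = sorted(supplied_keys - declared)
    return missing, unexpected
```

A declared `Master1Subnet` paired with a generated `Masters1ubnet` shows up as one missing and one unexpected name, which points straight at the misspelling.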

Attempt 4​

  • Notified of non-compliance during attempt via email from customer security to customer host
    1. Rule Formatting

      • Summary: Automated customer security scan "remediation" removed non-compliant security groups on bootstrap and master (Ingress and Egress)
      Security Event: Security Group with Unapproved Egress. The security group non-compliant egress rules have been deleted. Please check your application to ensure the functionality has not been negatively impacted.
      Security Event: Security Group with Unapproved Ingress. The security group non-compliant ingress rules have been deleted. Please check your application to ensure the functionality has not been negatively impacted.
      • LB template currently assuming public in sg-lb-template.yaml, bootstrap-template.yaml
      CidrIp: 0.0.0.0/0
      • Fix: Every CidrIp value must be replaced with the range approved by customer security
      # Replace all instances of 0.0.0.0/0 with
      CidrIp: 10.0.0.0/8
    2. Encryption Requirements

      • All volumes must be encrypted
      • Fix: CloudFormation template must be updated to create encrypted resources
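
The CidrIp replacement from the rule-formatting fix can be applied to the template text in one pass rather than by hand. A minimal sketch, assuming the values are unquoted and each sits on its own line as in the snippet above; the function name is hypothetical and the default range is the one quoted in this log:

```python
import re

# Matches "CidrIp: 0.0.0.0/0", optionally as a YAML list item, on its own line.
CIDR_LINE = re.compile(
    r"^([ \t]*(?:-[ \t]*)?CidrIp:[ \t]*)0\.0\.0\.0/0[ \t]*$", re.MULTILINE
)


def restrict_cidr(template_text: str, allowed: str = "10.0.0.0/8") -> tuple[str, int]:
    """Replace open CidrIp values and return (new_text, number_of_replacements)."""
    return CIDR_LINE.subn(lambda m: m.group(1) + allowed, template_text)
```

Returning the replacement count makes it easy to confirm that sg-lb-template.yaml and bootstrap-template.yaml were both actually changed before the next attempt.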

Attempt 5​

  • Redeployment cleanup steps
  • Running create_cluster_step_1.sh
    • Successful
  • Capture DNS output
  • Add DNS output to config
  • Running create_cluster_step_1.sh
  • Running create_cluster_step_2.sh
    • Customer had a hard stop for the day. Awaiting feedback for next session

Decisions and Action Items (DAI)​

  • Adding creation of the ssh key as root user on the bastion node
  • CoreOS AMI approval from customer (public AMIs are blocked)
    • AMI approved, step 2 script succeeded AMI portion
  • Customer security policies
    • Ongoing: Port rule formatting (Example: Using 10.0.0.0/8 instead of 0.0.0.0/0)
    • Cleared: Role authorizations
  • Software evaluation licenses for CP4D and watsonx.ai
    • Pending approval process
  • Potential Proxy Configuration Error
    • Prepared code changes for create_cluster_step_2.sh next session

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai

Log 9 🛫

· 3 min read

Objective​

Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation

Milestones​

  1. Deploy and configuration of boot node to establish a beach-head into the customer AWS environment
    • Complete
  2. Deploy OCP using the documented UPI installation steps
    • In progress
  3. Install Cloud Pak for Data
  4. Deploy and configure watsonx.ai on self-managed AWS infrastructure

Today's Accomplishments​

  • Configuration of the boot node
    • Installation of prerequisite software onto the boot node
    • Created and started local registry
    • Generated CA certificate for PKI architecture
  • Completion of step 1 of 2 of the deployment script

Lessons Learned​

  • Storage insufficient on the bootnode for downloaded images, 400 GB minimum required
    • Mitigation: We increased the boot disk size to 500 GB via the AWS console for the EC2 instance. We then grew the disk and grew the filesystem
    • This needs to be added as a prereq
  • There was a constraint in the sg-lb-template.yaml requiring subnets sized from /16 to /24. We removed that constraint
  • Edited bootstrap-template.yaml line 91 to remove the wrong key name. (artifact from testing)
  • Software Evaluation process - define and build internal documentation - TBD
  • Documentation updates
    • Parameter definitions - making them more descriptive
  • Validation checks
    • Creating a validation process that checks for prerequisites before running any scripts/installs
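
The validation process proposed above could start with the two problems hit in this session: too little disk on the bootnode and /usr/local/bin missing from root's PATH. A minimal sketch, assuming the 400 GB figure is read as free space (the log states it only as a minimum) and that the check runs on the Linux bootnode; the function name is hypothetical:

```python
import os
import shutil

MIN_FREE_GB = 400  # From the lesson above; interpreted here as free space.


def check_prerequisites(
    free_bytes: int, path_env: str, required_dir: str = "/usr/local/bin"
) -> list[str]:
    """Return a list of human-readable prerequisite failures (empty if all pass)."""
    problems = []
    free_gb = free_bytes / 1024**3
    if free_gb < MIN_FREE_GB:
        problems.append(f"only {free_gb:.0f} GB free, need at least {MIN_FREE_GB} GB")
    if required_dir not in path_env.split(":"):
        problems.append(f"{required_dir} is not on PATH")
    return problems


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    for problem in check_prerequisites(usage.free, os.environ.get("PATH", "")):
        print("PREREQ FAILED:", problem)
```

Passing the measured values in as arguments keeps the check easy to extend with further prerequisites, such as a stale AWS CLI installation.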

Decisions and Action Items (DAI)​

  • AWS CLI had a previous installation. Had to manually remove that installation and re-run the aws cli install command.
  • We decided to run the installation as root user. Root user needed to have the /usr/local/bin added to the PATH. Did this manually on the fly with an export command.
  • Customer security to approve selected AMI for CoreOS
    • The AMI for CoreOS is a public AMI. The customer security team needs to copy it into the dev account as public AMIs are blocked in this environment

Next Steps​

  • License and configure Cloud Pak for Data
    • Cloud Pak Considerations
      • Security scans needed on container images
      • Customer has no OpenShift experience
      • Customer requires on-prem, offline install
      • Customer uses their own container registry, which might introduce extra effort or compatibility issues
      • Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
      • Supported storage not available
      • Multiple Cloud Paks on the same cluster
      • Custom connections to data sources not supported OOTB
      • AWS-specific: IAM users are required for install/deploy but are not allowed
      • OpenShift-specific: CoreOS requirement for control nodes
      • Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
  • Deploy watsonx.ai

Log 8 🛫

· 2 min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

AWS

  • Fixes to cluster-sts.yaml and other deployment resources.
    • Fixed error in cluster-sts.yml by commenting out lines 590-599.
    • Changed IamInstanceProfile: !Ref BootnodeInstanceProfile to IamInstanceProfile: <InstanceProfileName>
    • Changed SubnetId: !Ref PublicSubnet1ID to SubnetId: <PrivateSubnetID> to account for private deployments
    • Updated LambdaExecutionRole.json line 14: changed ec2.aws.com to lambda.aws.com and added cloudformation.aws.com to the list of allowed services.
    • Fixed LambdaExecutionRole ARN to proper role name.
    • Commented out /bin/bash ./cp-deploy.sh env apply -e env_id=${ClusterName} [--accept-all-licenses]
    • Added VPC and Subnet IDs to the “CleanupLambda” Lambda function in cluster-sts, which then required adding the “ec2:CreateNetworkInterface” permission to LambdaExecutionRole
    • Adding tags to CleanupLambda with Application IDs.
  • Successful deployment of BootNode instance.
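
The LambdaExecutionRole change above concerns which services may assume the role. A minimal sketch of such a trust policy, built as a Python dictionary; it assumes the log's service names abbreviate the full lambda.amazonaws.com and cloudformation.amazonaws.com principals, and the function name is hypothetical:

```python
import json


def build_trust_policy(services: list[str]) -> dict:
    """Return an IAM trust policy that lets the given services assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": sorted(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Assumed full principal names; the log abbreviates them.
    policy = build_trust_policy(["lambda.amazonaws.com", "cloudformation.amazonaws.com"])
    print(json.dumps(policy, indent=2))
```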

RAG

  • Creation of cronjob to capture logs from Python app.
  • Enabled metadata insertion into chunks in vector store -> (hopefully) increases retrieval accuracy
  • Return context to user (shows sources used to generate responses)
  • Added mixtral model support
  • Enable functionality for user to give custom rag parameters
  • Migrated vector DB from FAISS to chromaDB to enable the metadata functionality
  • Script written to easily test rag implementation and save results in csv
  • Implemented cache logic that also considers the combination of parameters before choosing to send a cached response
  • Added better logic for caching to improve performance
  • Remove unwanted parameters from request body
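
The cache item above says a cached response is reused only when the RAG parameters also match. A minimal sketch of one way to do that, hashing the normalized question together with the sorted parameters; the class, the normalization and the parameter names in the usage note are all hypothetical, since the log does not describe the real implementation:

```python
import hashlib
import json


def cache_key(question: str, params: dict) -> str:
    """Build a key from the question and the full parameter combination."""
    payload = json.dumps({"q": question.strip().lower(), "p": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache keyed on question plus RAG parameters."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, question: str, params: dict) -> "str | None":
        return self._store.get(cache_key(question, params))

    def put(self, question: str, params: dict, response: str) -> None:
        self._store[cache_key(question, params)] = response
```

Sorting the keys means `{"top_k": 3, "model": "mixtral"}` and `{"model": "mixtral", "top_k": 3}` hit the same entry, while changing any single value produces a miss.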

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.
  • Updating solution docs with better asset linking.
  • Exploring WatsonX Discovery

Next Steps​

  • Continue over the shoulder working sessions
    • Kick off CloudFormation template install with updated STS templates.
  • Compilation of required endpoints
  • Deploy latest RAG version on AWS
  • Build out actions & flow in Watsonx Assistant after properly defining personas & objectives.
  • Kick off Cloud Pak for Deployment entitlement key.
  • Build RAG application using WatsonX Discovery.
  • Compare WatsonX Discovery RAG with existing RAG results.

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.
  • Red Hat CoreOS AMI pending approval.
  • LambdaCleanup error from not being able to assume role.
  • Double checking role names in Cloudformation template.

Log 7 🛫

· 2 min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

AWS

  • Shift from CP Deployer to OpenShift UPI deployment.
  • Artifactory proxy details procured.
  • Discussion of on-site logistics
  • RHEL 8 AMI changed for BootNode.

RAG

  • Created a cron job to capture logs from the Python app.
  • Enabled metadata insertion into chunks in the vector store, which should (hopefully) increase retrieval accuracy
  • Return context to the user (shows the sources used to generate responses)
  • Added Mixtral model support
  • Enabled users to supply custom RAG parameters
  • Migrated the vector DB from FAISS to ChromaDB to enable the metadata functionality
  • Wrote a script to easily test the RAG implementation and save results to CSV
  • Implemented cache logic that considers the combination of parameters before choosing to send a cached response
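
As a rough illustration of the metadata work listed above, the sketch below attaches a source and position to each chunk so that the sources behind a response can be returned to the user. It is plain Python with no vector store involved. `Chunk`, `chunk_with_metadata`, `sources_for`, the fixed-size splitting, and the metadata field names are all assumptions, not the project's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_with_metadata(text, source, chunk_size=200):
    """Split text into fixed-size chunks and tag each with its origin."""
    chunks = []
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunks.append(
            Chunk(
                text=text[start:start + chunk_size],
                # Hypothetical fields; a real store may record page, title, etc.
                metadata={"source": source, "chunk_index": index},
            )
        )
    return chunks


def sources_for(chunks):
    """Return the distinct sources behind a set of retrieved chunks, in order."""
    seen = []
    for chunk in chunks:
        source = chunk.metadata["source"]
        if source not in seen:
            seen.append(source)
    return seen
```

The list returned by `sources_for` is the kind of context that could be shown alongside a generated answer.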

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.
  • Updating solution docs with better asset linking.

Next Steps​

  • Set up bootnode with necessary downloads and resources.
  • Create CloudFormation templates for the IAM role creation request.
  • Kick off on-site over-the-shoulder working sessions.
  • Collating information and resources to be created via OpenShift UPI deployment.
  • Set up Artifactory proxy.
  • Kick off Cloud Pak for Deployment entitlement key.
  • Deploy latest RAG version on AWS
  • Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.
  • Red Hat CoreOS AMI still pending approval.

Log 6 🛫

· 2 min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

AWS

  • Fixes to cluster-sts.yaml and other deployment resources.
    • Fixed error in cluster-sts.yml by commenting out lines 590-599.
    • Changed IamInstanceProfile: !Ref BootnodeInstanceProfile to IamInstanceProfile: <InstanceProfileName>
    • Changed SubnetId: !Ref PublicSubnet1ID to SubnetId: <PrivateSubnetID> to account for private deployments
    • Updated LambdaExecutionRole.json line 14: changed the service principal from ec2.amazonaws.com to lambda.amazonaws.com and added cloudformation.amazonaws.com to the list of allowed services.
    • Fixed LambdaExecutionRole ARN to proper role name.
    • Commented out /bin/bash ./cp-deploy.sh env apply -e env_id=${ClusterName} [--accept-all-licenses]
    • Added VPC and Subnet IDs to the “CleanupLambda” Lambda function in cluster-sts, which then required adding the “ec2:CreateNetworkInterface” permission to LambdaExecutionRole
    • Adding tags to CleanupLambda with Application IDs.
  • Successful deployment of BootNode instance.
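
For reference, the trust-policy change described above might look like the sketch below, written in Python so the resulting JSON can be printed and inspected. The actual contents of LambdaExecutionRole.json are not shown in this log, so the helper names and the `Resource: "*"` scope are assumptions. AWS service principals take the form `<service>.amazonaws.com`.

```python
import json


def build_trust_policy(services):
    """Return an IAM trust policy letting the given service principals assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": sorted(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_network_interface_statement():
    """Permission statement for a Lambda function attached to a VPC."""
    # Assumption: Resource "*" is used here for brevity only.
    # VPC-attached functions typically also need ec2:DescribeNetworkInterfaces
    # and ec2:DeleteNetworkInterface; the log mentions only the Create action.
    return {
        "Effect": "Allow",
        "Action": ["ec2:CreateNetworkInterface"],
        "Resource": "*",
    }


if __name__ == "__main__":
    policy = build_trust_policy(["lambda.amazonaws.com", "cloudformation.amazonaws.com"])
    print(json.dumps(policy, indent=2))
```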

RAG

  • Created a cron job to capture logs from the Python app.
  • Enabled metadata insertion into chunks in the vector store, which should (hopefully) increase retrieval accuracy
  • Return context to the user (shows the sources used to generate responses)
  • Added Mixtral model support
  • Enabled users to supply custom RAG parameters
  • Migrated the vector DB from FAISS to ChromaDB to enable the metadata functionality
  • Wrote a script to easily test the RAG implementation and save results to CSV
  • Implemented cache logic that considers the combination of parameters before choosing to send a cached response
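
A test script of the kind mentioned above could be shaped like this sketch. The real script is not included in this log. `run_rag_tests` and the `ask` callable are hypothetical stand-ins for whatever function queries the RAG app, and the column layout is an assumption.

```python
import csv


def run_rag_tests(questions, ask, out_path, params=None):
    """Ask each question with the given parameters and write one CSV row per answer."""
    params = params or {}
    # One column per parameter so runs with different settings can be compared.
    fieldnames = ["question", "answer", *sorted(params)]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for question in questions:
            # `ask` is a placeholder for the call into the RAG application.
            answer = ask(question, **params)
            writer.writerow({"question": question, "answer": answer, **params})
    return out_path
```

Recording the parameters next to each answer keeps results from different parameter combinations distinguishable in the same spreadsheet.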

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.
  • Updating solution docs with better asset linking.

Next Steps​

  • Continue over-the-shoulder working sessions
    • Kick off CloudFormation template install with updated STS templates.
  • Compilation of required endpoints
  • Deploy latest RAG version on AWS
  • Build out actions & flow in watsonx Assistant after properly defining personas & objectives.
  • Kick off Cloud Pak for Deployment entitlement key.

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.
  • Red Hat CoreOS AMI pending approval.
  • LambdaCleanup error from not being able to assume role.
  • Double-checking role names in CloudFormation template.

Log 5 🛫

· 2 min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

AWS

  • Populated parameter overrides JSON.
  • Created RH Trial account and uploaded pull secret to S3 bucket.
  • Updated the CloudFormation STS template with permissions to create and assume roles, with the respective JSON versions.

RAG

  • Created a cron job to capture logs from the Python app.
  • Enabled metadata insertion into chunks in the vector store, which should (hopefully) increase retrieval accuracy
  • Return context to the user (shows the sources used to generate responses)
  • Added Mixtral model support
  • Enabled users to supply custom RAG parameters
  • Migrated the vector DB from FAISS to ChromaDB to enable the metadata functionality
  • Wrote a script to easily test the RAG implementation and save results to CSV
  • Implemented cache logic that considers the combination of parameters before choosing to send a cached response

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.
  • Updating solution docs with better asset linking.

Next Steps​

  • Continue over-the-shoulder working sessions
    • Kick off CloudFormation template install with updated STS templates.
  • Compilation of required endpoints
  • Deploy latest RAG version on AWS
  • Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.
  • CoreOS AMI pending approval.

Log 4 🛫

· 2 min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

AWS

  • Reviewed the list of missing values that can be added to the role via policy
  • Sent parameter overrides list to be populated for CloudFormation template installation.
  • Creation of three separate CloudFormation template Roles.
  • Updated CloudFormation templates to use STS.

RAG

  • Created a cron job to capture logs from the Python app.
  • Enabled metadata insertion into chunks in the vector store, which should (hopefully) increase retrieval accuracy
  • Return context to the user (shows the sources used to generate responses)
  • Added Mixtral model support
  • Enabled users to supply custom RAG parameters
  • Migrated the vector DB from FAISS to ChromaDB to enable the metadata functionality
  • Wrote a script to easily test the RAG implementation and save results to CSV
  • Implemented cache logic that considers the combination of parameters before choosing to send a cached response

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.

Next Steps​

  • Continue over-the-shoulder working sessions
    • Kick off CloudFormation template install
  • Compilation of required endpoints
  • Fill out the network values required for OCP deployment.
  • Deploy latest RAG version on AWS
  • Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.
  • Getting access to CoreOS AMI.

Log 3 🛫

· One min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

AWS

  • Discovery of AWS DevOps role to be used and augmented with permissions.
  • Adjusted the check-permissions.sh script to accept a profile.
  • Creation of CloudFormation templates for roles with permissions needed for install.
    • Added --profile and $PROFILE_NAME
  • Adjusted CloudFormation templates to account for roles instead of a user.
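
The profile handling noted above amounts to appending the AWS CLI's global `--profile` option to each call. The contents of check-permissions.sh are not shown in this log, so the sketch below only illustrates the idea in Python. `aws_command` is a hypothetical helper and the example profile name is made up.

```python
import shlex


def aws_command(args, profile=None):
    """Build an AWS CLI argument list, adding --profile only when one is given."""
    command = ["aws", *args]
    if profile:
        # --profile selects a named profile from the local AWS config files.
        command += ["--profile", profile]
    return command


if __name__ == "__main__":
    # "devops" is a placeholder profile name, not one taken from the project.
    print(shlex.join(aws_command(["sts", "get-caller-identity"], profile="devops")))
```

Leaving the option out when no profile is passed keeps the default credential chain in effect.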

RAG

  • App deployed on Fyre VM
  • Added support for granitev2/llama2 70b chat models.
  • watsonx Assistant configured to interact with the app via API for easier testing.

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.
  • Testing the new RAG chunking method.

Next Steps​

  • Continue over-the-shoulder working sessions
  • Compilation of required endpoints
  • Fill out the network values required for OCP deployment.
  • Add Mixtral model to RAG.
  • Deploy latest RAG version on AWS
  • Build out actions & flow in watsonx Assistant after properly defining personas & objectives.

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.

Log 2 🛫

· 2 min read

Objectives​

Accomplishments​

AWS

RAG

  • App deployed on Fyre VM
  • Added support for granitev2/llama2 70b chat models.
  • watsonx Assistant configured to interact with the app via API for easier testing.

In Progress​

  • End-to-end deployment of OCP, CP4D, and watsonx.ai (with GPU node)
  • Tagging cp-deployer.sh generated resources.
  • Testing the new RAG chunking method.

Next Steps​

Tracking (Issues)​

  • Require sign-off on final CloudFormation template.

Log 1 🛫

· One min read

Objectives​

  • Deploy watsonx.ai on self-managed AWS infrastructure.

Accomplishments​

  • Established this GitHub repository as single source of truth for architecture, IaC, and documentation to collaborate with stakeholders.
  • Developed draft CloudFormation template to provision AWS resources.
  • Started incorporating STS into CloudFormation.

In Progress​

  • Awaiting approval for AWS credits to cover infrastructure costs. Following up to expedite.
  • Finalizing deployment plan and cadence.

Next Steps​

  • Review deployment details in working session with stakeholders.
  • Incorporate additional feedback into documentation and IaC templates.
  • Upon AWS credit approval and stakeholder sign-off, begin provisioning.

Tracking (Issues)​

  • Need confirmation of AWS credit approval.
  • Require sign-off on final CloudFormation template.
  • Align on deployment cadence with customer.