Log 21
Objective
Deploy watsonx.ai on self-managed AWS infrastructure for customer software evaluation
Milestones
- Deploy and configure the boot node to establish a beachhead in the customer AWS environment
  - Complete
- Deploy OCP using the documented UPI installation steps
  - Complete
- Install Cloud Pak for Data
  - In Progress
- Deploy and configure watsonx.ai on self-managed AWS infrastructure in the reference environment, and document the process
  - In Progress
Today's Accomplishments
- Successful deployment of CP4D
- Installed Knative
Summary
- Customer has approved the required contracts, and procedures have been followed to obtain an entitlement key
- Updated bootnode IP reference in configuration
- Re-ran install scripts
- MCG Secrets created for Cloud Pak components
- Verified the cluster is up on v4.12.8
- No errors
- Checked storage size on nodes: they have 5 TB disks instead of the intended 500 GB
- This was set incorrectly in the config; reconfiguring worker nodes with the proper template (worker-template.yaml)
- Secondary disks are not needed on the nodes; NFS will be used instead
- Adding GPU node
- Updated config.sh: gpu_subnet is accurate and security groups are set properly
- Logging into AWS via aws_sso
- Running add_node.sh to add the GPU node (runs for about 10 minutes)
- Verifying node draining and uncordoning the node
- Installing NFS provisioner
- Operator install unsuccessful
  - Fell back to a Helm install
  - Downloaded the Helm repo and installed the chart
  - Tested and verified with a test pod attached to the nfs-client StorageClass, then cleaned up
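
The Helm fallback and the storage check can be sketched roughly as follows. This is a minimal sketch, not the commands actually run: it assumes the kubernetes-sigs nfs-subdir-external-provisioner chart (whose default StorageClass name is nfs-client), and the resource names nfs-test-claim and nfs-test-pod, the busybox image, and the `run` dry-run helper are hypothetical. Nothing is executed unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch only: chart choice, resource names, and image are assumptions.

# Print a command instead of running it unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Helm fallback: add the chart repo and install the provisioner.
# $1 = NFS server address, $2 = exported path (both supplied by the caller).
install_nfs_provisioner() {
  local nfs_server="$1"
  local nfs_path="$2"
  run helm repo add nfs-subdir-external-provisioner \
    https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
  run helm install nfs-subdir-external-provisioner \
    nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
    --set "nfs.server=${nfs_server}" \
    --set "nfs.path=${nfs_path}"
}

# Emit a throwaway PVC plus a pod that writes one file to the volume.
# $1 = StorageClass name (defaults to nfs-client).
render_test_manifests() {
  local storage_class="${1:-nfs-client}"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test-claim
spec:
  storageClassName: ${storage_class}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: nfs-test-pod
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: busybox:stable
      command: ["sh", "-c", "touch /mnt/SUCCESS"]
      volumeMounts:
        - name: nfs-vol
          mountPath: /mnt
  volumes:
    - name: nfs-vol
      persistentVolumeClaim:
        claimName: nfs-test-claim
EOF
}
```

To try it, pipe the manifests to the cluster with `render_test_manifests | oc apply -f -`, confirm the claim is bound with `oc get pvc nfs-test-claim`, and clean up with `render_test_manifests | oc delete -f -`.
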
Decisions and Action Items (DAI)
- Authorized Instance Topology
Lessons Learned
- Had an issue with "apply-cluster-components", which requires connecting to GitHub to download CASE files. Found a solution in the cpd-cli documentation: add two flags to the command, "--case_download=true" and "--from_oci=true", which tell cpd-cli to download the CASE files from IBM's Open Container Initiative (OCI) registry instead of GitHub.
- While running "setup-instance-topology" for Knative, received an error regarding storage. Adding "--block_storage_class=${STG_CLASS_BLOCK}" to the command let it complete successfully.
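
The two fixes above might look like this when assembled into full commands. This is a sketch under assumptions: only --case_download, --from_oci, and --block_storage_class come from this log; the remaining options (--release, --license_acceptance, --cpd_operator_ns, --cpd_instance_ns) and the environment variable names other than STG_CLASS_BLOCK are assumed from typical cpd-cli usage, and a given release may require further options. The `cpd_manage` wrapper is a hypothetical helper that prints the command unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch only. Flags other than --case_download, --from_oci, and
# --block_storage_class are assumptions about the cpd-cli release in use.

# Print the command by default; execute it only when RUN=1.
cpd_manage() {
  if [ "${RUN:-0}" = "1" ]; then
    cpd-cli manage "$@"
  else
    echo "cpd-cli manage $*"
  fi
}

# Download CASE files from the OCI registry instead of GitHub.
apply_cluster_components() {
  cpd_manage apply-cluster-components \
    --release="${VERSION}" \
    --license_acceptance=true \
    --case_download=true \
    --from_oci=true
}

# Pass the block storage class explicitly to avoid the storage error.
setup_instance_topology() {
  cpd_manage setup-instance-topology \
    --release="${VERSION}" \
    --cpd_operator_ns="${PROJECT_CPD_INST_OPERATORS}" \
    --cpd_instance_ns="${PROJECT_CPD_INST_OPERANDS}" \
    --license_acceptance=true \
    --block_storage_class="${STG_CLASS_BLOCK}"
}
```

Calling `apply_cluster_components` or `setup_instance_topology` with the variables exported prints the command for review; rerun with RUN=1 to execute it.
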
Next Steps
- License and configure Cloud Pak for Data
- Cloud Pak Considerations
  - Security scans needed on container images
  - Customer requires on-prem, offline install
  - Customer uses their own container registry, which might introduce extra effort or compatibility issues
  - Version compatibility with OpenShift (e.g. 4.10 required and customer has 4.11)
  - Supported storage not available
  - Multiple Cloud Paks on the same cluster
  - Custom connections to data sources not supported out of the box (OOTB)
  - AWS-specific: IAM users are required for install/deploy but are not allowed
  - OpenShift-specific: CoreOS requirement for control nodes
  - Automatic updating of Cloud Pak can interrupt engagements (solution: always remove update polling from operators)
- Deploy watsonx.ai
- Application configurations
- Application validations
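
For the automatic-update consideration above, one possible reading of "remove update polling from operators" is to drop registry polling from the OLM CatalogSource, optionally combined with manual install plan approval on each Subscription. The sketch below assumes that reading; the catalog, subscription, and namespace names are supplied by the caller, and the `run` helper is a hypothetical dry-run wrapper that only prints unless RUN=1 is set.

```shell
#!/usr/bin/env bash
# Sketch only: which catalogs and subscriptions to patch is an assumption
# left to the caller.

# Print a command instead of running it unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Remove spec.updateStrategy (registry polling) from a CatalogSource.
# The JSON patch fails if the field is not present on the resource.
# $1 = CatalogSource name, $2 = namespace (defaults to openshift-marketplace).
disable_catalog_polling() {
  local catalog="$1"
  local namespace="${2:-openshift-marketplace}"
  run oc patch catalogsource "${catalog}" -n "${namespace}" --type=json \
    -p '[{"op": "remove", "path": "/spec/updateStrategy"}]'
}

# Require manual approval before OLM applies an operator upgrade.
# $1 = Subscription name, $2 = namespace.
set_manual_approval() {
  local subscription="$1"
  local namespace="$2"
  run oc patch subscription "${subscription}" -n "${namespace}" --type=merge \
    -p '{"spec": {"installPlanApproval": "Manual"}}'
}
```

Running either function without RUN=1 prints the `oc patch` command so it can be reviewed before anything on the cluster changes.
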