Lessons Learned
CloudFormation Template
Functions in CloudFormation Templates
The following code was originally included in the CloudFormation template; however, we discovered that it was formatted in a way that caused errors and prevented the creation of the stack:
SubnetsInVPC:
  Assertions:
    - Assert: !EachMemberIn
        - !ValueOfAll
          - AWS::EC2::Subnet::Id
          - VpcId
        - !RefAll 'AWS::EC2::VPC::Id'
      AssertDescription: All subnets must in the VPC
To fix this, we replaced the short form for intrinsic functions, which uses an exclamation point (!), with the full function name form, denoted by Fn::. This is the updated code:
SubnetsInVPC:
  Assertions:
    - Assert:
        Fn::EachMemberIn:
          - Fn::ValueOfAll:
              - AWS::EC2::Subnet::Id
              - VpcId
          - Fn::RefAll: AWS::EC2::VPC::Id
      AssertDescription: All subnets must in the VPC
Bootnode Subnet
When we ran the CloudFormation template, we were given the following error message:
No default VPC for this user. GroupName is only supported for EC2-Classic and default VPC.
This error was a by-product of running the template without any public subnets attached. The code block below shows that the bootnode requires a public subnet for its SubnetId.
IamInstanceProfile: !Ref BootnodeInstanceProfile
Tags:
  - Key: Name
    Value:
      !Sub
        - "${ClusterName}-bootnode"
        - ClusterName: !Ref ClusterName
InstanceType: t3.large
NetworkInterfaces:
  - GroupSet:
      - !Ref BootnodeSecurityGroup
    AssociatePublicIpAddress: true
    DeviceIndex: '0'
    DeleteOnTermination: true
    SubnetId: !Ref PublicSubnet1ID
UserData:
There are two ways to resolve this problem:
- Use public subnets when creating the stack, or
- Replace SubnetId: !Ref PublicSubnet1ID with a private subnet ID, for example SubnetId: subnet-example5b646
Rewriting Documents & Chunking Experiments
The following experiments were conducted during the course of this POC to evaluate methods for improving LLM output quality, parsing tables, and alternative chunking strategies to optimize NeuralSeek's document retrieval.
Rewriting Documents
From our review of many documents, we noticed that PDF documents are often dense and complex. The content can be challenging even for humans to process when searching for specific information, and sometimes the answers to questions are not explicitly stated in the documents. This can typically be addressed by reformatting and rewriting portions of the documents, which improves both the documents themselves and the chatbot responses.
Chunking
Certain documents contain tables of contents and numbered section headings that make it easy to split the documents into chunks. We experimented with this using certain documents in our knowledge base, splitting them based on section headings. This can be done either with code or manually.
Employing this technique, we were able to achieve a significant increase in the number of times the correct passage from a document was retrieved during a NeuralSeek search. Ensuring that documents are titled correctly is also important for this to work. We only performed chunking manually, but it would be simple to implement a programmatic solution, for example using regular expressions to split documents into parts automatically.
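To illustrate, below is a minimal sketch of such a splitter. It assumes the documents are available as plain text with numbered section headings; the heading pattern, file names, and output naming are hypothetical and would need to be adapted to the actual documents.

import re
from pathlib import Path

# Hypothetical heading pattern: matches lines such as "1. Introduction" or "2.3 Network Setup".
HEADING_PATTERN = re.compile(r"^\d+(\.\d+)*\.?\s+\S.*$", re.MULTILINE)

def split_by_headings(text: str) -> list[str]:
    """Split a document into one chunk per numbered section heading."""
    starts = [match.start() for match in HEADING_PATTERN.finditer(text)]
    if not starts:
        return [text]  # no headings found; keep the document whole
    bounds = starts + [len(text)]
    # Each chunk runs from one heading up to the start of the next heading.
    return [text[bounds[i]:bounds[i + 1]].strip() for i in range(len(starts))]

# Example usage with a hypothetical plain-text export of a PDF document.
source = Path("install_guide.txt")
for index, chunk in enumerate(split_by_headings(source.read_text()), start=1):
    # Name each chunk after the parent document so retrieval results stay traceable.
    Path(f"{source.stem}_part{index:02d}.txt").write_text(chunk)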
Once documents were chunked, we replaced the full documents in the Watson Discovery collection with their split components and removed the originals, to avoid confusing the RAG retrieval.
Recommendation
After our extensive testing, we recommend the following:
- Use a combination of LLM and curation. An LLM is not a perfect solution to everything in the enterprise world; it can produce subpar answers and incur unnecessary costs. Our recommendation is to combine an LLM with curation: first curate existing Q&A pairs, or questions for which we want the same answer every time, and then use the LLM to answer any new queries. Every query sent to the LLM carries a cost, so if an answer already exists in the curated Q&A list, there is no need to spend extra to have the LLM answer it. For new questions, the LLM can search the knowledge repository and provide an answer, which we can then evaluate and curate if needed (see the sketch after this list).
- Improve upon existing complex documents. The existing documentation contains complex tables and text structures. Avoid tables and images when possible, and express the information in natural-language text that the LLM can understand. Rewriting documents in general also helps the LLM understand the content. It is recommended to be more explicit when writing, to avoid section titles that do not provide additional context, and to spell out acronyms, since the LLM may pull sections of text that contain only an acronym without the context of what it stands for. These techniques will significantly improve chatbot results.
- Be specific when querying. It is also recommended to provide enough contextual information in questions so the LLM can understand them and find relevant information. Examples include spelling out acronyms and specifying the type of data being asked about. This helps the LLM understand the question and answer properly.
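As a rough illustration of the first recommendation, the sketch below checks a curated Q&A list before falling back to an LLM call. The curated_answers store, the normalize helper, and the ask_llm function are hypothetical placeholders for whatever curation layer and model endpoint are actually used; they are not part of any specific NeuralSeek API.

from typing import Callable

# Hypothetical curated Q&A store; in practice this could live in a curation
# layer or a small database keyed on normalized question text.
curated_answers = {
    "what instance type does the bootnode use?": "The bootnode uses a t3.large instance.",
}

def normalize(question: str) -> str:
    """Lightly normalize a question so trivial wording differences still match."""
    return " ".join(question.lower().split())

def answer(question: str, ask_llm: Callable[[str], str]) -> str:
    """Return a curated answer when one exists; otherwise fall back to the LLM."""
    curated = curated_answers.get(normalize(question))
    if curated is not None:
        return curated  # curated questions incur no LLM cost
    # New question: let the LLM search the knowledge repository and answer it.
    # The response can later be reviewed and, if useful, added to curated_answers.
    return ask_llm(question)

# Example usage with a stand-in LLM call.
print(answer("What instance type does the bootnode use?", ask_llm=lambda q: "LLM answer for: " + q))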