Poor Training Sets
This weak link affects both traditional machine learning (ML) and the newer GPT-style large language models (LLMs). In fact, it impacts virtually every attempt to process and utilize unstructured text – from taxonomy and knowledge graph building to search, auto-tagging, and customer and business intelligence.
The culprit? Poor and inaccurate training sets (really, any set of targeted documents). If ever there was a poster child for Garbage In, Garbage Out, it is the attempt to build an application on some of the badly selected training sets I’ve seen. This is true whether you are doing ML, building semantic AI categorization rules, or building your own custom LLM – they all depend on starting with good examples/training sets. In the case of LLMs, the training set is the entire usable corpus – the key word being “usable”.
And therein lies the problem.
What Doesn’t Work
What is the problem, you might ask? Just ask your SMEs or taxonomists to pick some good example documents or specify what content goes into the LLM. Sounds easy – but:
Typically, SMEs and taxonomists know too much or too little.
Too much: This often leads to classic distinctions without a difference – category differences that only an expert can understand. For regular users, the result is usually simple confusion.
Too little: SMEs and even taxonomists are rarely equally knowledgeable about all the categories of content, so you get very uneven quality of examples, which undermines every attempt to utilize these training sets.
Why not avoid the problems and just throw all your content into one big bin and let your software figure it all out? After all, isn’t that what ChatGPT did? Not really. OpenAI also hired hordes of low-paid human labelers to fine-tune ChatGPT’s content or “training set”. Also, enterprise AI (and lots of other applications) requires answers that are more precise and more in-depth than you get by simply dumping all your enterprise content into an LLM.
On the other hand, you can just do what a lot of organizations have done for years: have your search engine index all your content and avoid training sets altogether. If that is your approach, I suggest you look at the satisfaction levels for enterprise search engines. A lot of unhappy users.
So, if you want a useful LLM, search engine, or BI/CI application – or want to do any kind of analysis of your unstructured content – you need to be very careful about what content you use to build your app. That means training sets in the broadest sense of the word. And getting good training sets is not easy.
Training Set Woes
Over the years, I’ve seen just about every problem with developing good training sets. Here are a few of the most common ones.
- Wrong and Wildly Wrong Examples – Users Flee
- Cost to Fix Wrong Examples – the Cost That Keeps on Giving
- Different Levels of Fitness of Examples – Good, Bad, Ugly
- Unbalanced Number of Source Documents – the Many Swamp the Few
- Consistency – Humans Disagree – a lot
- What not to Include – As Important as What to Include
- Documents are Not Simple – Need a Content Structure Model
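Two of the problems above – human inconsistency and unbalanced category counts – are easy to measure before they poison a training set. As a small illustrative sketch (the category names and documents here are hypothetical), Cohen’s kappa quantifies how much two annotators agree beyond chance, and a simple count per category flags imbalance:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of documents where both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: probability of agreeing by chance, given each annotator's label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

# Two SMEs labeling the same ten documents (hypothetical data)
sme_1 = ["policy", "policy", "legal", "hr", "hr", "legal", "policy", "hr", "legal", "policy"]
sme_2 = ["policy", "legal", "legal", "hr", "policy", "legal", "policy", "hr", "hr", "policy"]

print(f"Cohen's kappa: {cohens_kappa(sme_1, sme_2):.2f}")  # → Cohen's kappa: 0.55
print(Counter(sme_1))  # per-category counts: a large skew means the many swamp the few
```

A kappa of 0.55 is well below the roughly 0.8 often treated as strong agreement – exactly the “humans disagree – a lot” problem, caught before the training set is built.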
So, does all this mean that we are doomed to low accuracy and high costs in an attempt to utilize unstructured text in AI, search, or other applications?
No, but the way forward is not simple.
General vs. Specific
Some LLMs avoid most of these issues by just dumping in all the text they can find. That is fine for a GenAI that provides general answers, but in the enterprise we want more specific and in-depth answers, often involving complex reasoning – something GPT is still not that good at.
It would be nice if there was a simple solution. A new software product, a new technique, an AI that could build itself. But the reality is more complicated.
The real solution? A multi-path, semi-automated text analytics process that combines the strengths of each component. The basic components are human SMEs, ML, auto-categorization rules, search, prompt engineering, and a team of experienced content curators. What you get is high-quality training sets and smart applications.
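To picture the multi-path idea, here is a deliberately simplified sketch – not the actual process described above, and the function name and thresholds are hypothetical. The point is only that when two independent paths (an ML classifier and an auto-categorization rule) disagree about a candidate example, it is routed to a human curator rather than silently accepted:

```python
def route_document(doc_id, ml_confidence, rule_matched):
    """Combine an ML classifier's confidence score with an auto-categorization
    rule hit, and escalate disagreements to a human curator."""
    if rule_matched and ml_confidence >= 0.8:
        return "auto-accept"   # both paths agree: safe to add to the training set
    if not rule_matched and ml_confidence < 0.4:
        return "auto-reject"   # both paths agree the example does not belong
    return "human-review"      # the paths disagree: a curator decides

# A strong ML score with no rule hit is a disagreement, so it goes to a curator
print(route_document("doc-17", 0.85, rule_matched=False))  # → human-review
```

The design choice worth noting is that automation handles only the clear-cut cases; human expertise is spent where the automated paths conflict, which is where it adds the most value.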
The KAPS Group has developed a set of techniques to semi-automate the development of highly accurate training sets. These techniques can be applied as part of an auto-tagging project, which can feed a new and improved search engine and multiple other applications – customer intelligence, business intelligence, or other analytical applications that use the basic text analytics functions of auto-categorization and data extraction. They can also be used to build enterprise LLMs, taking advantage of our years of experience curating content.
Who should learn about these new techniques? Anyone who is doing or thinking about doing any of the following:
- Develop an enterprise LLM and GPT application
- Develop a new search engine application or improve an existing one
- Develop an auto-tagging categorization capability
- Develop or improve a sentiment analysis application
- Develop or improve a customer or business intelligence application
- OK – any application that uses unstructured text
If this sounds like something that you would like to learn more about, send me an email (firstname.lastname@example.org) to set up a call to discuss. You’ll also receive a white paper detailing key ideas of this approach.