This weak link affects basic machine learning (ML) as well as the newer GPT models and LLMs. In fact, it impacts virtually every attempt to process and utilize unstructured text – from taxonomy and knowledge graph building to search, auto-tagging, and customer and business intelligence.
The culprit? Poor and inaccurate training sets (really, any set of targeted sample documents). If ever there was a poster child for Garbage In, Garbage Out, it is the attempt to build an application using some of the badly selected training sets I’ve seen. This is true whether you are doing ML with vector representations, building semantic AI categorization rules from example documents, or building your own custom LLM using the latest parameter-efficient fine-tuning (PEFT), low-rank adaptation (LoRA), or neurosymbolic architecture methods. They all depend on starting with good examples/training sets.
And therein lies the problem.
What Doesn’t Work
What is the problem, you might ask? Just ask your SMEs to pick some good example documents. Sounds easy – but:
Typically, SMEs know too much or too little. Too much: On a recent project I asked an SME who was very knowledgeable about the enterprise taxonomy about three categories that were very similar. She provided an excellent, sophisticated explanation of the differences. The problem? No one but another taxonomy expert would ever use those distinctions in the real world of enterprise applications. They were classic distinctions without a difference. The result for regular users is usually simple confusion.
Too little: One approach is to hire one or more taxonomists and ask them for good examples. Unfortunately, taxonomists are rarely equally knowledgeable about all categories in a taxonomy, so you get very uneven quality of examples, which impacts all attempts to utilize these training sets. The problem is worse when you ask a few SMEs who know all the areas in general but lack experience with some categories (a particular problem with new, emerging categories) to suggest example documents. For example, I’ve seen situations where one category had 100% great examples, but another, chosen by the same person, had 0% even mildly accurate examples.
Why not avoid the problems and just throw all your content into one big bin and let your software figure it all out? After all, isn’t that what ChatGPT did? Not really. OpenAI also hired hordes of low-paid humans to fine-tune its LLM. So, unless you also have hordes of low-paid humans and/or are satisfied with generic, general answers, this doesn’t work very well. Enterprise AI (and lots of other applications) requires answers that are more precise and more in-depth than you get from simply adding all your enterprise content to an LLM.
On the other hand, you can just do what a lot of organizations do and have your search engine index all your content, avoiding training sets altogether. If that is your approach, I suggest you look at the satisfaction levels for enterprise search engines that don’t use taxonomies, knowledge graphs, or some other form of knowledge structure.
So, if you want a useful LLM, search engine, or BI/CI application, or want to do any kind of analysis of your content, you need to be very careful about what content you use to build your app. Which means training sets in the broadest sense of the term. However, getting good training sets is not easy.
Training Sets Woes
Over the years, I’ve seen just about every problem with developing good training sets. Here are a few of the most common ones.
Wrong and Wildly Wrong
It is bad enough when 25% of your training set is wrong (a lot of extra garbage), but it is even worse when the example documents are wildly wrong. And an average of 25% wrong typically means that some categories are 100% bad examples. These wildly wrong documents not only degrade the accuracy of any application, they frequently cause users to lose confidence in the whole system or application.
False positives can produce wrong answers regardless of approach – ML, LLM, or semantic rules. And again, how many answers are wrong, and even more important, how wildly wrong they are, can lead people to lose confidence in the system. False negatives are even more of an issue: they represent concepts that are not covered by the training sets, and missing concepts are particularly difficult to fix. How do you know something is missing? How do you know that the answer you got is the best answer?
A related issue is the difference in fitness of examples. Typically, sets of documents will contain really good examples, good examples, OK examples, and not-very-good examples. How do you rank them in your application? Who does the ranking? One method is to segment documents into paragraphs or smaller segments. This works well for smaller documents when the categorization task is simple sentiment, but not so well for larger documents that are being analyzed for a variety of concepts.
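The segmentation idea above can be sketched in a few lines. This is a hypothetical illustration, not the author's actual tooling; the blank-line splitting rule and the `max_chars` threshold are assumptions chosen for demonstration.

```python
# Hypothetical sketch: split a document into paragraph-level segments so that
# short, single-topic chunks (e.g., for sentiment tagging) can be labeled
# individually. Splitting rule and size threshold are illustrative assumptions.

def segment_document(text, max_chars=1000):
    """Split on blank lines, merging consecutive paragraphs up to max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    segments, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            segments.append(current)  # flush the full segment
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        segments.append(current)
    return segments
```

For small documents each segment maps cleanly to one label; for large multi-topic documents, as noted above, segment boundaries rarely line up with concept boundaries.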
Another issue is the impact of having some categories with many more examples than others in your overall document set. On one recent project, we had 45,000 examples of one concept and one or two examples of others. There are ways of alleviating some of the imbalance, but none of them come without costs. If you set a minimum and take only 20 or 40 examples from each concept, that will still leave a number of concepts with significantly fewer examples. This is particularly an issue for ML approaches and any attempt to build a custom enterprise LLM. In addition, if you select only 20 or 40 examples from a concept that has 45,000, how do you know that those examples contain the most important evidence terms? You don’t, without a major effort.
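The capping approach described above can be sketched as follows. This is a minimal illustration assuming labeled documents arrive as (category, document) pairs; the cap of 40 mirrors the figures discussed above, and random sampling is the naive baseline whose weakness (no guarantee of capturing the key evidence terms) is exactly the point made in the text.

```python
import random
from collections import defaultdict

def cap_per_category(labeled_docs, cap=40, seed=0):
    """Reduce imbalance by capping each category at `cap` examples.

    labeled_docs: list of (category, document) pairs (assumed format).
    Categories below the cap keep all their examples, so the imbalance
    is reduced, not eliminated.
    """
    by_cat = defaultdict(list)
    for category, doc in labeled_docs:
        by_cat[category].append(doc)
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    capped = {}
    for category, docs in by_cat.items():
        if len(docs) > cap:
            # Naive random sample: no guarantee these 40 contain the
            # most important evidence terms, as the text notes.
            docs = rng.sample(docs, cap)
        capped[category] = docs
    return capped
```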
Another issue is that in many ways consistency is as important as accuracy. One of the techniques my group has used when fine-tuning semantic rules is to have three humans and one AI tag a set of sample documents. What we found, on average, was that the three humans agreed with each other only about 75% of the time. Applied to developing training sets, this means that if you use SMEs to develop them, you will likely get about 25% wrong examples according to one or more of your SMEs. This brings up another topic for another blog post: how do you measure accuracy? More on that next time.
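Measuring that kind of inter-annotator agreement is straightforward to automate. A minimal sketch, assuming each annotator tags the same documents in the same order (the labels below are invented for illustration; the ~75% figure in the text is an empirical observation, not a property of this code):

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: dict of annotator -> list of labels (same doc order).

    Returns the fraction of documents on which each pair of annotators
    assigned the same label.
    """
    rates = {}
    for (a, la), (b, lb) in combinations(annotations.items(), 2):
        matches = sum(1 for x, y in zip(la, lb) if x == y)
        rates[(a, b)] = matches / len(la)
    return rates
```

Chance-corrected measures such as Cohen's or Fleiss' kappa are the standard refinement when some labels are far more frequent than others.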
What Not to Include
Another issue is deciding what not to include. It is typically easy to not include a variety of documents that are clearly inappropriate – often special use or systemic documents. But what about documents that are about multiple topics? For example, a document that is mostly about topic X but contains a significant amount of text about topic Y – should this be included or not? If not, you get a pure example that might not reflect the reality of mixed topics. If yes, there will be a significant amount of text that is not about topic X which will pollute your training sets.
Documents are Not Simple
A large minority, if not the majority, of enterprise documents are about a number of different ideas or concepts. In addition, not all parts of a document are equal, and the words and phrases within those parts should not be given equal weight.
One method that works well for these more complex documents is to develop a content structure model that weights different parts of the document differently, including excluding some parts entirely. This is a technique we have used to increase accuracy by 15 to 50%.
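The weighting idea can be sketched simply. The section names and weight values below are illustrative assumptions, not the author's actual model; the point is that title and summary text count more toward categorization evidence, while boilerplate is excluded entirely.

```python
# Hypothetical content structure model: weight terms by the document
# section they appear in; a weight of 0.0 excludes the section.
SECTION_WEIGHTS = {
    "title": 3.0,
    "summary": 2.0,
    "body": 1.0,
    "boilerplate": 0.0,  # legal footers, navigation, etc. are excluded
}

def weighted_term_counts(sections):
    """sections: dict of section name -> text (assumed structure).

    Returns term -> weighted count, ready to feed a categorizer.
    """
    counts = {}
    for name, text in sections.items():
        weight = SECTION_WEIGHTS.get(name, 1.0)
        if weight == 0.0:
            continue  # drop excluded sections entirely
        for term in text.lower().split():
            counts[term] = counts.get(term, 0.0) + weight
    return counts
```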
In addition to the direct impact of low accuracy, the other major drawback of poor training sets is the time and effort it takes to overcome that low accuracy. In the project referenced above, we were tasked with building semantic AI categorization rules and were able to achieve 95%+ accuracy, but it took a lot of extra time and money. In that case, the extra time happened to coincide with a major change in the client’s organization, which led to the abandonment of the project. It is unlikely that extra time and money on your project would lead to that drastic a result, but it does significantly decrease the project’s ROI.
So, does all this mean that we are doomed to low accuracy and high costs in an attempt to utilize unstructured text in AI, search, or other applications?
No, but the way forward is not simple.
Generalists vs. Specialists
LLMs avoid most of these issues by simply ingesting all the text they can find, which is fine for a GenAI that provides general answers, but in the enterprise we want more specific and in-depth answers.
The current enterprise methods of building targeted sets of documents all have serious issues.
- Human editors or SMEs – very expensive, and humans are very inconsistent
- Using search to find good examples – low accuracy, too many false positives, missing key documents and concepts due to variations in vocabulary
- Unsupervised ML – just as bad or worse than search, particularly missing key concepts
- Supervised ML – this is built on one of the above methods as you need examples first
What they have in common is the attempt to do it all with one technique (specialists), but a better answer is to utilize a number of techniques in combination (generalists).
And while OpenAI did a great job building an LLM for public content, their approach of using hordes of low-paid humans is way too expensive for most organizations. Also, we know that the content and vocabularies behind an enterprise firewall are very different from public content. And, more importantly, the questions asked of enterprise content are usually more precise than the general questions that a ChatGPT can answer.
It would be nice if there was a simple solution. A new software product, a new technique, an AI that could build itself. But the reality is more complicated.
The real solution? A multi-path text analytics semi-automated process that combines the strengths of each component. The basic components are human SMEs, ML, auto-categorization rules, search, Gen AI and using prompt engineering to build training sets, and a team of experienced content curators. While that might sound like overkill and the most expensive way of doing targeted document sets, it is actually less expensive as each component is only asked to do what it is best at and not try to do it all with any one component – and the quality of the set and subsequent applications is way higher.
The KAPS Group (a group of generalists) has developed a set of techniques to semi-automate the development of highly accurate training sets. These techniques can be done as part of an auto-tagging project which can feed a new and improved search engine and multiple other applications such as customer intelligence or business intelligence or other analytical applications using basic text analytics functionality of auto-categorization and data extraction.
The first step in virtually all of our projects is to determine the right combination of techniques for each situation through a series of interviews and preliminary tests.
Our approach is based on combining AI and humans using a model that goes back to Cyborg Chess and incorporates the latest neurosymbolic architectures. Cyborg Chess refers to a little-known coda to a well-known chess match: Garry Kasparov vs. IBM’s Deep Blue. Everyone knows Deep Blue won. But not many people outside the chess world know that in response to his defeat, Kasparov developed an approach that combined computer chess programs with a human grandmaster – Cyborg Chess. The grandmaster provided theoretical knowledge to supplement the computer’s ability to process millions of possible moves per second. Cyborg Chess can beat computer chess. And Cyborg training sets can out-perform either all-human or all-computer approaches.
For more on the combination of text analytics and LLM/GPT, see my article, What is Smarter and Safer than LLM/GPT?
Who should learn about these new techniques? Anyone who is doing or thinking about doing any of the following:
- Develop an enterprise LLM or GPT application
- Develop or improve a search engine application
- Develop an auto-tagging categorization capability
- Develop or improve a sentiment analysis application
- Develop or improve a customer or business intelligence application
- OK – any application that uses unstructured text
If this sounds like something that you would like to learn more about, fill out the form to set up a call to discuss.