What Is Text Analytics

Semantic AI: A Platform for Multiple Applications

Text analytics, also called semantic AI, is an umbrella term that covers a wide range of capabilities and applications. At its core, it is the use of software and knowledge models to analyze text and extract key information. The two most important functions are auto-categorization and data extraction:

Auto-categorization – identifying the main idea(s) of documents, usually using a taxonomy. This is done with machine learning or semantic rules.
Data extraction – identifying entities (people, places, organizations, etc.) and relationships, such as companies merging.
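
As a rough illustration of data extraction, the sketch below finds organizations and one relationship type (a merger) in a sentence. It is a minimal example, not how any particular product works: the organization list, the merger phrasing, and the function names are all hypothetical, and a real system would rely on a far larger knowledge model.

```python
import re

# Hypothetical gazetteer of organization names (an assumption for this sketch).
ORGANIZATIONS = ["Acme Corp", "Globex", "Initech"]

# Illustrative phrasing for a single relationship type: "<org> ... merge with <org>".
MERGER_VERBS = r"(?:merges|merged|is merging|will merge|to merge)\s+with"


def extract_organizations(text):
    """Return known organization names found in the text, in order of first appearance."""
    found = []
    for name in ORGANIZATIONS:
        match = re.search(r"\b" + re.escape(name) + r"\b", text)
        if match:
            found.append((match.start(), name))
    return [name for _, name in sorted(found)]


def extract_mergers(text):
    """Return (organization, organization) pairs linked by a merger phrase."""
    org = "|".join(re.escape(name) for name in ORGANIZATIONS)
    pattern = re.compile(rf"({org})\s+{MERGER_VERBS}\s+({org})")
    return [(m.group(1), m.group(2)) for m in pattern.finditer(text)]


text = "Globex announced that Acme Corp will merge with Initech next year."
print(extract_organizations(text))  # ['Globex', 'Acme Corp', 'Initech']
print(extract_mergers(text))        # [('Acme Corp', 'Initech')]
```

Note that the entity step only reports what is mentioned, while the relationship step reports how two entities are connected.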

In addition, most software also offers supplementary functions such as:

NLP – underlying functionality to identify parts of speech.
Text mining – extracting patterns of words and phrases, including counts and collocations.
Auto-summarization – typically selecting 3-5 key sentences.
Clustering – grouping co-occurring terms, which is sometimes labeled as an automatic taxonomy.
Sentiment analysis (characterizing the sentiments expressed in documents) and intent analysis are special cases of auto-categorization.
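
To make two of these functions concrete, the sketch below combines simple text mining (word counts) with auto-summarization by sentence selection. It is one simple approach among many, assuming that sentences containing the most frequent words are the key ones; the stopword list and function names are hypothetical, and commercial tools use more sophisticated scoring.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=3):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))  # the text-mining step: word counts

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


doc = ("Taxonomy drives search. Taxonomy improves search accuracy. "
       "The weather was mild. Taxonomy projects need governance.")
print(summarize(doc, max_sentences=2))
# ['Taxonomy drives search.', 'Taxonomy improves search accuracy.']
```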

Text Analytics and Artificial Intelligence

Adding More Intelligence to Artificial Intelligence

Until the launch of ChatGPT, Artificial Intelligence’s success was almost entirely with structured data, not unstructured text. This is because AI, even the newest Large Language Models (LLMs) such as ChatGPT, is based on patterns in text, not the meaning of that text.

As a result, amazing as ChatGPT and the other Large Language Models are, they have a number of basic limitations:

They have a tendency to hallucinate.
They are entirely opaque – even their creators don’t know why they say what they do.
They were trained on public information, which is very different from content behind an enterprise firewall.
Public information contains a great deal of bias.

Text analytics, as practiced by the KAPS Group, can help with all four limitations.

For help with hallucinations and bias, see Using Text Analytics to fix Fake News.
For help with transparency, a semantic AI approach to auto-categorization can open a window into GPT’s output.
For help with enterprise vocabulary, that is what we have been doing for over 15 years: building text analytics enterprise solutions – see the range of articles and presentations on this site.

Text Analytics Basics

There are two major ways of doing text analytics. The first is machine learning (ML). The second is Semantic AI.

Machine learning typically uses neural networks to model documents and the data in those documents. These networks are trained using sets of example documents. The big advantage of ML is that, as the name implies, it “learns”: the elements of the neural networks get stronger with use and produce better results. The big disadvantages of ML are its lack of transparency, the relatively low accuracy it can achieve, and the amount of content that is needed to train the systems.
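
The idea of training on example documents can be sketched with a single-layer perceptron, used here only as the simplest stand-in for a neural network; production systems use far larger models. The training examples, labels (+1 for a "mergers" category, -1 for everything else), and function names are all made up for illustration.

```python
from collections import Counter, defaultdict


def featurize(text):
    """Bag-of-words features: word -> count."""
    return Counter(text.lower().split())


def train_perceptron(examples, epochs=10):
    """Learn word weights from (text, label) pairs, where label is +1 or -1."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = featurize(text)
            score = bias + sum(weights[w] * c for w, c in feats.items())
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # On a mistake, nudge the weights toward the correct label.
                for w, c in feats.items():
                    weights[w] += label * c
                bias += label
    return weights, bias


def predict(weights, bias, text):
    """Classify new text with the learned weights."""
    feats = featurize(text)
    score = bias + sum(weights.get(w, 0.0) * c for w, c in feats.items())
    return 1 if score > 0 else -1


# Hypothetical training set: +1 = mergers, -1 = other.
examples = [
    ("merger acquisition deal announced", 1),
    ("quarterly merger talks continue", 1),
    ("team wins championship game", -1),
    ("coach praises team after game", -1),
]
weights, bias = train_perceptron(examples)
print(predict(weights, bias, "merger deal"))        # 1
print(predict(weights, bias, "championship game"))  # -1
```

The learned weights are just numbers attached to words, which hints at the transparency problem: even in this tiny model, nothing states why a document belongs to a category.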

Semantic AI uses sets of words and phrases to determine the meaning of documents. Semantic AI builds rules that apply those words and phrases to documents. The big advantages are higher accuracy and transparency, that is, humans can understand why the software assigns a specific tag or extracts specific data. The disadvantage is its lack of learning, which means the rules must be updated periodically.
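
A minimal sketch of the rule-based approach might look like the following. The two-category taxonomy, its terms, and the `min_hits` threshold are hypothetical, and real semantic rules are usually richer (phrases, proximity, exclusions). The point is that the output includes the matched terms, so a human can see why a tag was assigned.

```python
import re

# Hypothetical taxonomy: category -> words and phrases that signal it.
RULES = {
    "Mergers & Acquisitions": ["merger", "acquisition", "takeover"],
    "Sports": ["championship", "coach", "tournament"],
}


def categorize(text, rules=RULES, min_hits=1):
    """Return {category: matched terms} for categories with enough distinct matches."""
    lowered = text.lower()
    result = {}
    for category, terms in rules.items():
        matched = [t for t in terms
                   if re.search(r"\b" + re.escape(t) + r"\b", lowered)]
        if len(matched) >= min_hits:
            result[category] = matched
    return result


text = "The coach discussed the takeover and the merger."
print(categorize(text))
# {'Mergers & Acquisitions': ['merger', 'takeover'], 'Sports': ['coach']}
print(categorize(text, min_hits=2))
# {'Mergers & Acquisitions': ['merger', 'takeover']}
```

Raising `min_hits` trades recall for precision, and keeping the rules current means editing the term lists by hand, which is the maintenance cost noted above.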

It is often claimed that ML is easier and quicker to develop, but the reality is that selecting good example documents is difficult and time-consuming, especially when many thousands or hundreds of thousands of good example documents are needed.

The current best practice is to combine ML and Semantic AI to get the best of both worlds.
The best way to learn more about text analytics is, of course, to buy my book, Deep Text.