What We're Up To

What Is Text Analytics?

A Tool to Build a Semantic Infrastructure: A Platform for Multiple Applications

Text analytics is a broad umbrella term covering a large array of capabilities and applications, but the basic definition is simple: the use of software and knowledge models to analyze text and extract key information. The two most important functionalities are auto-categorization and data extraction:

  • Auto-categorization – identifying the main idea(s) of documents, usually using a taxonomy. This is done with machine learning or semantic rules.
  • Data extraction – pulling out entities (people, places, organizations, etc.) and relationships, such as two companies merging.
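To make the two functionalities concrete, here is a minimal sketch of both, using simple keyword rules and regular expressions. The mini-taxonomy, terms, and patterns are invented for illustration; real systems use far richer knowledge models.

```python
import re

# Illustrative mini-taxonomy: each category maps to evidence terms
# (an assumption for this sketch, not a real enterprise taxonomy).
TAXONOMY = {
    "Finance": {"merger", "acquisition", "revenue"},
    "HR": {"hiring", "benefits", "payroll"},
}

def auto_categorize(text):
    """Assign the category whose evidence terms appear most often."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & terms) for cat, terms in TAXONOMY.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def extract_orgs(text):
    """Extract capitalized names followed by a corporate suffix."""
    return re.findall(r"[A-Z][A-Za-z]+ (?:Inc|Corp|Ltd)", text)

doc = "Acme Corp announced a merger with Beta Inc, boosting revenue."
print(auto_categorize(doc))  # categorization: the document's main topic
print(extract_orgs(doc))     # extraction: the organizations mentioned
```

The point of the sketch is the division of labor: categorization answers "what is this document about?" while extraction answers "what specific things does it mention?"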

In addition, most software also offers supplementary functions such as:

  • NLP – underlying functionality to identify parts of speech
  • Text mining – extracting patterns of words and phrases, including counts and collocations
  • Auto-summarization – typically selecting the 3-5 key sentences of a document
  • Clustering – grouping co-occurring terms, which is sometimes labeled an automatic taxonomy
  • Sentiment analysis (characterizing the sentiments expressed in documents) and intent analysis – special cases of auto-categorization
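As an example of the text-mining function above, the sketch below counts which word pairs co-occur within a small sliding window, the kind of collocation statistic that clustering and term-extraction features are built on. The window size and sample texts are arbitrary choices for illustration.

```python
import re
from collections import Counter

def cooccurrences(texts, window=5):
    """Count pairs of words that co-occur within a sliding window."""
    pairs = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        for i, w in enumerate(words):
            for other in words[i + 1:i + window]:
                if w != other:
                    # sort the pair so (a, b) and (b, a) count together
                    pairs[tuple(sorted((w, other)))] += 1
    return pairs

docs = ["text analytics extracts meaning",
        "semantic text analytics adds meaning"]
counts = cooccurrences(docs)
print(counts.most_common(3))  # the strongest collocations in the corpus
```

High co-occurrence counts are the raw material for clustering: terms that repeatedly appear together get grouped, which is why the output can resemble (but is not) a taxonomy.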

 

Text Analytics and Artificial Intelligence

Adding More Intelligence to Artificial Intelligence

Artificial Intelligence (AI) has had huge success dealing with data, but much less success dealing with language, concepts, and text. This is because AI, even the newest Large Language Models (LLMs) such as ChatGPT, is based on patterns in text, not the meaning of that text. And as impressive as ChatGPT is, it has two major limitations when it comes to enterprise text. First, it was trained on public information, and as we have learned from 20 years of experience, enterprise content and vocabulary are very different. Second, ChatGPT is often wrong on facts, with a well-known propensity to “hallucinate”.

Text analytics, using a semantic AI approach, can add that meaning dimension and fix the limitations of LLMs.
There are two major ways of doing text analytics: machine learning (ML) and Semantic AI.

Machine learning typically uses neural networks to model documents and the data in those documents. These networks are trained on sets of example documents. The big advantage of ML is, as the name implies, that it “learns”: the elements of the neural networks get stronger with use and produce better results. The big disadvantages are lack of transparency, the relatively low accuracy it can achieve, and the amount of content needed to train the systems.
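The "trained on example documents" idea can be shown with a deliberately tiny model. The sketch below is a bag-of-words Naive Bayes classifier rather than a neural network (a simplification so the example stays self-contained); the training documents and labels are invented.

```python
import math
import re
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """A minimal bag-of-words classifier, trained on example documents."""

    def fit(self, docs, labels):
        self.vocab = set()
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            words = re.findall(r"[a-z]+", doc.lower())
            self.vocab.update(words)
            self.word_counts[label].update(words)
        return self

    def predict(self, doc):
        words = re.findall(r"[a-z]+", doc.lower())
        best, best_lp = None, float("-inf")
        for label, n in self.class_counts.items():
            # log prior + log likelihood with add-one smoothing
            lp = math.log(n / sum(self.class_counts.values()))
            total = sum(self.word_counts[label].values())
            for w in words:
                lp += math.log((self.word_counts[label][w] + 1) /
                               (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

train_docs = ["merger revenue profit", "payroll benefits hiring",
              "acquisition revenue", "hiring benefits"]
train_labels = ["Finance", "HR", "Finance", "HR"]
clf = TinyNaiveBayes().fit(train_docs, train_labels)
print(clf.predict("quarterly revenue and merger news"))
```

Even this toy version shows both ML disadvantages mentioned above: the learned counts do not explain themselves the way a written rule does, and with only four training documents the model is fragile, which is why real systems need thousands of examples.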

Semantic AI uses sets of words and phrases to determine the meaning of documents, building rules that apply those words and phrases to documents. The big advantages are higher accuracy and transparency, that is, humans can understand why the software assigned a specific tag or extracted specific data. The main disadvantage is its lack of learning, which means the rules must be updated periodically.
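A minimal sketch of the rule-based approach: each rule is a set of required, alternative, and excluded terms, evaluated against a document. The rule contents are invented for illustration, but the transparency advantage is visible in that a human can read exactly why a tag was or was not assigned.

```python
import re

# Each rule fires if every "all" term appears, at least one "any"
# term appears, and no "none" term appears. (Illustrative rules only.)
RULES = [
    {"tag": "M&A", "all": {"merger"},
     "any": {"announced", "agreed"}, "none": {"rumor"}},
    {"tag": "Layoffs", "all": {"layoffs"},
     "any": {"announced", "planned"}, "none": set()},
]

def apply_rules(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    tags = []
    for r in RULES:
        if (r["all"] <= words
                and (not r["any"] or r["any"] & words)
                and not r["none"] & words):
            tags.append(r["tag"])
    return tags

print(apply_rules("The merger was announced today."))
print(apply_rules("A rumor of a merger was announced."))
```

Note the "none" clause: the second document mentions a merger but is only a rumor, so the rule correctly stays silent, and anyone reading the rule can see why. That editorial control is also the cost: when vocabulary shifts, someone has to update the term sets by hand.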

It is often claimed that ML is easier and quicker to develop, but the reality is that selecting good example documents is difficult and time consuming, especially when many thousands or even hundreds of thousands of good example documents are needed.

The current best practice is to combine ML and Semantic AI to get the best of both worlds.
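One simple way the combination can work (all rules, categories, and weights below are invented for this sketch): apply a transparent, high-precision rule first, and fall back to a statistical score, standing in for a learned model, only when no rule fires.

```python
import re

# Hand-written rule: tag "Finance" whenever these terms appear.
RULE_TERMS = {"merger", "acquisition"}

# Statistical fallback: per-category word weights, as if learned by ML
# from example documents (the weights are invented for this sketch).
WEIGHTS = {
    "Finance": {"revenue": 0.9, "profit": 0.8, "shares": 0.6},
    "HR": {"hiring": 0.9, "benefits": 0.7, "payroll": 0.8},
}

def hybrid_tag(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    if RULE_TERMS & words:
        # Rule path: explainable and high precision.
        return "Finance"
    # Statistical path: sum the learned weights per category.
    scores = {c: sum(w.get(x, 0) for x in words)
              for c, w in WEIGHTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(hybrid_tag("The merger closed."))           # rule path
print(hybrid_tag("Payroll and benefits update"))  # statistical path
```

The design choice this illustrates: rules handle the cases where precision and explainability matter most, while the learned scores provide coverage for everything the rules do not anticipate.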
The best way to learn more about text analytics is, of course, to buy my book, Deep Text.