There is No Such Thing as Unstructured Text
All text has some structure, from simple phrases and sentences to the more meaningful and useful structures of paragraphs and, especially, the sections of a document. Learning how to use that structure can improve auto-categorization accuracy by 20-50%, pinpoint key data, and produce more precise training sets.
Content Structure Models
The key to using these structures is to create a content structure model that not only captures how the document is organized, but also indicates what each section is used for and what weight to assign to it. Content structure models can be expressed in many forms, from a simple spreadsheet to a complex knowledge graph. The choice is often dictated by the intended application.
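As a rough illustration of the simplest form, here is a minimal sketch of a spreadsheet-style model loaded into Python. The section names, purposes, and weights are hypothetical, as are the names `SectionRule` and `load_model`; a real model would come from analyzing your own document types.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical model for a report-style document. Every row here is
# illustrative; none of these values come from a real project.
MODEL_CSV = """section,purpose,weight
Abstract,summary,5.0
Introduction,context,2.0
Body,detail,1.0
Acknowledgements,external references,0.0
"""


@dataclass(frozen=True)
class SectionRule:
    section: str   # section heading as it appears in the document
    purpose: str   # what the section is used for
    weight: float  # how heavily terms in this section should count


def load_model(text: str) -> dict[str, SectionRule]:
    """Parse the spreadsheet-style model into a lookup keyed by lowercase section name."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["section"].strip().lower(): SectionRule(
            section=row["section"].strip(),
            purpose=row["purpose"].strip(),
            weight=float(row["weight"]),
        )
        for row in reader
    }


model = load_model(MODEL_CSV)
```

Keeping the model in a flat table like this makes it easy for subject matter experts to review and adjust; a knowledge graph becomes worth the extra effort only when sections need richer relationships than a single purpose and weight.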
For auto-categorization, the most important sections are the ones that summarize the document, such as an Abstract, an Executive Summary, or, for less formal documents, the first and last paragraphs. These sections represent the author’s explicit statement of what the document is about.
The content structure model assigns weights to each section. These weights can range from infinite (only use terms in those sections) to 0 (ignore all terms in those sections).
Knowing which parts of a document to ignore can be as important as knowing which sections to emphasize, because ignoring them greatly reduces the noise for both semantic rules (sets of keywords) and machine learning (ML). For example, for auto-categorization, sections like Acknowledgements and See Related should be ignored because they refer to entities outside the document.
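To make the weighting concrete, here is a minimal sketch of how section weights might be applied to a keyword rule. The weights, the sample document, and the function names (`count_hits`, `score`) are all assumptions for illustration, and treating an infinite weight as "score only those sections" is one possible reading of that rule, not the only one.

```python
import math
import re

# Hypothetical weights: math.inf means "use only this section",
# 0.0 means "ignore every term in this section".
SECTION_WEIGHTS = {
    "abstract": 5.0,
    "body": 1.0,
    "acknowledgements": 0.0,
    "see related": 0.0,
}
DEFAULT_WEIGHT = 1.0  # assumed weight for sections the model does not list


def count_hits(text, keywords):
    """Count tokens in text that match a set of lowercase single-word keywords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for token in tokens if token in keywords)


def score(sections, keywords, weights=SECTION_WEIGHTS):
    """Score a document given as {section name: text} against a keyword set."""
    weighted = [
        (weights.get(name.lower(), DEFAULT_WEIGHT), text)
        for name, text in sections.items()
    ]
    # If any section present has infinite weight, only those sections count.
    exclusive = [text for weight, text in weighted if math.isinf(weight)]
    if exclusive:
        return float(sum(count_hits(text, keywords) for text in exclusive))
    # Otherwise take the weighted sum, skipping zero-weight sections.
    return sum(
        weight * count_hits(text, keywords)
        for weight, text in weighted
        if weight > 0
    )


doc = {
    "Abstract": "Solar panels and solar storage.",
    "Body": "The solar array feeds a battery.",
    "Acknowledgements": "Thanks to the Solar Energy Institute and the Wind Council.",
}

solar_score = score(doc, {"solar"})
wind_score = score(doc, {"wind"})
```

In this toy document, a rule for "wind" would fire on the Acknowledgements if every section counted, even though the document is not about wind. With the section rules, `wind_score` stays at zero while `solar_score` is driven mostly by the Abstract.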
[Aside – This is true of many author-supplied keywords as they are often about related topics, not the content of this document. I have usually found that I get better tagging results when I ignore most author-supplied keywords. It’s amazing how much better most authors are at writing summaries than adding keywords.]
Measuring the Impact
In a recent project, I wanted to measure the impact of section rules on a set of categorization rules, so I did two runs. The first treated every section alike, using only generic weights (terms near the beginning of the document counted more heavily). The second used section rules that weighted summary-type sections much more heavily. The results were clear:
The takeaway: Content structure models can drastically improve auto-tagging accuracy, which means vastly superior applications, from search to data analysis to training sets for AI.