Paco Nathan

Lemma Graph

This project introduces the notion of a lemma graph as an intermediate representation. Effectively, this provides a kind of cache during the processing of each "chunk" of text. Think of the end result as "enhanced tokenization" for text used to generate graph data elements. Other projects might call this by different names: an "evidence graph" in #wen2023mindmap or a "dynamically growing local KG" in #loganlpgs19.

The lemma graph collects metadata from NLP parsing, entity linking, etc., which many applications discard. As a result, the lemma graph becomes rather "noisy", and in most cases would be too big to store across the analysis of a large corpus.

For each chunk, this intermediate form is used to collect the valuable information about nodes, edges, properties, probabilities, etc., which is then aggregated into the overall document analysis.
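The per-chunk collection and aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the `LemmaGraph` class, the `build_lemma_graph` and `aggregate` helpers, the `FOLLOWS` relation, and the part-of-speech filter are all hypothetical, and the input is assumed to have been parsed already into `(lemma, POS)` pairs. Simple counts stand in for the probabilities mentioned in the text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LemmaGraph:
    """Hypothetical per-chunk intermediate graph, keyed by (lemma, POS)."""
    nodes: Counter = field(default_factory=Counter)  # (lemma, pos) -> count
    edges: Counter = field(default_factory=Counter)  # (src, rel, dst) -> count

    def add_token(self, lemma: str, pos: str) -> tuple:
        key = (lemma.lower(), pos)
        self.nodes[key] += 1
        return key

    def add_edge(self, src: tuple, rel: str, dst: tuple) -> None:
        self.edges[(src, rel, dst)] += 1


def build_lemma_graph(parsed_chunk) -> LemmaGraph:
    """Build a noisy lemma graph from one chunk.

    `parsed_chunk` is assumed to be a list of sentences, each a list of
    (lemma, pos) pairs already produced by an NLP parser.
    """
    graph = LemmaGraph()
    for sent in parsed_chunk:
        keys = [graph.add_token(lemma, pos) for lemma, pos in sent]
        # Link adjacent tokens; a stand-in for dependency-parse edges.
        for src, dst in zip(keys, keys[1:]):
            graph.add_edge(src, "FOLLOWS", dst)
    return graph


def aggregate(doc_nodes, doc_edges, graph, keep_pos=("NOUN", "PROPN")):
    """Fold the valuable parts of one lemma graph into document-level
    tallies, after which the lemma graph itself can be discarded."""
    for key, count in graph.nodes.items():
        if key[1] in keep_pos:
            doc_nodes[key] += count
    for (src, rel, dst), count in graph.edges.items():
        if src[1] in keep_pos and dst[1] in keep_pos:
            doc_edges[(src, rel, dst)] += count


# Two tiny "chunks", each holding one pre-parsed sentence.
chunks = [
    [[("Werner", "PROPN"), ("Herzog", "PROPN"), ("be", "AUX"),
      ("a", "DET"), ("filmmaker", "NOUN")]],
    [[("Herzog", "PROPN"), ("direct", "VERB"), ("film", "NOUN")]],
]

doc_nodes, doc_edges = Counter(), Counter()
for chunk in chunks:
    aggregate(doc_nodes, doc_edges, build_lemma_graph(chunk))
```

Only the filtered tallies in `doc_nodes` and `doc_edges` outlive each chunk, which reflects the point that the full lemma graph is too noisy and too large to keep for a whole corpus.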

Consequently, this project explores the use of topological transforms on graphs to enhance representations for graph levels of detail, i.e., being able to understand a graph at varying levels of abstraction. Note that adjacent areas of interest include emerging work on:

  • graph of relations
  • foundation models for KGs

A means of "bootstrapping" a lemma graph with initial semantic relations allows for "sampling" from a curated KG to enhance the graph algorithms used, e.g., through semantic random walks, which can incorporate heterogeneous sources and relatively large-scale external KGs. This mechanism also creates opportunities for distributed processing, because the "chunks" of text can follow a task-parallel pattern, accumulating the extracted results from each lemma graph into a graph database. Augmenting a KG iteratively over time follows a similar pattern.
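The task-parallel pattern mentioned above might look roughly like the following sketch. It rests on several assumptions: `extract_edges` is a hypothetical stand-in for building and reducing a lemma graph, a `Counter` stands in for the graph database, and a thread pool from the standard library stands in for whatever distributed framework a real deployment would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def extract_edges(chunk: str) -> Counter:
    """Hypothetical per-chunk task: tally edges between adjacent words.

    Each call depends only on its own chunk, so the calls can run
    independently of one another.
    """
    words = chunk.lower().split()
    return Counter(zip(words, words[1:]))


def process_corpus(chunks, max_workers=4) -> Counter:
    """Run one task per chunk and accumulate the results."""
    store = Counter()  # stand-in for a graph database
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Results are merged in the calling thread, so the store
        # needs no locking.
        for edges in pool.map(extract_edges, chunks):
            store.update(edges)
    return store


chunks = ["graph data elements", "lemma graph data", "graph data cache"]
store = process_corpus(chunks)
```

Because each task returns its own result and the merge happens in one place, the same structure would apply when augmenting an existing KG iteratively: the store is simply not empty at the start.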