Knowledge Graph
Knowledge graphs transform complex information into interconnected networks of entities and relationships, mirroring how human brains connect concepts. Their growing adoption is driven by how much they can improve AI systems, particularly LLMs and RAG pipelines.
In RAG applications, knowledge graphs surpass traditional document retrieval by providing granular, relationship-rich information. Their ability to map cross-document connections creates a unified semantic layer, enabling more contextually aware retrieval and improved AI responses.
Typical knowledge graph construction starts with document chunking, followed by LLM-based entity and relationship extraction from each chunk: the model identifies key entities and the connections between them, producing a structured knowledge representation.
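To make the target concrete, here is the kind of subject–relation–object triple a pipeline like this produces; the exact schema varies by implementation, and this snippet is purely illustrative:
# Illustrative triples for a chunk such as
# "Gandalf counsels Frodo to leave the Shire."
triples = [
    {'subject': 'Gandalf', 'relation': 'counsel', 'object': 'Frodo'},
    {'subject': 'Frodo', 'relation': 'leave', 'object': 'Shire'},
]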
While LLM-based knowledge graphs offer high accuracy, there are scenarios where a simpler, more cost-effective approach is needed. With libraries like spaCy, we can build knowledge graphs without relying on LLMs at all, which makes the technique accessible for projects with limited resources or for quick prototyping. The accuracy may be lower, but this approach gives organizations a practical way to experiment with knowledge graphs without the computational and financial overhead of an LLM.
Pipeline
Initial Setup and spaCy Loading
First, we'll import the necessary libraries and load spaCy's large English model, which includes word vectors for better language understanding. This sets up our foundation for the natural language processing tasks that follow.
import spacy
from typing import List, Dict, Tuple, Set
from rapidfuzz import fuzz
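# If the model isn't installed yet, download it first:
#   python -m spacy download en_core_web_lg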
# Load the large English model which includes word vectors
nlp = spacy.load("en_core_web_lg")
# Example usage
text = """Bilbo Baggins celebrates his birthday and leaves the Ring to Frodo, his heir.
Gandalf suspects it is a Ring of Power and counsels Frodo to take it away."""
# Process the text with SpaCy
doc = nlp(text)
Entity Extraction
Here we’ll identify and extract named entities from the text, ensuring we capture unique entities while preserving their original context and type information.
def extract_entities(doc) -> Tuple[List[Dict], Set[str]]:
    """Extract and categorize named entities from a spaCy doc."""
    entities = []
    unique_entities = set()
    for ent in doc.ents:
        if ent.text not in unique_entities:
            entity_dict = {
                'text': ent.text,
                'type': ent.label_,
                'start': ent.start_char,
                'end': ent.end_char
            }
            entities.append(entity_dict)
            unique_entities.add(ent.text)
    return entities, unique_entities
# Example usage
entities, unique_entities = extract_entities(doc)
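Printing the result is a quick sanity check. With the sample text and en_core_web_lg the output looks roughly like the comment below, though the detected entities and labels vary with the model version:
for entity in entities:
    print(entity['text'], entity['type'])
# Expected along the lines of:
#   Bilbo Baggins PERSON
#   Frodo PERSON
#   Gandalf PERSON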
Basic Relation Extraction
We’ll extract subject-verb-object relationships from the text, capturing how entities interact with each other through actions and connections.
def extract_basic_relations(doc) -> List[Dict]:
    """Extract subject-verb-object relations from a spaCy doc."""
    relations = []
    for token in doc:
        if token.pos_ == "VERB":
            # Find subjects
            subjects = []
            for subj in token.lefts:
                if subj.dep_ in ("nsubj", "nsubjpass"):
                    subjects.append(subj)
                    # Include conjunctions (e.g., "Sam and Frodo")
                    subjects.extend([child for child in subj.children
                                     if child.dep_ == "conj"])
            # Find objects
            objects = []
            for obj in token.rights:
                if obj.dep_ == "dobj":
                    objects.append(obj)
                    # Include conjunctions
                    objects.extend([child for child in obj.children
                                    if child.dep_ == "conj"])
                elif obj.dep_ == "prep":  # Handle prepositional objects
                    objects.extend([child for child in obj.children
                                    if child.dep_ == "pobj"])
            # Create relations for each subject-object pair
            for subj in subjects:
                for obj in objects:
                    relation = {
                        'subject': subj.text,
                        'relation': token.lemma_,
                        'object': obj.text,
                        'sentence': token.sent.text
                    }
                    relations.append(relation)
    return relations
# Example usage
basic_relations = extract_basic_relations(doc)
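Printing the raw triples shows both the value and the limits of pure dependency parsing. Subjects and objects are single tokens, so "Bilbo Baggins" surfaces only as its syntactic head "Baggins", and verbs that share a subject through conjunction (such as "leaves" here) can be missed entirely; the resolution steps below exist precisely to patch up the first problem. The exact output depends on the parser version:
for rel in basic_relations:
    print(rel['subject'], '-', rel['relation'], '->', rel['object'])
# With the sample text, expect something like:
#   Baggins - celebrate -> birthday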
Pronoun Resolution
To improve the quality of our knowledge graph, we’ll implement a simple pronoun resolution system that maps pronouns to their most likely antecedent entities.
def resolve_pronouns(doc, entity_positions: List[Tuple[str, int, int]]) -> Dict[str, str]:
    """Create a mapping of pronouns to their likely antecedents."""
    pronouns = {"them", "it", "they", "who", "he", "she", "him", "his", "her", "their"}
    pronoun_map = {}
    for token in doc:
        if token.text.lower() in pronouns:
            # Find the nearest preceding entity
            previous_entities = [ent for ent in entity_positions
                                 if ent[1] < token.idx]
            if previous_entities:
                # Use the most recent entity as the antecedent; key on the
                # lowercased pronoun so later lookups are case-insensitive
                pronoun_map[token.text.lower()] = previous_entities[-1][0]
    return pronoun_map
# Example usage
entity_positions = [(ent.text, ent.start_char, ent.end_char)
                    for ent in doc.ents]
pronoun_map = resolve_pronouns(doc, entity_positions)
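Inspecting the mapping reveals the fragility of the nearest-antecedent heuristic. Keys are lowercased pronouns, so a pronoun that occurs twice keeps only its last antecedent, and "it" ends up bound to the most recent named entity rather than to the Ring. The exact contents depend on which entities NER found:
print(pronoun_map)
# With the sample text, plausibly something like:
#   {'his': 'Frodo', 'it': 'Frodo'}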
Fuzzy Entity Matching
To handle variations in entity names, we’ll implement fuzzy matching that can identify similar entities and standardize their references.
def fuzzy_match_entity(text: str, entities: Set[str], threshold: float = 0.9) -> str:
    """Match text to the most similar entity using fuzzy matching."""
    best_match = None
    best_ratio = 0
    text_lower = text.lower()
    for entity in entities:
        entity_lower = entity.lower()
        # Try exact word matching first
        if text_lower in entity_lower.split() or entity_lower in text_lower.split():
            if len(text) > 3:  # Avoid matching very short words
                return entity
        # Try partial string matching
        if text_lower in entity_lower or entity_lower in text_lower:
            if len(text) > 3:
                return entity
        # Calculate token sort ratio for word order differences
        ratio = fuzz.token_sort_ratio(text_lower, entity_lower) / 100.0
        # If needed, try partial ratio for substring matches
        if ratio < threshold:
            ratio = fuzz.partial_ratio(text_lower, entity_lower) / 100.0
        if ratio > threshold and ratio > best_ratio:
            best_match = entity
            best_ratio = ratio
    return best_match if best_match else text
# Example usage
test_text = "Baggin" # Misspelled name
matched_entity = fuzzy_match_entity(test_text, unique_entities)
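Assuming "Bilbo Baggins" was among the extracted entities, the misspelling is caught early by the substring branch, since "baggin" is contained in "bilbo baggins":
print(matched_entity)
# 'Bilbo Baggins' (falls back to the input text if no entity clears the threshold)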
Extracting Relations From Text
Finally, we'll combine all our components into a single function that takes raw text and returns the entities, resolved relations, and pronoun map needed to build the graph.
def process_text_pipeline(text: str) -> Dict:
    """Process text through the complete pipeline."""
    # Step 1: Process with spaCy
    doc = nlp(text)
    # Step 2: Extract entities
    entities, unique_entities = extract_entities(doc)
    # Step 3: Get entity positions for pronoun resolution
    entity_positions = [(ent['text'], ent['start'], ent['end'])
                        for ent in entities]
    # Step 4: Create pronoun mapping
    pronoun_map = resolve_pronouns(doc, entity_positions)
    # Step 5: Extract basic relations
    relations = extract_basic_relations(doc)
    # Step 6: Resolve pronouns and fuzzy match entities in relations
    resolved_relations = []
    for rel in relations:
        resolved_rel = rel.copy()
        # Resolve subject
        if rel['subject'].lower() in pronoun_map:
            resolved_rel['subject'] = pronoun_map[rel['subject'].lower()]
        else:
            resolved_rel['subject'] = fuzzy_match_entity(
                rel['subject'], unique_entities
            )
        # Resolve object
        if rel['object'].lower() in pronoun_map:
            resolved_rel['object'] = pronoun_map[rel['object'].lower()]
        else:
            resolved_rel['object'] = fuzzy_match_entity(
                rel['object'], unique_entities
            )
        resolved_relations.append(resolved_rel)
    return {
        'entities': entities,
        'relations': resolved_relations,
        'pronoun_map': pronoun_map
    }
# Example usage of the complete pipeline
text = """
Bilbo Baggins celebrates his birthday and leaves the Ring to Frodo, his heir.
Gandalf suspects it is a Ring of Power and counsels Frodo to take it away.
"""
result = process_text_pipeline(text)
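A quick look at the resolved relations confirms the pieces work together; where resolution succeeded, single-token subjects have been expanded back to full entity names. As before, the exact output is model-dependent:
for rel in result['relations']:
    print(rel['subject'], '-', rel['relation'], '->', rel['object'])
# e.g. Bilbo Baggins - celebrate -> birthday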
Creating the Knowledge Graph
First, we’ll create a class to handle our knowledge graph operations using NetworkX:
import networkx as nx
import matplotlib.pyplot as plt
from typing import List, Dict
import logging
class LORKnowledgeGraph:
    def __init__(self):
        """Initialize the NetworkX graph with multi-directed edges."""
        self.G = nx.MultiDiGraph()
        logging.basicConfig(level=logging.INFO)
        self.logger = logging.getLogger(__name__)

    @staticmethod
    def _sanitize_label(name: str) -> str:
        """Turn an entity name into a safe node label."""
        label = name.replace(' ', '_').replace('-', '_').replace("'", '')
        if not label or not label[0].isalpha():
            label = 'N_' + label
        return label

    def create_entity_node(self, entity: str):
        """Create a node with a sanitized label."""
        self.G.add_node(self._sanitize_label(entity), name=entity)

    def create_relationship(self, subject: str, relation: str, obj: str):
        """Create a typed, directed edge between two nodes."""
        relation_type = relation.upper().replace(' ', '_')
        self.G.add_edge(self._sanitize_label(subject),
                        self._sanitize_label(obj),
                        relation=relation_type)

    def build_from_relations(self, relations: List[Dict]):
        """Build the graph from extracted relations."""
        for relation in relations:
            self.create_entity_node(relation['subject'])
            self.create_entity_node(relation['object'])
            self.create_relationship(
                relation['subject'],
                relation['relation'],
                relation['object']
            )
Now, let’s extend our pipeline to process multiple text segments and build a knowledge graph:
def process_multiple_texts(texts: List[str]) -> nx.MultiDiGraph:
    """Process multiple text segments and create a knowledge graph."""
    # Initialize the knowledge graph
    kg = LORKnowledgeGraph()
    # Process each text segment
    for text in texts:
        # Run each segment through our existing pipeline
        result = process_text_pipeline(text)
        # Add extracted relations to the knowledge graph
        kg.build_from_relations(result['relations'])
    return kg.G
def visualize_knowledge_graph(G: nx.MultiDiGraph, figsize=(16, 12)):
    """Visualize the knowledge graph."""
    plt.figure(figsize=figsize)
    # Calculate node sizing based on degree centrality
    degrees = dict(G.degree())
    node_sizes = [1500 + (degrees[node] * 100) for node in G.nodes()]
    # Use Graphviz layout for better visualization; fall back to a
    # spring layout if pygraphviz is not installed
    try:
        pos = nx.nx_agraph.graphviz_layout(G, prog='neato')
    except ImportError:
        pos = nx.spring_layout(G, seed=42)
    # Draw the graph with curved edges and node labels
    nx.draw_networkx_edges(G, pos,
                           edge_color='gray',
                           alpha=0.4,
                           arrows=True,
                           connectionstyle='arc3,rad=0.2')
    nx.draw_networkx_nodes(G, pos,
                           node_color='lightblue',
                           node_size=node_sizes,
                           alpha=0.7)
    # Add labels; get_edge_attributes on a MultiDiGraph keys edges as
    # (u, v, key) tuples, so rebuild the dict with (u, v) keys, which
    # draw_networkx_edge_labels accepts across NetworkX versions
    # (parallel edges collapse to a single label)
    node_labels = nx.get_node_attributes(G, 'name')
    edge_labels = {(u, v): d['relation'] for u, v, d in G.edges(data=True)}
    nx.draw_networkx_labels(G, pos, node_labels, font_size=8)
    nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=6)
    plt.title("Knowledge Graph Visualization")
    plt.axis('off')
    return plt
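With everything defined, a minimal driver ties the whole workflow together. The chunks below are stand-ins for however you split your own source text:
# Example chunks; in practice these come from your own document chunking
texts = [
    """Bilbo Baggins celebrates his birthday and leaves the Ring to Frodo,
    his heir. Gandalf suspects it is a Ring of Power and counsels Frodo
    to take it away.""",
    """Frodo and Sam set out for Mordor to destroy the Ring.""",
]
G = process_multiple_texts(texts)
plot = visualize_knowledge_graph(G)
plot.show()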
Final Result
Here's what the final result looks like when the pipeline runs on multiple chunks of text from The Lord of the Rings:
Conclusion
This lightweight approach to knowledge graph construction demonstrates that valuable semantic networks can be created without relying on large language models. While LLM-based approaches may offer higher accuracy and more nuanced relationship extraction, the combination of spaCy, fuzzy matching, and basic NLP techniques provides a solid foundation for organizations looking to experiment with knowledge graphs. This method not only offers significant cost savings but also provides greater transparency and control over the extraction process. As knowledge graphs become critical components of modern AI systems, accessible, LLM-free alternatives ensure that organizations of all sizes can harness structured knowledge representation for their specific use cases.