# Create a JSON Knowledge Graph representing a Text-Fabric dataset (N1904-TF)

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load the TF dataset</a>
* <a href="#bullet3">3 - Run part of the Doc4TF code</a>
* <a href="#bullet4">4 - Run the extra code</a>
* <a href="#bullet5">5 - The result: a JSON Knowledge Graph</a>
* <a href="#bullet6">6 - Notebook version details</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

In this notebook we create the bare (JSON) Knowledge Graph. To build the source dictionary, we reuse part of the code I created for [Doc4TF](https://github.com/tonyjurg/Doc4TF).

# 2 - Load the TF dataset <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

In [1]:
from tf.app import use
from collections import defaultdict
import json

# Load the N1904 Text-Fabric dataset
A = use('CenterBLC/N1904', version='1.0.0', hoist=globals())

**Locating corpus resources ...**

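| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |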


Display is setup for viewtype [syntax-view](https://github.com/CenterBLC/N1904/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/CenterBLC/N1904/blob/main/docs/viewtypes.md#start) for more information on viewtypes

# 3 - Run part of the Doc4TF code <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

In [2]:
verbose=False
tableLimit=10

# Initialize an empty dictionary to store feature data
featureDict = {}
import time
overallTime = time.time()

def getFeatureDescription(metaData):
    """
    This function looks for the 'description' key in the metadata dictionary. If the key is found,
    it returns the corresponding description. If the key is not present, it returns a default 
    message indicating that no description is available.

    Parameters:
       metaData (dict): A dictionary containing metadata about a feature.

    Returns:
       str: The description of the feature if available, otherwise a default message.
    """
    return metaData.get('description', "No feature description")

def setDataType(metaData):
    """
    This function checks for the 'valueType' key in the metadata. If the key is present, it
    returns 'String' if the value is 'str', and 'Integer' for other types. If the 'valueType' key
    is not present, it returns 'Unknown'.

    Parameters:
       metaData (dict): A dictionary containing metadata, including the 'valueType' of a feature.

    Returns:
       str: A string indicating the determined data type ('String', 'Integer', or 'Unknown').
    """
    if 'valueType' in metaData:
        return "String" if metaData["valueType"] == 'str' else "Integer"
    return "Unknown"

def processFeature(feature, featureType, featureMethod):
    """
    Processes a given feature by extracting metadata, description, and data type, and then
    compiles frequency data for different node types in a feature dictionary. Certain features
    are skipped based on their type. The processed data is added to a global feature dictionary.

    Parameters:
       feature (str): The name of the feature to be processed.
       featureType (str): The type of the feature ('Node' or 'Edge').
       featureMethod (function): A function to obtain feature data.

    Returns:
       None: The function updates a global dictionary with processed feature data and does not return anything.
    """
    
    # Obtain the meta data
    featureMetaData = featureMethod(feature).meta
    featureDescription = getFeatureDescription(featureMetaData)
    dataType = setDataType(featureMetaData)

    # Initialize dictionary to store feature frequency data
    featureFrequencyDict = {}

    # Skip for specific features based on type
    if not (featureType == 'Node' and feature == 'otype') and not (featureType == 'Edge' and feature == 'oslots'):
        for nodeType in F.otype.all:
            frequencyLists = featureMethod(feature).freqList(nodeType)
            
            # Calculate the total frequency
            if not isinstance(frequencyLists, int):
                frequencyTotal = sum(freq for _, freq in frequencyLists)
            else:
                frequencyTotal = frequencyLists
            
            # Calculate the number of entries
            if not isinstance(frequencyLists, int):
                numberOfEntries = len(frequencyLists)
            else:
                numberOfEntries = 1 if frequencyLists != 0 else 0
            # Check the length of the frequency table
            truncated = numberOfEntries > tableLimit
                
            if not isinstance(frequencyLists, int):
                if len(frequencyLists)!=0:
                    featureFrequencyDict[nodeType] = {'nodetype': nodeType, 'freq': frequencyLists[:tableLimit], 'total': frequencyTotal, 'truncated': truncated}
            elif isinstance(frequencyLists, int):
                if frequencyLists != 0:
                    featureFrequencyDict[nodeType] = {'nodetype': nodeType, 'freq': [("Link", frequencyLists)], 'total': frequencyTotal, 'truncated': truncated}

    # Add processed feature data to the main dictionary
    featureDict[feature] = {'name': feature, 'descr': featureDescription, 'type': featureType, 'datatype': dataType, 'freqlist': featureFrequencyDict}
    
########################################################
#                     MAIN FUNCTION                    #
########################################################

########################################################
#             Gather general information               #
########################################################

print('Gathering generic details')

# Initialize default values
corpusName = A.appName
liveName = ''
versionName = A.version

# Trying to locate corpus information
if A.provenance:
    for parts in A.provenance[0]: 
        if isinstance(parts, tuple):
            key, value = parts[0], parts[1]
            if verbose: print (f'General info: {key}={value}')
            if key == 'corpus': corpusName = value
            if key == 'version': versionName = value
            # value for live is a tuple
            if key == 'live': liveName=value[1]
if liveName is not None and len(liveName)>1:
    # A URL was found
    pageTitleMD = f'Doc4TF pages for [{corpusName}]({liveName}) (version {versionName})'
    pageTitleHTML = f'<h1>Doc4TF pages for <a href="{liveName}">{corpusName}</a> (version {versionName})</h1>'
else:
    # No URL found
    pageTitleMD = f'Doc4TF pages for {corpusName} (version {versionName})'
    pageTitleHTML = f'<h1>Doc4TF pages for {corpusName} (version {versionName})</h1>'

# Overwrite in case user provided a title
if 'customPageTitleMD' in globals():
    pageTitleMD = customPageTitleMD
if 'customPageTitleHTML' in globals():
    pageTitleHTML = customPageTitleHTML

    
########################################################
#             Processing node features                 #
########################################################

print('Analyzing Node Features: ', end='')
for nodeFeature in Fall():
    if not verbose: print('.', end='')  # Progress indicator
    processFeature(nodeFeature, 'Node', Fs)
    if verbose: print(f'\nFeature {nodeFeature} = {featureDict[nodeFeature]}\n')  # Print feature data if verbose

########################################################
#             Processing edge features                 #
########################################################

print('\nAnalyzing Edge Features: ', end='')
for edgeFeature in Eall():
    if not verbose: print('.', end='')  # Progress indicator
    processFeature(edgeFeature, 'Edge', Es)
    if verbose: print(f'\nFeature {edgeFeature} = {featureDict[edgeFeature]}\n')  # Print feature data if verbose

########################################################
#             Sorting feature dictionary               #
########################################################

# Sort the feature dictionary alphabetically by keys
sortedFeatureDict = {k: featureDict[k] for k in sorted(featureDict)}

# Print the sorted feature dictionary if verbose
if verbose:
    print("\nSorted Feature Dictionary:")
    for key, value in sortedFeatureDict.items():
        print(f"Feature {key} = {value}")
    
print(f'\nFinished in {time.time() - overallTime:.2f} seconds.')

Gathering generic details
Analyzing Node Features: ........................................................
Analyzing Edge Features: .....
Finished in 19.82 seconds.
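
The per-node-type bookkeeping inside `processFeature` can be tried out without loading a corpus. The following is a minimal sketch and not part of Doc4TF: `summarizeFrequencies` is a hypothetical helper name, and the sample values are made up. It assumes, as the code above does, that a frequency list is either a sequence of `(value, count)` pairs or a plain integer count.

```python
def summarizeFrequencies(nodeType, frequencyList, tableLimit=10):
    """
    Hypothetical helper (not part of Doc4TF) that mirrors the summary
    processFeature stores per node type. Returns None when there is nothing
    to record, otherwise a dict with the keys used in featureFrequencyDict.
    """
    if isinstance(frequencyList, int):
        # A plain count: record it as a single "Link" entry
        if frequencyList == 0:
            return None
        return {
            'nodetype': nodeType,
            'freq': [("Link", frequencyList)],
            'total': frequencyList,
            'truncated': 1 > tableLimit,
        }

    if len(frequencyList) == 0:
        return None

    return {
        'nodetype': nodeType,
        # Keep only the first tableLimit entries of the table
        'freq': list(frequencyList[:tableLimit]),
        # The total still counts every entry, including the ones cut off
        'total': sum(freq for _, freq in frequencyList),
        'truncated': len(frequencyList) > tableLimit,
    }


# Made-up sample values, for illustration only
sampleSummary = summarizeFrequencies('word', [('a', 3), ('b', 2), ('c', 1)], tableLimit=2)
print(sampleSummary)
```

Note that `total` is computed over the full list, so a truncated table still reports the complete number of occurrences.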


# 4 - Run the extra code <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

import json

knowledgeGraph = {
    "nodes": {},
    "edges": []
}

for featName, featInfo in featureDict.items():
    # Determine if "Node" or "Edge" feature
    featureKind = featInfo.get("type", "Node")  # "Node" or "Edge"
    if featureKind.lower() == "edge":
        featureType = "edge_feature"
    else:
        featureType = "node_feature"

    # Build a namespaced key for this feature
    featureKey = f"feature::{featName}"

    # Make sure the feature node is in the graph
    nodeEntry = knowledgeGraph["nodes"].setdefault(featureKey, {
        "type": featureType,
        "valid_on": []
    })

    # Store more metadata about the feature
    nodeEntry["featureName"] = featInfo.get("name", featName)   # e.g. "after"
    nodeEntry["description"] = featInfo.get("descr", "")        # e.g. "material after the end of ..."
    nodeEntry["datatype"]    = featInfo.get("datatype", "")     # e.g. "String"

    # Collect node types from the freqlist
    freqInfo = featInfo.get("freqlist", {})
    for freqKey, freqDict in freqInfo.items():
        # freqKey might be "phrase", "word", etc.
        # freqDict has "nodetype": "phrase" (or "word"), plus "freq", "total", ...
        nodeTypeName = freqDict.get("nodetype", freqKey)

        # Build a namespaced key for this node type
        nodeTypeKey = f"otype::{nodeTypeName}"

        # Make sure that node type is declared
        if nodeTypeKey not in knowledgeGraph["nodes"]:
            knowledgeGraph["nodes"][nodeTypeKey] = {
                "type": "node_type",
                "origName": nodeTypeName
            }

        # Record that this feature is valid on this node type
        if nodeTypeKey not in nodeEntry["valid_on"]:
            nodeEntry["valid_on"].append(nodeTypeKey)

        # Add an edge with frequency detail
        knowledgeGraph["edges"].append({
            "from": featureKey,
            "to": nodeTypeKey,
            "relation": "valid on",
            "freqDetail": freqDict
        })

# Output the JSON
outputPath = "n1904_knowledge_graph.json"
with open(outputPath, "w", encoding="utf-8") as f:
    json.dump(knowledgeGraph, f, indent=2)

print(f"Knowledge graph saved to {outputPath}")

# Summary
numNodeTypes = sum(1 for n, d in knowledgeGraph["nodes"].items() if d["type"] == "node_type")
numFeatures  = sum(1 for n, d in knowledgeGraph["nodes"].items() if d["type"].endswith("_feature"))
numEdges     = len(knowledgeGraph["edges"])
print(f"  - Node types: {numNodeTypes}")
print(f"  - Features:   {numFeatures}")
print(f"  - Edges:      {numEdges}")
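
To make the shape of the output concrete, the sketch below wraps the same steps in a function and applies it to a tiny hand-made feature dictionary. The function name `buildKnowledgeGraph`, the feature names, and all values are hypothetical. The metadata fields and the `freqDetail` payload are left out for brevity.

```python
import json


def buildKnowledgeGraph(featureDict):
    """Hypothetical, condensed version of the cell above (metadata omitted)."""
    graph = {"nodes": {}, "edges": []}
    for featName, featInfo in featureDict.items():
        isEdge = featInfo.get("type", "Node").lower() == "edge"
        featureKey = f"feature::{featName}"
        nodeEntry = graph["nodes"].setdefault(featureKey, {
            "type": "edge_feature" if isEdge else "node_feature",
            "valid_on": [],
        })
        for freqKey, freqDict in featInfo.get("freqlist", {}).items():
            nodeTypeName = freqDict.get("nodetype", freqKey)
            nodeTypeKey = f"otype::{nodeTypeName}"
            # Declare the node type once, then link the feature to it
            graph["nodes"].setdefault(nodeTypeKey, {
                "type": "node_type",
                "origName": nodeTypeName,
            })
            if nodeTypeKey not in nodeEntry["valid_on"]:
                nodeEntry["valid_on"].append(nodeTypeKey)
            graph["edges"].append({
                "from": featureKey,
                "to": nodeTypeKey,
                "relation": "valid on",
            })
    return graph


# Made-up input, for illustration only
sampleFeatureDict = {
    "exampleNodeFeature": {
        "type": "Node",
        "freqlist": {"word": {"nodetype": "word"}},
    },
    "exampleEdgeFeature": {
        "type": "Edge",
        "freqlist": {"word": {"nodetype": "word"}, "phrase": {"nodetype": "phrase"}},
    },
}

sampleGraph = buildKnowledgeGraph(sampleFeatureDict)
print(json.dumps(sampleGraph, indent=2))
```

Each feature becomes one `feature::` node, each node type one `otype::` node, and every feature/node-type combination found in the frequency list yields one `valid on` edge.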

# 5 - The result: a JSON Knowledge Graph <a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

The resulting JSON file, `n1904_knowledge_graph.json`, is the Knowledge Graph itself. It serves as input for the notebook [generate_cytoscape_html.ipynb](generate_cytoscape_html.ipynb).
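
As a quick sanity check before passing the graph on, it can be inverted into a lookup of features per node type. This is a minimal sketch under stated assumptions: `featuresPerNodeType` is a hypothetical helper, and the small in-memory graph stands in for the real file, which could instead be read with `json.load`.

```python
from collections import defaultdict


def featuresPerNodeType(graph):
    """Hypothetical helper: map each node type key to the features valid on it."""
    index = defaultdict(list)
    for edge in graph["edges"]:
        if edge.get("relation") == "valid on":
            index[edge["to"]].append(edge["from"])
    return {nodeType: sorted(features) for nodeType, features in index.items()}


# Stand-in for the real graph; the names are made up. With the file from
# section 4, one could use instead:
#   with open("n1904_knowledge_graph.json", encoding="utf-8") as f:
#       sampleGraph = json.load(f)
sampleGraph = {
    "nodes": {
        "feature::a": {"type": "node_feature", "valid_on": ["otype::word"]},
        "feature::b": {"type": "edge_feature", "valid_on": ["otype::word", "otype::phrase"]},
        "otype::word": {"type": "node_type", "origName": "word"},
        "otype::phrase": {"type": "node_type", "origName": "phrase"},
    },
    "edges": [
        {"from": "feature::b", "to": "otype::word", "relation": "valid on"},
        {"from": "feature::a", "to": "otype::word", "relation": "valid on"},
        {"from": "feature::b", "to": "otype::phrase", "relation": "valid on"},
    ],
}

nodeTypeIndex = featuresPerNodeType(sampleGraph)
for nodeType, features in sorted(nodeTypeIndex.items()):
    print(f"{nodeType}: {', '.join(features)}")
```

The same information is also stored on each feature node in `valid_on`; walking the edges gives the reverse direction without touching the node entries.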

# 6 - Notebook version details<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.1</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>3 April 2025</td>
    </tr>
  </table>
</div>