# GNT word list in BetaCode

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Create list of Greek words in Unicode</a>
* <a href="#bullet3">3 - Analyze Unicode accent storage</a>
* <a href="#bullet4">4 - Convert the word list into betacode</a>
* <a href="#bullet5">5 - Create a JSON dictionary</a>
* <a href="#bullet6">6 - Footnotes and attribution</a>
* <a href="#bullet7">7 - Required libraries</a>
* <a href="#bullet8">8 - Notebook version</a>


# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook takes the [MACULA XML](https://github.com/Clear-Bible/macula-greek) dataset as input and generates a list of all unique word forms in the Greek New Testament, encoded in Beta Code. This list serves as input to the Morpheus morphological tagger.

# 2 - Create list of Greek words in Unicode<a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

The first step is to compile a list of unique Greek words from the New Testament. This is done by extracting the text content of the `<w>` elements from the MACULA XML source data. An example of a `<w>` element, representing the final word in Matthew 1:1, is shown below:

```xml
        <w ref="MAT 1:1!8"
           after="."
           class="noun"
           type="proper"
           xml:id="n40001001008"
           lemma="Ἀβραάμ"
           normalized="Ἀβραάμ"
           strong="11"
           number="singular"
           gender="masculine"
           case="genitive"
           gloss="of Abraham"
           domain="093001"
           ln="93.7"
           morph="N-PRI"
           unicode="Ἀβραάμ.">Ἀβραάμ</w>
```
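As a minimal sketch of this extraction step, the snippet below parses a trimmed-down fragment with `xml.etree.ElementTree` and collects the text of each `<w>` element into a set. The `<sentence>` wrapper and the reduced attribute lists are assumptions made for illustration only; the real source files carry more structure and more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down fragment: the <sentence> wrapper and the reduced
# attribute set are illustrative only, not the actual layout of the source files.
sampleXml = """<sentence>
  <w ref="MAT 1:1!7" class="noun">υἱοῦ</w>
  <w ref="MAT 1:1!8" after="." class="noun" xml:id="n40001001008">Ἀβραάμ</w>
</sentence>"""

root = ET.fromstring(sampleXml)

# Collect the text content of every <w> element, skipping empty ones.
# Using a set removes duplicates automatically.
uniqueWords = {w.text for w in root.findall(".//w") if w.text}

# Attributes such as `ref` are available through .get()
lastRef = root.findall(".//w")[-1].get("ref")

print(sorted(uniqueWords))
print(lastRef)
```

Note that trailing punctuation lives in the `after` attribute rather than in the element text, so the collected words contain no punctuation.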

In [1]:
import os
import requests
import xml.etree.ElementTree as ET
import re
from pathlib import Path

# GitHub repository details
owner = "tonyjurg"
repo = "Nestle1904LFT"
branch = "main"
path = "resources/xml/20240210"  # Input XML treebank for the Nestle 1904 Greek New Testament

# Base URL for raw file content
rawBaseUrl = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}/"

# Option to use local files
useLocal = True  # Set to False to fetch files from GitHub
localInputDir = Path("C:/Users/tonyj/OneDrive/Documents/GitHub/REMA-grammarR-playground/XML-input").resolve()
outputFile = Path("uniqueWords.txt")  # Output file for unique words

def getRateLimit():
    """
    Fetch and display the current GitHub API rate limit status.
    """
    rateLimitUrl = "https://api.github.com/rate_limit"
    response = requests.get(rateLimitUrl)
    response.raise_for_status()
    rateLimit = response.json()["rate"]
    print(f"GitHub API Rate Limit: {rateLimit['remaining']} remaining out of {rateLimit['limit']} requests.")

def getFileList():
    """
    Get the list of XML files either from the GitHub repository or from the local directory.
    """
    if useLocal:
        if not localInputDir.exists():
            raise FileNotFoundError(f"Local directory {localInputDir} does not exist.")
        return sorted(
            file.name for file in localInputDir.glob("*.xml") if re.match(r"^\d{2}-", file.name)
        )
    else:
        getRateLimit()
        apiUrl = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        response = requests.get(apiUrl)
        response.raise_for_status()
        files = response.json()
        return sorted(
            file["name"] for file in files if file["name"].endswith(".xml") and re.match(r"^\d{2}-", file["name"])
        )

def processFile(fileName, uniqueWords):
    """
    Parse and process the content of a single XML file to collect unique words.
    """
    filePath = localInputDir / fileName if useLocal else f"{rawBaseUrl}{fileName}"
    
    if useLocal:
        with filePath.open("rb") as file:
            xmlContent = file.read()
    else:
        response = requests.get(filePath)
        response.raise_for_status()
        xmlContent = response.content

    # Parse the XML content (raw bytes, so the parser honors the declared encoding)
    try:
        root = ET.fromstring(xmlContent)
    except ET.ParseError as e:
        print(f"Error parsing {fileName}: {e}")
        return  # Skip this file and continue with the others

    for word in root.findall(".//w"):
        wordText = word.text  # get the text inside the `w` tag
        if wordText:
            uniqueWords.add(wordText)

def main():
    try:
        fileNames = getFileList()
        print(f"Found {len(fileNames)} XML files to process.")

        uniqueWords = set()
        for fileName in fileNames:
            try:
                processFile(fileName, uniqueWords)
            except Exception as e:
                print(f"Error processing {fileName}: {e}")

        # Write unique words to the output file
        with outputFile.open("w", encoding="utf-8") as file:
            for word in sorted(uniqueWords):  # Sort alphabetically before saving
                file.write(word + '\n')

        print(f"Unique words saved to {outputFile}.")
    except Exception as e:
        print(f"Error fetching file list or processing files: {e}")

if __name__ == "__main__":
    main()


Found 27 XML files to process.
Unique words saved to uniqueWords.txt.


# 3 - Analyze Unicode accent storage<a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

The distinction between pre-composed characters and separate accents in Unicode is essential for consistent text processing, particularly in Greek, where accents convey grammatical and phonetic meaning. A pre-composed character combines the base letter and its diacritics into a single Unicode code point, while the decomposed form stores the base letter followed by one or more combining code points. This difference can affect sorting, searching, and rendering, because systems may treat the two forms as different strings despite their identical appearance.
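The difference can be made concrete with the standard `unicodedata` module. The following is a small illustrative sketch (the chosen characters are examples, not taken from the dataset): it compares a pre-composed alpha with smooth breathing against its decomposed counterpart and shows that NFC and NFD normalization convert between the two.

```python
import unicodedata

precomposed = "\u1f00"       # ἀ as one code point (alpha with psili)
decomposed = "\u03b1\u0313"  # α followed by a combining comma above

# The two strings render identically but differ in length and compare unequal
lengths = (len(precomposed), len(decomposed))
sameString = precomposed == decomposed

# NFC composes, NFD decomposes
composedMatches = unicodedata.normalize("NFC", decomposed) == precomposed
decomposedMatches = unicodedata.normalize("NFD", precomposed) == decomposed

# Alpha with oxia (U+1F71) is canonically equivalent to alpha with tonos (U+03AC);
# NFC replaces the former with the latter.
oxiaBecomesTonos = unicodedata.normalize("NFC", "\u1f71") == "\u03ac"

print(lengths, sameString, composedMatches, decomposedMatches, oxiaBecomesTonos)
```

The last check matters for classification: a word containing U+1F71 equals neither its NFC nor its NFD form, so a test that compares a word against both normal forms would label it as neither purely pre-composed nor purely decomposed.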

The script reads Greek words from `uniqueWords.txt`, checks how their accents are stored, and categorizes each word as pre-composed, separate accents, or mixed. The detailed result per word is written to a JSON file (`accentAnalysis.json`), while a short summary is printed on screen.

In [2]:
import unicodedata

# Path to the input file
inputFile = 'uniqueWords.txt'

# Function to check if a word uses pre-composed characters
def checkAccentType(word):
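    """
    Determine if a word uses pre-composed characters or separate accent definitions.
    
    Args:
        word (str): The Greek word to check.

    Returns:
        str: "precomposed" if the word equals its NFC (composed) form,
             "separate accents" if it equals its NFD (decomposed) form,
             "mixed" if it equals neither.
             A word without any diacritics equals both forms and is
             reported as "precomposed", since NFC is checked first.
    """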
    normalizedNFC = unicodedata.normalize('NFC', word)  # Pre-composed form
    normalizedNFD = unicodedata.normalize('NFD', word)  # Decomposed form

    if word == normalizedNFC:
        return "precomposed"
    elif word == normalizedNFD:
        return "separate accents"
    else:
        return "mixed"

# Read Greek words from the input file
with open(inputFile, 'r', encoding='utf-8') as inFile:
    greekWords = inFile.read().splitlines()

# Analyze each word for accent storage
accentAnalysis = {word: checkAccentType(word) for word in greekWords}

# Print results
precomposedCount = sum(1 for v in accentAnalysis.values() if v == "precomposed")
separateAccentsCount = sum(1 for v in accentAnalysis.values() if v == "separate accents")
mixedCount = sum(1 for v in accentAnalysis.values() if v == "mixed")

print(f"Precomposed: {precomposedCount}")
print(f"Separate accents: {separateAccentsCount}")
print(f"Mixed: {mixedCount}")

# Save the results to a file
outputFile = 'accentAnalysis.json'
import json
with open(outputFile, 'w', encoding='utf-8') as outFile:
    json.dump(accentAnalysis, outFile, ensure_ascii=False, indent=4)

print(f"Accent analysis saved to {outputFile}.")

Precomposed: 19477
Separate accents: 0
Mixed: 0
Accent analysis saved to accentAnalysis.json.


# 4 - Convert the word list into betacode<a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

This script converts the list of Greek words stored in `uniqueWords.txt` into the corresponding Beta Code, a transliteration system for Greek that is used as input to the Morpheus morphological tagger.

The script reads the Greek words, applies the `beta_code.greek_to_beta_code` function (from the GitHub repository [perseids-tools/beta-code-py](https://github.com/perseids-tools/beta-code-py)) to convert each word, and writes the Beta Code equivalents to `gnt_words.txt`.
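To illustrate what such a conversion involves, here is a simplified, self-contained sketch. It is not the implementation of `beta-code-py`; the function name `toBetaCodeSketch` and both lookup tables are assumptions made for this example. The idea is to decompose each word with NFD, map base letters to Latin letters and combining marks to Beta Code symbols, and write capitals as `*` followed by the diacritics and then the letter. Punctuation, numerals, and other special cases are deliberately left out.

```python
import unicodedata

# Illustrative lookup tables (assumed for this sketch, not taken from beta-code-py)
BASE = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "h",
    "θ": "q", "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "c",
    "ο": "o", "π": "p", "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "u",
    "φ": "f", "χ": "x", "ψ": "y", "ω": "w",
}
MARKS = {
    "\u0313": ")",   # smooth breathing
    "\u0314": "(",   # rough breathing
    "\u0301": "/",   # acute
    "\u0300": "\\",  # grave
    "\u0342": "=",   # circumflex
    "\u0345": "|",   # iota subscript
    "\u0308": "+",   # diaeresis
}

def toBetaCodeSketch(word):
    """Convert a Greek word to Beta Code (simplified: letters and diacritics only)."""
    clusters = []  # each entry: [base character, Beta Code marks]
    for ch in unicodedata.normalize("NFD", word):
        if ch in MARKS and clusters:
            clusters[-1][1] += MARKS[ch]   # attach the mark to the preceding letter
        else:
            clusters.append([ch, ""])
    pieces = []
    for base, marks in clusters:
        lower = base.lower()
        latin = BASE.get(lower, base)      # unknown characters pass through unchanged
        if lower in BASE and base != lower:
            pieces.append("*" + marks + latin)  # capital: asterisk, marks, letter
        else:
            pieces.append(latin + marks)        # lowercase: letter, marks
    return "".join(pieces)

print(toBetaCodeSketch("τοῦτο"))
print(toBetaCodeSketch("Ἀβραάμ"))
```

Because the sketch normalizes to NFD first, it gives the same result for pre-composed and decomposed input.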

Using the `-S` switch (which turns *off* strict interpretation of upper/lower case) seems to allow Morpheus to recognize the form regardless of capitalization (note the difference between the `:raw` and `:workw` tags in the second example):

```text
root@morpheus:/morpheus# echo 'tou=to' | MORPHLIB=stemlib bin/cruncher -d -S

:raw tou=to

:workw tou=to
:lem ou(=tos
:prvb 
:aug1 
:stem tou=to                    indeclform
:suff 
:end     neut nom/voc/acc sg            indeclform      pron_adj1

root@morpheus:/morpheus# echo '*tou=to' | MORPHLIB=stemlib bin/cruncher -d -S

:raw *tou=to

:workw tou=to
:lem ou(=tos
:prvb 
:aug1 
:stem tou=to                    indeclform
:suff 
:end     neut nom/voc/acc sg            indeclform      pron_adj1
```
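Output in this format can be read back into Python with a few lines of code. The sketch below is an assumption-laden illustration: the function name `parseCruncherOutput` is hypothetical, and it assumes a single analysis per word as in the examples above (with several analyses per word, repeated tags would overwrite each other). It treats every line starting with `:` as a tag followed by its value and starts a new record at each `:raw` line.

```python
def parseCruncherOutput(text):
    """Split cruncher-style output into a list of {tag: value} records."""
    records = []
    for line in text.splitlines():
        if not line.startswith(":"):
            continue  # skip blank lines and shell prompts
        tag, _, value = line[1:].partition(" ")
        if tag == "raw" or not records:
            records.append({})  # each ':raw' line opens a new record
        records[-1][tag] = value.strip()
    return records

# Sample copied from the second example above
sampleOutput = """
:raw *tou=to

:workw tou=to
:lem ou(=tos
:prvb 
:aug1 
:stem tou=to                    indeclform
:suff 
:end     neut nom/voc/acc sg            indeclform      pron_adj1
"""

records = parseCruncherOutput(sampleOutput)
first = records[0]
print(first["raw"], "->", first["workw"], "| lemma:", first["lem"])
```

Comparing `raw` with `workw` per record makes it easy to spot the words for which the `-S` switch changed the form that was actually analyzed.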


In [5]:
import beta_code

def capitalizeIfAllCaps(word):
    if word.isupper():  # Check if the word is all uppercase
        return word.capitalize()  # Capitalize only the first letter
    return word  # Leave the word unchanged if it's not all uppercase

# Paths to input and output files
inputFile = 'uniqueWords.txt'       # File containing Greek Unicode words
outputFile = 'gnt_words.txt'    # File to save the converted Beta Code words

# Read Greek words from the input file
with open(inputFile, 'r', encoding='utf-8') as inFile:
    greekWords = inFile.read().splitlines()

# Convert Greek to Beta Code.
# Note: to get all words in lowercase, casefold() would be the more robust choice
# for Unicode (better than lower() for Greek):
#    betaCodeWords = [beta_code.greek_to_beta_code(word.casefold()) for word in greekWords]
# However, cruncher is called with the `-S` switch here, which takes care of many
# problem cases. An earlier variant only turned all-caps words into capitalized words:
#    betaCodeWords = [beta_code.greek_to_beta_code(capitalizeIfAllCaps(word)) for word in greekWords]
# The current version passes the words through without any modification:
betaCodeWords = [beta_code.greek_to_beta_code(word) for word in greekWords]

# Write the Beta Code words to the output file
with open(outputFile, 'w', encoding='utf-8') as outFile:
    for word in betaCodeWords:
        outFile.write(word + '\n')

print(f"Converted {len(greekWords)} words to Beta Code and saved to {outputFile}.")


Converted 19477 words to Beta Code and saved to gnt_words.txt.


# 5 - Create a JSON dictionary<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

The following script creates a JSON file in which the Beta Code representations are the keys and the corresponding Greek words are the values. This dictionary helps to translate the results of the Morpheus lookup back to Greek (which can now also be done in other ways, such as using the newly created TF feature [betacode](https://github.com/tonyjurg/N1904addons/blob/main/docs/features/betacode.md) or converting on the fly with the [beta-code-py library](https://github.com/perseids-tools/beta-code-py)).
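One property of such a reverse mapping deserves attention: if two different Greek words produce the same Beta Code key, a plain dictionary silently keeps only the last one. The sketch below, using a hypothetical three-entry sample rather than the real word list, stores a list of Greek words per key so that collisions become visible, and round-trips the result through JSON. In this notebook the number of JSON entries equals the number of words (19477), which suggests that no collisions occur in the actual data.

```python
import json
from collections import defaultdict

# Hypothetical miniature sample of (Beta Code, Greek) pairs for illustration
pairs = [("tou=to", "τοῦτο"), ("ou(=tos", "οὗτος"), ("*)abraa/m", "Ἀβραάμ")]

# Keep every Greek word per Beta Code key instead of overwriting
betaToGreek = defaultdict(list)
for betaCode, greek in pairs:
    betaToGreek[betaCode].append(greek)

# Keys that map to more than one Greek word
collisions = {key: words for key, words in betaToGreek.items() if len(words) > 1}

# Round-trip through JSON, keeping the Greek characters readable
serialized = json.dumps(betaToGreek, ensure_ascii=False, indent=4)
restored = json.loads(serialized)

print(restored["tou=to"])
print(f"Collisions: {len(collisions)}")
```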

In [4]:
import beta_code
import json

def capitalizeIfAllCaps(word):
    if word.isupper():  # Check if the word is all uppercase
        return word.capitalize()  # Capitalize only the first letter
    return word  # Leave the word unchanged if it's not all uppercase

# Paths to input and output files
inputFile = 'uniqueWords.txt'       # File containing Greek Unicode words
outputFile = 'betaCodeToWord.json'   # File to save the Beta Code-to-Greek mapping

# Read Greek words from the input file
with open(inputFile, 'r', encoding='utf-8') as inFile:
    greekWords = inFile.read().splitlines()

# Create a dictionary with Beta Code as keys and Greek words as values.
# Note: all-caps words are capitalized before conversion, unlike in the word list above.
wordsBetaCodeMap = {beta_code.greek_to_beta_code(capitalizeIfAllCaps(word)): word for word in greekWords}

# Write the dictionary to a JSON file
with open(outputFile, 'w', encoding='utf-8') as outFile:
    json.dump(wordsBetaCodeMap, outFile, ensure_ascii=False, indent=4)

print(f"Created JSON file with {len(wordsBetaCodeMap)} entries: {outputFile}")


Created JSON file with 19477 entries: betaCodeToWord.json


# 6 - Footnotes and attribution<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

The conversion engine is provided by the `beta-code-py` library, found in the GitHub repository [perseids-tools/beta-code-py](https://github.com/perseids-tools/beta-code-py) and available under the MIT license.

The source data for the conversion are the XML node files representing the macula-greek version of Eberhard Nestle's 1904 Greek New Testament (British and Foreign Bible Society, 1904). The starting dataset is formatted according to the syntax diagram markup initially prepared by the Asia Bible Society and currently made available by <a href="https://www.biblica.com/" target="_blank">Biblica, Inc</a>. The most recent source data can be found on [GitHub](https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/nodes).

# 7 - Required libraries<a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

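The scripts in this notebook use the following Python libraries. Only `beta_code` and `requests` are third-party packages that need to be installed in the environment; the others are part of the Python standard library: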

    beta_code 
    json
    os  
    pathlib
    re
    requests
    unicodedata
    xml

You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.
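Before installing anything, it can help to check which modules are actually missing. The following is a small sketch using the standard `importlib.util.find_spec` function; the list contains the import names of the two non-standard modules used in this notebook, which are not necessarily identical to their package names on PyPI.

```python
import importlib.util

# Import names of the third-party modules used in this notebook
# (the pip package name may differ from the import name)
thirdParty = ["beta_code", "requests"]

# find_spec returns None when a top-level module cannot be located
missing = [name for name in thirdParty if importlib.util.find_spec(name) is None]

if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All third-party modules are available.")
```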

# 8 - Notebook version<a class="anchor" id="bullet8"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.3</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>29 April 2025</td>
    </tr>
  </table>
</div>