# Generate a test-set file for Morpheus

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load N1904-TF with N1904addons</a>
* <a href="#bullet3">3 - Obtain morph, betacode and lemma</a>
* <a href="#bullet4">4 - Now save the test-set file</a>
* <a href="#bullet5">5 - Footnotes and attribution</a>
* <a href="#bullet6">6 - Required libraries</a>
* <a href="#bullet7">7 - Notebook version</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook uses the feature [betacode](https://github.com/tonyjurg/N1904addons/blob/main/docs/features/betacode.md) to generate a test set for Morpheus. Each entry in the test set corresponds to a unique SP-tag value in the N1904-TF dataset for the Greek New Testament. An associated word and its lemma, both encoded in Betacode, are added to the tag.

The file therefore contains tab-separated triplets like:
```txt
# tag <TAB> word <TAB> lemma
N-GSM        *)ihsou=      )ihsou=s
...
```
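
As a minimal sketch of how a consumer might read one such line, assuming the three fields are separated by single tab characters. The names `Entry` and `parse_test_line` are hypothetical and not part of the notebook:

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One line of the test set (hypothetical container, for illustration only)."""
    tag: str
    word: str
    lemma: str


def parse_test_line(line: str) -> Entry:
    """Split a tab-separated test-set line into tag, word and lemma."""
    tag, word, lemma = line.rstrip("\n").split("\t")
    return Entry(tag, word, lemma)


# The sample line is the one shown above, with explicit tab characters
entry = parse_test_line("N-GSM\t*)ihsou=\t)ihsou=s\n")
print(entry.tag, entry.word, entry.lemma)
```

Splitting on the tab character only (rather than on any whitespace) keeps the Betacode fields intact even if they should ever contain spaces.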

# 2 - Load N1904-TF with N1904addons <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

Since we want our Morpheus-related feature set to be compatible with the N1904-TF dataset, we start by loading that dataset.

In [4]:
# Load the autoreload extension to automatically reload modules before executing code
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
# Loading the Text-Fabric code
from tf.fabric import Fabric
from tf.app import use

In [6]:
# Load the N1904-TF app and data with the additional features
A = use ("CenterBLC/N1904", version="1.0.0", mod="tonyjurg/N1904addons/tf/", hoist=globals())

**Locating corpus resources ...**

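| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |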


Display is setup for viewtype [syntax-view](https://github.com/CenterBLC/N1904/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/CenterBLC/N1904/blob/main/docs/viewtypes.md#start) for more information on viewtypes

# 3 - Obtain morph, betacode and lemma<a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

Using the previously created feature [betacode](https://github.com/tonyjurg/N1904addons/blob/main/docs/features/betacode.md), this is straightforward. The following script scans every word node in the TF dataset and reads its morphological tag together with the Betacode forms of the word and its lemma. It keeps a single representative example (the first one encountered) for each unique tag. The results are stored in a dictionary keyed by morph tag; each entry holds the tag plus its example word and lemma in Betacode.

In [52]:
from collections import defaultdict
import beta_code

# Step 1: Initialize a defaultdict to group morph examples
# Each key will be a full morph tag, mapping to its example data
morphsDict = defaultdict(dict)

# Step 2: Iterate over all word nodes using the otype feature
for wordNode in F.otype.s('word'):
    
    # Retrieve the morphological tag associated with this word node
    morph = F.morph.v(wordNode)

    # Skip nodes without a morph tag
    if not morph: 
        continue
    
    # Use setdefault to get or create the inner dict for this morph tag
    # This ensures we only keep one entry per unique full tag
    group = morphsDict.setdefault(morph, {})

    # If we've already recorded an example for this morph tag, skip it
    if morph in group:
        continue

    # Otherwise, store the data for this morph tag
    group[morph] = {
        "morph":         morph,
        "betaCodeWord":  F.betacode.v(wordNode),
        "betaCodeLemma": beta_code.greek_to_beta_code(F.lemma.v(wordNode))
    }

# Now morphsDict contains one example per unique morph tag

In [53]:
# We can now easily check the length of the dictionary (which should be 1055, as we know from earlier tests)
print (f'Length={len(morphsDict)}')

Length=1055


In [54]:
# we can also dump the first entry from the dictionary
# get the first (morph_tag, group_dict) pair
first_tag, first_group = next(iter(morphsDict.items()))

# print the tag
print("Morph tag:", first_tag)

# print the whole inner dict for that tag
print("Data:", first_group)

Morph tag: N-NSF
Data: {'N-NSF': {'morph': 'N-NSF', 'betaCodeWord': '*bi/blos', 'betaCodeLemma': 'bi/blos'}}
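
As the dump shows, each value in `morphsDict` is an inner dictionary that is keyed by the same tag again. A flat dictionary would carry the same information. The following is only a sketch of that alternative, not the notebook's method: it uses hypothetical sample tuples in place of the Text-Fabric word nodes, and the function name `first_example_per_tag` is an assumption made for illustration.

```python
def first_example_per_tag(words):
    """Keep the first (word, lemma) pair seen for each morph tag.

    `words` is an iterable of (morph, word, lemma) tuples;
    items with an empty or missing tag are skipped.
    """
    examples = {}
    for morph, word, lemma in words:
        if not morph:
            continue
        # setdefault stores the value only when the key is not present yet
        examples.setdefault(morph, {"betaCodeWord": word, "betaCodeLemma": lemma})
    return examples


# Hypothetical sample data standing in for the word nodes of the dataset
sample = [
    ("N-NSF", "*bi/blos", "bi/blos"),
    ("N-GSM", "*)ihsou=", ")ihsou=s"),
    ("N-NSF", "placeholder-word", "placeholder-lemma"),  # repeated tag, ignored
    (None, "placeholder-word", "placeholder-lemma"),     # no tag, skipped
]
examples = first_example_per_tag(sample)
print(len(examples))
```

With this layout the writing step would need a single loop over `examples.items()` instead of two nested loops.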


# 4 - Now save the test-set file <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

The following script creates the file that can be used to run the tests on Morpheus. It writes each morph-tag example collected in the previous section to a plain-text file, one example per line, with fields separated by tabs. After iterating through the stored examples, it reports how many unique tags were written and the filename used.

In [57]:
# Path to output file
output_path = 'test-set.txt'

# Open the file for writing
with open(output_path, 'w', encoding='utf-8') as f:
    # Loop over each stored example
    for group in morphsDict.values():
        for entry in group.values():
            # Write: morph <TAB> betaCodeWord <TAB> betaCodeLemma <NEWLINE>
            f.write(f"{entry['morph']}\t{entry['betaCodeWord']}\t{entry['betaCodeLemma']}\n")

print(f"Wrote {len(morphsDict)} entries to {output_path}")

Wrote 1055 entries to test-set.txt
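
A quick way to check such a file is to read it back and confirm that every line has three fields and that no tag occurs twice. The sketch below assumes the tab-separated layout described above. It writes a small demo file to a temporary directory rather than touching `test-set.txt`, and the helper name `read_test_set` is hypothetical.

```python
import csv
import os
import tempfile


def read_test_set(path):
    """Read a tab-separated test-set file into a list of (tag, word, lemma) tuples."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters as ordinary text in the Betacode fields
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [tuple(row) for row in reader]


# Demo data: the two examples already shown in this notebook
rows_in = [("N-NSF", "*bi/blos", "bi/blos"), ("N-GSM", "*)ihsou=", ")ihsou=s")]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo-test-set.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        for tag, word, lemma in rows_in:
            f.write(f"{tag}\t{word}\t{lemma}\n")
    rows_out = read_test_set(demo_path)

tags = [row[0] for row in rows_out]
all_triplets = all(len(row) == 3 for row in rows_out)
tags_unique = len(set(tags)) == len(tags)
print(len(rows_out), all_triplets, tags_unique)
```

Calling `read_test_set(output_path)` on the real file should, under the same assumptions, return as many rows as the number of entries reported above.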


# 5 - Footnotes and attribution<a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

This Jupyter notebook used the following sources for the analysis and implementation:

- [Greek Beta Code standard](https://stephanus.tlg.uci.edu/encoding/BCM.pdf)
- Python package [beta-code-py](https://github.com/perseids-tools/beta-code-py)

# 6 - Required libraries<a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

Since the scripts in this notebook use Text-Fabric, they [currently (Apr 2025) require Python >=3.9.0](https://pypi.org/project/text-fabric), together with the following libraries available in the environment:

    beta_code 
    collections

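You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`. Note that `collections` is part of the Python standard library and needs no separate installation.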
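
To see which of the required modules are missing before running the notebook, a small check along the following lines may help. It is only a sketch: the helper name `missing_modules` is hypothetical, and the module names are the import names used in this notebook (`tf` for Text-Fabric, `beta_code`, `collections`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names as used in the cells above
print(missing_modules(["tf", "beta_code", "collections"]))
```

An empty list means all three modules can be imported; any name that is printed still needs to be installed.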

# 7 - Notebook version<a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.1</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>15 May 2025</td>
    </tr>
  </table>
</div>