# Prepare N1904 data

## Data Provenance

Eberhard Nestle’s 1904 Greek New Testament (British and Foreign Bible Society, 1904).

Data source: [macula-greek](https://github.com/biblicalhumanities/Nestle1904) via the Text-Fabric N1904-TF dataset.

License (of the source text): Public Domain

## Data preparation

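This script preprocesses the Gospel of John and generates the following output files:
- `N1904-John.txt`: the complete text as one continuous string without line breaks.
- `N1904-John-tagged.txt`: one verse per line, each with its reference tag and content.
- `N1904-John.json`: the same verses as a list of tag/text objects.
- `node2verse.json`: a mapping from each verse to the range of word indices it spans, counted from 1 at the first word of John.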

## Extract from Text-Fabric

In [1]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [2]:
# load the N1904 app and data
N1904 = use("CenterBLC/N1904", version="1.0.0", hoist=globals())

**Locating corpus resources ...**

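| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |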


Display is setup for viewtype [syntax-view](https://github.com/CenterBLC/N1904/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/CenterBLC/N1904/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [3]:
verseQuery = '''
verse book=John
'''
verseResults = N1904.search(verseQuery)

  0.01s 879 results


In [4]:
import unicodedata
import json

def normalize(string, chars_to_remove=None):
    """
    Normalize the input string by converting it to lowercase, unifying apostrophes
    to the Greek koronis, removing diacritical marks, and optionally removing
    specified characters.
    
    Args:
        string (str): The input string to normalize
        chars_to_remove (str or list, optional): Characters to remove from the string
    
    Returns:
        str: The normalized string
    """
    # Convert to lowercase and normalize apostrophes to U+1FBD GREEK KORONIS (decimal 8125)
    string = string.lower().replace("’", "᾽").replace("ʼ","᾽")
    
    # Apply Unicode normalization (NFD) to decompose characters
    string = unicodedata.normalize('NFD', string)
    
    # Remove non-spacing marks (diacritics)
    string = ''.join(ch for ch in string if unicodedata.category(ch) != 'Mn')
    
    # Remove specified characters if provided
    if chars_to_remove is not None:
        string = ''.join(ch for ch in string if ch not in chars_to_remove)
    return string

# Lists to collect processed text fragments and tagged lines, and JSON items.
all_lines_parts = []
tagged_lines_parts = []
json_items = [] 

# Process each verse node from the collection
for verseNode in verseResults:
    # Extract book, chapter, and verse information from the current node.
    book, chapter, verse = T.sectionFromNode(verseNode[0])
    
    # Retrieve the verse text and normalize it, stripping parentheses and em-dashes.
    line_content = normalize(T.text(verseNode[0]), '(—)')

    # Append the verse's text to the list of all lines (with a preceding space for separation).
    all_lines_parts.append(' ' + line_content)

    # Build the tag as BBCCCVVV: book 43 (John), zero-padded chapter and verse.
    tag = f"43{int(chapter):03}{int(verse):03}"
    
    # Build a tagged line: tag, tab, verse text.
    tagged_lines_parts.append(f"{tag}\t{line_content}")

    # Build the JSON entry with both the tag and the text.
    json_items.append({
        "tag": tag,
        "text": line_content
    })

# Join the list of untagged lines into a single string.
all_lines = ''.join(all_lines_parts)

# Join the tagged lines with a newline separator and append a final newline.
tagged_lines = "\n".join(tagged_lines_parts) + "\n"

# Write the continuous text (all lines) to the output file.
with open('N1904-John.txt', 'w', encoding='utf-8') as f:
    f.write(all_lines)

# Write the tagged text to a separate output file.
with open('N1904-John-tagged.txt', 'w', encoding='utf-8') as f:
    f.write(tagged_lines)

# Write the JSON data to a file with indentation for readability.
with open('N1904-John.json', 'w', encoding='utf-8') as f:
    json.dump(json_items, f, ensure_ascii=False, indent=4)

## Checking the results

In [5]:
# Dump the first 300 characters of the continuous text
all_lines[:300]

' εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος.  ουτος ην εν αρχη προς τον θεον.  παντα δι᾽ αυτου εγενετο, και χωρις αυτου εγενετο ουδε εν ο γεγονεν.  εν αυτω ζωη ην, και η ζωη ην το φως των ανθρωπων.  και το φως εν τη σκοτια φαινει, και η σκοτια αυτο ου κατελαβεν.  εγενετο α'
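
The leading space and the double spaces between verses in this output follow from how the string is assembled: each verse text appears to end in a space already (see the trailing spaces in the JSON output further below), and the loop prepends another one. If single spacing is wanted downstream, a minimal sketch such as the following could collapse it. `collapse_spaces` is a hypothetical helper and not part of the notebook.

```python
import re

def collapse_spaces(text):
    """Collapse runs of whitespace to one space and trim both ends (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip()

# Sample shaped like the continuous text: leading space, double space between verses.
sample = " εν αρχη ην ο λογος.  ουτος ην εν αρχη προς τον θεον. "
print(collapse_spaces(sample))
```

Whether to apply this depends on the consumer of `N1904-John.txt`; the notebook itself writes the text unchanged.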

In [6]:
# Check the Unicode code points and names of some characters
import unicodedata
chars="εν παντα δι᾽"
for char in chars:
    print(ord(char),unicodedata.name(char)) 

949 GREEK SMALL LETTER EPSILON
957 GREEK SMALL LETTER NU
32 SPACE
960 GREEK SMALL LETTER PI
945 GREEK SMALL LETTER ALPHA
957 GREEK SMALL LETTER NU
964 GREEK SMALL LETTER TAU
945 GREEK SMALL LETTER ALPHA
32 SPACE
948 GREEK SMALL LETTER DELTA
953 GREEK SMALL LETTER IOTA
8125 GREEK KORONIS
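
The koronis is still present after normalization even though accents and breathings are gone. A short sketch of why, using only the standard library: NFD applies canonical decompositions only, and the filter in `normalize` drops characters of category `Mn` only. The sample letter below is an illustration chosen here, not taken from the notebook.

```python
import unicodedata

KORONIS = "\u1fbd"  # GREEK KORONIS, code point 8125

# NFD leaves the koronis as a single, unchanged character.
nfd = unicodedata.normalize("NFD", KORONIS)
print(nfd == KORONIS)

# Its general category is not 'Mn', so the diacritic filter keeps it.
print(unicodedata.category(KORONIS))

# A precomposed accented letter, by contrast, splits into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "\u03cc")  # omicron with tonos
print([unicodedata.name(ch) for ch in decomposed])
```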


In [7]:
# Print the first 300 characters of the tagged text
print(tagged_lines[:300])

43001001	εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. 
43001002	ουτος ην εν αρχη προς τον θεον. 
43001003	παντα δι᾽ αυτου εγενετο, και χωρις αυτου εγενετο ουδε εν ο γεγονεν. 
43001004	εν αυτω ζωη ην, και η ζωη ην το φως των ανθρωπων. 
43001005	και το φως εν τη σκοτια φαινει
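
Each tagged line is an eight-digit tag, a tab, and the verse text. The tag is laid out as two digits for the book, three for the chapter and three for the verse. The prefix 43 is hard-coded in the notebook and is presumably John's number in a conventional whole-Bible book numbering; that reading is an assumption. A minimal sketch for reading such lines back, with `parse_tag` and `parse_tagged_line` as hypothetical helper names:

```python
def parse_tag(tag):
    """Split an eight-digit BBCCCVVV tag into (book, chapter, verse) integers."""
    if len(tag) != 8 or not tag.isdigit():
        raise ValueError(f"expected 8 digits, got {tag!r}")
    return int(tag[:2]), int(tag[2:5]), int(tag[5:])

def parse_tagged_line(line):
    """Split one 'tag<TAB>text' line into ((book, chapter, verse), text)."""
    tag, text = line.rstrip("\n").split("\t", 1)
    return parse_tag(tag), text

# Line copied from the tagged output above, including its trailing space.
ref, text = parse_tagged_line("43001002\tουτος ην εν αρχη προς τον θεον. \n")
print(ref, repr(text))
```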


In [8]:
# Dump the first two JSON items
print(json.dumps(json_items[:2], ensure_ascii=False, indent=4))

[
    {
        "tag": "43001001",
        "text": "εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. "
    },
    {
        "tag": "43001002",
        "text": "ουτος ην εν αρχη προς τον θεον. "
    }
]


## Create word-node to verse mapping

In [9]:
import json

verseQuery = '''
verse book=John
'''
verseResults = N1904.search(verseQuery)

output_data = []
# Running 1-based word index within John (not the Text-Fabric node number).
wordIndex = 1
for (verse,) in verseResults:
    # Get the number of words in the verse.
    verseLength = len(L.d(verse, "word"))
    # Retrieve the book, chapter, and verse number.
    book, chapter, verseNum = T.sectionFromNode(verse)
    location = f"{book} {chapter}:{verseNum}"
    # Tag format BBCCCVVV, e.g. 43001001 for John 1:1.
    tag = f"43{chapter:03}{verseNum:03}"
    # Inclusive range of word indices covered by this verse.
    start = wordIndex
    wordIndex += verseLength
    end = wordIndex - 1

    # Create a dictionary for this verse.
    entry = {"location": location, "tag": tag, "start": start, "end": end}
    output_data.append(entry)

    # Optional: print the mapping for debugging.
    # print(f"{location} / {tag} - {start} {end}")

# Write the list of verse mappings to a JSON file.
with open("node2verse.json", "w", encoding="utf-8") as out_file:
    json.dump(output_data, out_file, ensure_ascii=False, indent=2)

  0.01s 879 results
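
The mapping stores, per verse, an inclusive `start`..`end` range of word indices. The sketch below rebuilds that structure for the first two verses without Text-Fabric and then looks up the verse containing a given word index with a binary search. It counts whitespace-separated tokens instead of Text-Fabric word nodes, which is an assumption, and `verse_for_word` is a hypothetical helper.

```python
from bisect import bisect_right

# Normalized verse texts copied from the output shown earlier in this notebook.
verses = [
    ("John 1:1", "43001001",
     "εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος."),
    ("John 1:2", "43001002", "ουτος ην εν αρχη προς τον θεον."),
]

# Build start/end ranges with the same running-index logic as the cell above.
mapping = []
word_index = 1
for location, tag, text in verses:
    length = len(text.split())  # assumption: one token per word node
    mapping.append({"location": location, "tag": tag,
                    "start": word_index, "end": word_index + length - 1})
    word_index += length

starts = [entry["start"] for entry in mapping]

def verse_for_word(index):
    """Return the mapping entry whose start..end range contains index."""
    pos = bisect_right(starts, index) - 1
    if pos < 0 or index > mapping[pos]["end"]:
        raise KeyError(index)
    return mapping[pos]

print(verse_for_word(18)["location"])
```

With the full `node2verse.json` loaded via `json.load`, the same lookup would apply unchanged, since the entries carry the same `start` and `end` keys.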


## Notebook version details<a class="anchor" id="bullet5"></a>

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>25 February 2025</td>
    </tr>
  </table>
</div>