# Prepare SR data

## Data Provenance

The data was created and provided by the [Center for New Testament Restoration](https://greekcntr.org/home/index.htm).

Licence: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en).

Data source: [Statistical Restoration Greek New Testament (GitHub)](https://github.com/Center-for-New-Testament-Restoration/SR)

Bunning, Alan, ed. *Statistical Restoration Greek New Testament.* Center for New Testament Restoration. 2022.

[SR Introduction.pdf](https://github.com/Center-for-New-Testament-Restoration/SR/blob/main/SR%20Introduction.pdf)

## Data preparation

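This script preprocesses the data and writes three output files:
- `SR-John.txt`: the complete normalized text as one continuous string without line breaks.
- `SR-John-tagged.txt`: one verse per line, consisting of the 8-character reference tag, a tab, and the normalized verse text.
- `SR-John.json`: the same verses as a JSON list of objects with `tag` and `text` fields.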

In [1]:
import unicodedata
import json

def normalize(string, chars_to_remove=None):
    """
    Normalize the input string: convert it to lowercase, map apostrophe variants
    to GREEK KORONIS, remove diacritical marks, and optionally remove the given characters.
    
    Args:
        string (str): The input string to normalize
        chars_to_remove (str or list, optional): Characters to remove from the string
    
    Returns:
        str: The normalized string
    """
    # Convert to lowercase and map apostrophe variants to U+1FBD GREEK KORONIS (decimal 8125)
    string = string.lower().replace("’", "᾽").replace("ʼ","᾽")
    # Apply Unicode normalization (NFD) to decompose characters
    string = unicodedata.normalize('NFD', string)
    # Remove non-spacing marks (diacritics)
    string = ''.join(ch for ch in string if unicodedata.category(ch) != 'Mn')
    # Remove specified characters if provided
    if chars_to_remove is not None:
        string = ''.join(ch for ch in string if ch not in chars_to_remove)
    return string

# Path to the source file
file_path = r'source/SR-John.txt'

# Lists to collect processed text fragments and tagged lines, and JSON items.
all_lines_parts = []
tagged_lines_parts = []
json_items = []

# Open the source file for reading using UTF-8 encoding.
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
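        # Split the line into the 8-character tag and the verse text.
        # The text keeps its leading space, which separates verses in the continuous output.
        tag = line[:8]
        text = normalize(line[8:], '¶˚').rstrip('\n')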
        all_lines_parts.append(text)
        # Format the tagged line and add a newline.
        tagged_lines_parts.append(f"{tag}\t{text}\n")
        # Create a JSON entry for this line.
        json_items.append({
            "tag": tag,
            "text": text
        })

# Join all parts into single strings.
all_lines = ''.join(all_lines_parts)
tagged_lines = ''.join(tagged_lines_parts)

# Write the continuous text to the output file.
with open('SR-John.txt', 'w', encoding='utf-8') as f:
    f.write(all_lines)

# Write the tagged text to a separate output file.
with open('SR-John-tagged.txt', 'w', encoding='utf-8') as f:
    f.write(tagged_lines)

# Write the JSON data to an output file with indentation for readability.
with open('SR-John.json', 'w', encoding='utf-8') as f:
    json.dump(json_items, f, ensure_ascii=False, indent=4)
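
To make the stages inside `normalize` concrete, the following minimal sketch traces a polytonic sample through lowercasing, NFD decomposition, and removal of non-spacing marks. The sample string and the helper names `decompose_names` and `strip_marks` are illustrative assumptions and are not part of the original notebook.

```python
import unicodedata

def decompose_names(text):
    """Return the Unicode names of the code points in the NFD form of text."""
    return [unicodedata.name(c) for c in unicodedata.normalize('NFD', text)]

def strip_marks(string):
    """Lowercase, decompose (NFD), and drop non-spacing marks (category Mn)."""
    decomposed = unicodedata.normalize('NFD', string.lower())
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# Illustrative sample: the opening words of John 1:1 with full diacritics.
sample = "Ἐν ἀρχῇ ἦν ὁ Λόγος"

# U+1FC7 (eta with perispomeni and ypogegrammeni) splits into a base letter
# followed by two combining marks.
eta_parts = decompose_names("\u1fc7")
stripped = strip_marks(sample)

print(eta_parts)
print(stripped)
```

Because every combining mark produced by the decomposition has category `Mn`, filtering on that category removes breathings, accents, and the iota subscript in one pass while leaving the base letters untouched.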

## Checking the results

In [2]:
# Dump the first 300 characters of the continuous text
all_lines[:300]

' εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. ουτος ην εν αρχη προς τον θεον. παντα δι᾽ αυτου εγενετο, και χωρις αυτου εγενετο ουδε εν ο γεγονεν. εν αυτω ζωη ην, και η ζωη ην το φως των ανθρωπων. και το φως εν τη σκοτια φαινει, και η σκοτια αυτο ου κατελαβεν. εγενετο ανθρωπ'

In [3]:
# Check the Unicode code points of the first two words
chars = "εν αρχη"
for char in chars:
    print(ord(char), unicodedata.name(char))

949 GREEK SMALL LETTER EPSILON
957 GREEK SMALL LETTER NU
32 SPACE
945 GREEK SMALL LETTER ALPHA
961 GREEK SMALL LETTER RHO
967 GREEK SMALL LETTER CHI
951 GREEK SMALL LETTER ETA
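
The spot check above covers only seven characters. As a broader sanity check, a sketch like the following tallies Unicode general categories over a whole string, so that any leftover combining mark (`Mn`) stands out. The helper name `category_counts` and the sample string are assumptions for illustration; in the notebook, `all_lines` could be passed instead.

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count the characters of text by Unicode general category (e.g. Ll, Zs, Po)."""
    return Counter(unicodedata.category(ch) for ch in text)

# Illustrative sample taken from the normalized output of John 1:1.
sample = " εν αρχη ην ο λογος, και ο λογος ην προς τον θεον."

counts = category_counts(sample)
# A Counter returns 0 for missing keys, so this is safe when no marks remain.
leftover_marks = counts['Mn']

print(dict(counts))
print("combining marks left:", leftover_marks)
```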


In [4]:
# Print the first 300 characters of the tagged text
print(tagged_lines[:300])

43001001	 εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος.
43001002	 ουτος ην εν αρχη προς τον θεον.
43001003	 παντα δι᾽ αυτου εγενετο, και χωρις αυτου εγενετο ουδε εν ο γεγονεν.
43001004	 εν αυτω ζωη ην, και η ζωη ην το φως των ανθρωπων.
43001005	 και το φως εν τη σκοτια φαινε
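
The 8-digit tags in this output appear to encode book, chapter, and verse as `BBCCCVVV` (for example, `43001001` would be book 43, chapter 1, verse 1). That layout is inferred from the sample output rather than confirmed by the source documentation, so the following sketch, with its hypothetical helpers `parse_tag` and `parse_tagged_line`, should be read under that assumption.

```python
def parse_tag(tag):
    """Split an 8-digit tag into (book, chapter, verse), assuming BBCCCVVV."""
    if len(tag) != 8 or not (tag.isascii() and tag.isdigit()):
        raise ValueError(f"expected 8 ASCII digits, got {tag!r}")
    return int(tag[:2]), int(tag[2:5]), int(tag[5:])

def parse_tagged_line(line):
    """Split a 'tag<TAB>text' line in the format written to SR-John-tagged.txt."""
    tag, text = line.rstrip('\n').split('\t', 1)
    return parse_tag(tag), text

ref, text = parse_tagged_line("43001002\t ουτος ην εν αρχη προς τον θεον.\n")
print(ref, text.strip())
```

Splitting on the first tab only keeps the verse text intact even if it should ever contain a tab itself.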


In [5]:
# Dump the first two JSON items
print(json.dumps(json_items[:2], ensure_ascii=False, indent=4))

[
    {
        "tag": "43001001",
        "text": " εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος."
    },
    {
        "tag": "43001002",
        "text": " ουτος ην εν αρχη προς τον θεον."
    }
]
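
A list of objects with `tag` and `text` fields is convenient to serialize, but lookups by reference are easier with a dictionary. As a sketch of how the JSON output might be consumed, the following builds a tag-to-text index from a small in-memory sample. The variable names and the choice to index by tag are assumptions, not part of the original notebook.

```python
import json

# In-memory stand-in for the contents of SR-John.json (first two items only).
raw = json.dumps([
    {"tag": "43001001", "text": " εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος."},
    {"tag": "43001002", "text": " ουτος ην εν αρχη προς τον θεον."},
], ensure_ascii=False)

items = json.loads(raw)

# Index the verses by tag and strip the leading space kept by the preprocessing.
verse_by_tag = {item["tag"]: item["text"].strip() for item in items}

print(verse_by_tag["43001002"])
```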


## Notebook version details<a class="anchor" id="bullet5"></a>

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>25 February 2025</td>
    </tr>
  </table>
</div>