# Prepare Tischendorf data

## Data provenance

- Title: The New Testament in Koine Greek, based on Tischendorf's 8th edition
- License: Public Domain
- Language: Ελληνικά (Greek, Ancient)
- Dialect: Koine
- Translation by: Tischendorf, etc.

Tischendorf's 8th edition Greek New Testament with morphological tags, version 2.7. Based on G. Clint Yale's Tischendorf text and on Dr. Maurice A. Robinson's Public Domain Westcott-Hort text. Edited by Ulrik Sandborg-Petersen. This text and its analysis are in the Public Domain. Copy freely.

Data source: [ebible.org](https://ebible.org/details.php?id=grc-tisch)

## Data preparation

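This script preprocesses the Gospel of John (the source files whose names start with `grc-tisch_073_JHN`) and generates three output files:
- `TISCH-John.txt`: the complete text as one continuous string without line breaks.
- `TISCH-John-tagged.txt`: one verse per line, each consisting of its reference tag, a tab, and the verse text.
- `TISCH-John.json`: a list of objects, one per verse, with the keys `tag` and `text`.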

In [1]:
import os
import unicodedata
import json

def normalize(string, chars_to_remove=None):
    """
    Normalize the input string by converting it to lowercase, removing diacritical marks,
    and optionally removing specified characters from a list.
    
    Args:
        string (str): The input string to normalize
        chars_to_remove (str or list, optional): Characters to remove from the string
    
    Returns:
        str: The normalized string
    """
    # Convert to lowercase and map apostrophes to U+1FBD GREEK KORONIS (decimal 8125)
    string = string.lower().replace("’", "᾽").replace("ʼ","᾽")
    # Apply Unicode normalization (NFD) to decompose characters
    string = unicodedata.normalize('NFD', string)
    # Remove non-spacing marks (diacritics)
    string = ''.join(ch for ch in string if unicodedata.category(ch) != 'Mn')
    # Remove specified characters if provided
    if chars_to_remove is not None:
        string = ''.join(ch for ch in string if ch not in chars_to_remove)
    return string

# Directory where the source files are located
directory = r'source'
prefix = 'grc-tisch_073_JHN'

# Get all filenames in the directory that start with the prefix and sort them alphabetically
file_list = sorted(
    f for f in os.listdir(directory)
    if f.startswith(prefix) and os.path.isfile(os.path.join(directory, f))
)

# Lists to collect processed text fragments and tagged lines, and JSON items.
all_line_parts = []
tagged_line_parts = []
json_items = []

chapter = 0
for filename in file_list:
    chapter += 1
    file_path = os.path.join(directory, filename)
    with open(file_path, 'r', encoding='utf-8') as file:
        verse = 0
        for line in file:
            verse += 1  # Increment the verse counter for each line
            # Normalize the line, drop the byte order mark (U+FEFF), and strip the trailing newline
            line_content = normalize(line, '\ufeff').rstrip('\n')
            all_line_parts.append(line_content)
            # Generate the tag: book number (43 = John), 3-digit chapter, 3-digit verse.
            tag = f"43{chapter:03}{verse:03}"
            tagged_line_parts.append(f"{tag}\t{line_content}\n")
            # Append a JSON entry for this line.
            json_items.append({
                "tag": tag,
                "text": line_content
            })

# Combine all text fragments into a single continuous string.
all_lines = ''.join(all_line_parts)
tagged_lines = ''.join(tagged_line_parts)

# Write the continuous text to an output file.
with open('TISCH-John.txt', 'w', encoding='utf-8') as f:
    f.write(all_lines)
    
# Write the tagged text to a separate output file.
with open('TISCH-John-tagged.txt', 'w', encoding='utf-8') as f:
    f.write(tagged_lines)

# Write the JSON data to an output file with indentation for readability.
with open('TISCH-John.json', 'w', encoding='utf-8') as f:
    json.dump(json_items, f, ensure_ascii=False, indent=4)

# Checking the results

In [2]:
# Dump the first 300 characters of the continuous text
all_lines[:300]

'εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. ουτος ην εν αρχη προς τον θεον. παντα δι᾽ αυτου εγενετο, και χωρις αυτου εγενετο ουδε εν ο γεγονεν εν αυτω ζωη εστιν, και η ζωη ην το φως των ανθρωπων. και το φως εν τη σκοτια φαινει, και η σκοτια αυτο ου κατελαβεν. εγενετο ανθρω'

In [3]:
# Check the Unicode code point and name of each character in a short sample
import unicodedata
chars="εν αρχη δι᾽"
for char in chars:
    print(ord(char),unicodedata.name(char)) 

949 GREEK SMALL LETTER EPSILON
957 GREEK SMALL LETTER NU
32 SPACE
945 GREEK SMALL LETTER ALPHA
961 GREEK SMALL LETTER RHO
967 GREEK SMALL LETTER CHI
951 GREEK SMALL LETTER ETA
32 SPACE
948 GREEK SMALL LETTER DELTA
953 GREEK SMALL LETTER IOTA
8125 GREEK KORONIS
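
To make the effect of `normalize` easier to see, the following minimal sketch repeats its three steps (lowercasing with apostrophe mapping, NFD decomposition, removal of `Mn` characters) on a short accented sample. The function name `strip_diacritics` and the sample string are illustrative assumptions and not part of the notebook.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Hypothetical helper mirroring the steps of normalize() above."""
    # Lowercase and map U+2019 / U+02BC apostrophes to U+1FBD GREEK KORONIS
    text = text.lower().replace("\u2019", "\u1fbd").replace("\u02bc", "\u1fbd")
    # Decompose precomposed letters into base letter + combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category Mn): accents, breathings, iota subscript
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sample = "Ἐν ἀρχῇ ἦν ὁ λόγος"  # illustrative accented sample
result = strip_diacritics(sample)
print(result)

elided = strip_diacritics("δι\u2019")
print([hex(ord(ch)) for ch in elided])
```

The koronis is kept because U+1FBD has no canonical decomposition and belongs to category `Sk` rather than `Mn`, which matches the `8125 GREEK KORONIS` line in the output above.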


In [4]:
# Print the first 300 characters of the tagged text
print(tagged_lines[:300])

43001001	εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. 
43001002	ουτος ην εν αρχη προς τον θεον. 
43001003	παντα δι᾽ αυτου εγενετο, και χωρις αυτου εγενετο ουδε εν ο γεγονεν 
43001004	εν αυτω ζωη εστιν, και η ζωη ην το φως των ανθρωπων. 
43001005	και το φως εν τη σκοτια φαιν
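
A consumer of the tagged file will usually want to split each line back into a reference and the verse text. The sketch below assumes the tag layout seen in the output above (two digits for the book, three for the chapter, three for the verse, followed by a tab). The names `VerseRef` and `parse_tagged_line` are hypothetical and not part of the notebook.

```python
from typing import NamedTuple, Tuple


class VerseRef(NamedTuple):
    """Hypothetical container for a parsed reference tag."""
    book: int
    chapter: int
    verse: int


def parse_tagged_line(line: str) -> Tuple[VerseRef, str]:
    # Split on the first tab only, so the verse text stays intact
    tag, text = line.rstrip("\n").split("\t", 1)
    # Assumed layout: BBCCCVVV (book, chapter, verse)
    ref = VerseRef(int(tag[:2]), int(tag[2:5]), int(tag[5:8]))
    return ref, text


sample_line = "43001002\tουτος ην εν αρχη προς τον θεον. \n"
ref, text = parse_tagged_line(sample_line)
print(ref.book, ref.chapter, ref.verse)
print(repr(text))
```

Only the newline is stripped here, so the trailing space that the source lines carry is preserved in `text`.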


In [5]:
# Dump the first two JSON items
print(json.dumps(json_items[:2], ensure_ascii=False, indent=4))

[
    {
        "tag": "43001001",
        "text": "εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. "
    },
    {
        "tag": "43001002",
        "text": "ουτος ην εν αρχη προς τον θεον. "
    }
]
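
The JSON structure shown above lends itself to a lookup by tag. The following sketch is one possible way to use it. To stay self-contained it round-trips a small in-memory sample instead of reading `TISCH-John.json`; the names `sample_items` and `verse_by_tag` are assumptions made for illustration.

```python
import json

# Two items in the same shape as the notebook's JSON output
sample_items = [
    {"tag": "43001001", "text": "εν αρχη ην ο λογος, και ο λογος ην προς τον θεον, και θεος ην ο λογος. "},
    {"tag": "43001002", "text": "ουτος ην εν αρχη προς τον θεον. "},
]

# Serialize and parse again, as json.dump / json.load would do with a file
serialized = json.dumps(sample_items, ensure_ascii=False, indent=4)
items = json.loads(serialized)

# Build a dictionary that maps each tag to its verse text
verse_by_tag = {item["tag"]: item["text"] for item in items}
print(verse_by_tag["43001002"])
```

With the real file, the `json.loads(serialized)` step would be replaced by `json.load` on a file opened with `encoding='utf-8'`.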


# Notebook version details<a class="anchor" id="bullet5"></a>

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>25 February 2025</td>
    </tr>
  </table>
</div>