Dictionary Sources and Translation Method

This page explains where our dictionary data comes from, how we merge multiple open-source English datasets, and how we generate Spanish and Indonesian glossary entries where no complete open dictionaries from our source languages into those target languages exist.

Research status as of March 1, 2026.

Open-Source English Dictionary Base

Multiple open datasets provide English-language lexical coverage for our source languages (Greek, Hebrew, Latin, and Syriac). We ingest and normalize these datasets as the core dictionary layer.

STEPBible TBESG Lexicon
Language coverage: Greek (grc)
How it is used: Primary Greek lexical gloss and dictionary backbone.
License/terms: CC BY (per upstream repository terms).

STEPBible TBESH Lexicon
Language coverage: Hebrew (hbo)
How it is used: Primary Hebrew lexical structure and cross-reference anchors.
License/terms: CC BY (per upstream repository terms).

MACULA Greek
Language coverage: Greek (grc)
How it is used: Morphological and lexical detail for Greek entries.
License/terms: Open-source terms in upstream repository.

MACULA Hebrew
Language coverage: Hebrew (hbo)
How it is used: Morphological and lexical detail for Hebrew entries.
License/terms: Open-source terms in upstream repository.

MorphGNT SBLGNT
Language coverage: Greek (grc)
How it is used: Word-level morphology and normalization support.
License/terms: Open-source terms in upstream repository.

Open Scriptures morphhb
Language coverage: Hebrew (hbo)
How it is used: Word-level morphology and normalization support.
License/terms: Open-source terms in upstream repository.

Kaikki Latin Dictionary (Wiktextract)
Language coverage: Latin (lat; source code la)
How it is used: Primary open lexical gloss coverage for Latin.
License/terms: Derived from Wiktionary under Wiktionary terms.

Kaikki Classical Syriac Dictionary (Wiktextract)
Language coverage: Syriac (syr; source code syc)
How it is used: Primary open lexical gloss coverage for Classical Syriac.
License/terms: Derived from Wiktionary under Wiktionary terms.

Systems Theology Early Church Originals Corpus
Language coverage: Latin and Syriac token extraction
How it is used: Corpus-derived lexical reinforcement from local original text witnesses.
License/terms: Composite according to each underlying upstream source license.
Local dataset

How We Merge Sources

We merge language-specific records into one canonical dictionary model while preserving provenance for each source witness and sense snapshot.

Normalize headwords, lemmas, transliteration, and Strong's identifiers where available.
Merge records by language and lexical identity with deterministic source-priority rules.
Keep per-source evidence in sourceRecords and senseRecords, not just a final flattened gloss.
Publish canonical JSONL under data/our_dictionaries for analysis and DynamoDB seeding.
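
The merge steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the input fields (language, headword, gloss, source), the SOURCE_PRIORITY order, and all function names are hypothetical. Only the sourceRecords field name comes from this page.

```python
import json
import unicodedata
from collections import defaultdict

# Hypothetical priority order: earlier sources win when glosses disagree.
SOURCE_PRIORITY = ["stepbible", "macula", "morphgnt", "kaikki"]


def priority(source):
    """Rank a source by its position in SOURCE_PRIORITY; unknown sources sort last."""
    try:
        return SOURCE_PRIORITY.index(source)
    except ValueError:
        return len(SOURCE_PRIORITY)


def normalize_headword(headword):
    """NFC-normalize and trim so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", headword).strip()


def merge_records(records):
    """Group records by (language, normalized headword) and merge each group."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["language"], normalize_headword(rec["headword"]))
        groups[key].append(rec)

    merged = []
    for (language, headword), group in sorted(groups.items()):
        # Sort by priority, then by source name, so the result does not
        # depend on input order.
        ordered = sorted(group, key=lambda r: (priority(r["source"]), r["source"]))
        # The canonical gloss is the first non-empty gloss in priority order.
        gloss = next((r["gloss"] for r in ordered if r.get("gloss")), "")
        merged.append({
            "language": language,
            "headword": headword,
            "gloss": gloss,
            # Keep every source witness, not just the winning gloss.
            "sourceRecords": [
                {"source": r["source"], "gloss": r.get("gloss", "")} for r in ordered
            ],
        })
    return merged


def to_jsonl(entries):
    """Serialize merged entries as one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False, sort_keys=True) for e in entries)
```

Sorting on both priority and source name is what keeps the output deterministic here: two runs over the same inputs in a different order produce the same canonical gloss and the same sourceRecords order.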

Why Spanish and Indonesian Needed AI Generation

We did not find a complete open-source dictionary set that maps all of our source languages directly into Spanish and Indonesian at the depth needed for this project.

Our March 1, 2026 audit of Kaikki/Wiktionary extracts showed only partial ancient-language coverage. For example, the Spanish extract contained a limited number of ancient-language entries (grc 258, hbo 14, la 6,923, syc 2), and the Indonesian extract was much smaller (grc 1, hbo 0, la 10, syc 0).
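
A per-language count like the one above could be produced with a short script along these lines. This is only a sketch of one possible approach, not the audit actually used: it assumes the extract is a JSONL file in which each entry carries a lang_code field (as Wiktextract output generally does), and the function name and code list are hypothetical.

```python
import json

# Language codes to tally, using the Wiktionary-style codes quoted on this page.
ANCIENT_CODES = ("grc", "hbo", "la", "syc")


def count_ancient_entries(lines, codes=ANCIENT_CODES):
    """Count JSONL entries per language code; codes with no entries report 0."""
    counts = {code: 0 for code in codes}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        code = entry.get("lang_code")  # assumed field name
        if code in counts:
            counts[code] += 1
    return counts
```

Because the function accepts any iterable of lines, it can be passed an open file handle directly, so a large extract is read line by line rather than loaded into memory.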

High-Trust AI Translation Process

Build a deterministic translation key from source language + original headword + English gloss.
Deduplicate and cache keys so repeated glosses are translated once and reused.
Send both the original lexical signal and the English sense anchor to the model.
Require strict JSON output keyed to each requested item, with validation and retry logic.
Write only missing target-language gloss fields by default, preserving existing verified data.
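
The process above can be sketched as follows. Everything named here is an assumption made for illustration: the entry fields (language, headword, gloss_en, gloss_es, gloss_id), the call_model callable standing in for the real model client, the choice of SHA-256 for the key, and the use of one cache per target language. The page does not specify any of these details.

```python
import hashlib
import json
import unicodedata


def translation_key(source_lang, headword, english_gloss):
    """Stable key from source language + original headword + English gloss."""
    parts = [
        source_lang.strip().lower(),
        unicodedata.normalize("NFC", headword.strip()),
        " ".join(english_gloss.split()),  # collapse runs of whitespace
    ]
    # The unit separator keeps the three fields from running together.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


def validate_response(raw, requested_keys):
    """Accept only a JSON object with a non-empty string for exactly the requested keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(requested_keys):
        return None
    if not all(isinstance(v, str) and v.strip() for v in data.values()):
        return None
    return data


def translate_missing(entries, target_lang, call_model, cache, max_attempts=3):
    """Fill only empty target-language glosses, translating each unique key once."""
    field = f"gloss_{target_lang}"
    pending = {}
    for e in entries:
        if e.get(field):
            continue  # preserve existing data
        key = translation_key(e["language"], e["headword"], e["gloss_en"])
        if key not in cache:
            # Send both the original lexical signal and the English anchor.
            pending[key] = {
                "language": e["language"],
                "headword": e["headword"],
                "english_gloss": e["gloss_en"],
            }
    if pending:
        for _ in range(max_attempts):
            result = validate_response(call_model(pending, target_lang), pending.keys())
            if result is not None:
                cache.update(result)
                break
    for e in entries:
        if not e.get(field):
            key = translation_key(e["language"], e["headword"], e["gloss_en"])
            if key in cache:
                e[field] = cache[key]
            # Otherwise the field stays empty rather than being guessed.
    return entries
```

In this sketch a response that fails validation on every attempt leaves the affected fields untouched, which matches the rule that unresolved entries stay empty.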

Why This Improves Trust

We run translation with the highest-quality GPT-5-class model configured in our environment.
Supplying the original word data together with the English gloss reduces semantic drift compared with translating directly from the source language into Spanish or Indonesian.
Deterministic caching keeps repeated terms consistent across the dataset.
Unresolved entries stay empty instead of being silently fabricated.
Upstream source attribution stays attached to each dictionary record.

Important Caveat

AI-generated target-language glosses are high-trust research translations, not infallible final editions. For publication-critical contexts, they should be reviewed against source-language and English witnesses.
