Dictionary Sources and Translation Method
This page explains where our dictionary data comes from, how we merge multiple open-source English datasets, and how we generate Spanish and Indonesian glossary entries where no complete open dictionary exists from the source languages into those target languages.
Research status as of March 1, 2026.
Open-Source English Dictionary Base
Multiple open datasets provide English lexical coverage for our source languages (Greek, Hebrew, Latin, and Syriac). We ingest and normalize these as the core dictionary layer.
STEPBible TBESG Lexicon
Language coverage: Greek (grc)
How it is used: Primary Greek lexical gloss and dictionary backbone.
License/terms: CC BY (per upstream repository terms).
STEPBible TBESH Lexicon
Language coverage: Hebrew (hbo)
How it is used: Primary Hebrew lexical structure and cross-reference anchors.
License/terms: CC BY (per upstream repository terms).
MACULA Greek
Language coverage: Greek (grc)
How it is used: Morphological and lexical detail for Greek entries.
License/terms: Open-source terms in upstream repository.
MACULA Hebrew
Language coverage: Hebrew (hbo)
How it is used: Morphological and lexical detail for Hebrew entries.
License/terms: Open-source terms in upstream repository.
MorphGNT SBLGNT
Language coverage: Greek (grc)
How it is used: Word-level morphology and normalization support.
License/terms: Open-source terms in upstream repository.
Open Scriptures morphhb
Language coverage: Hebrew (hbo)
How it is used: Word-level morphology and normalization support.
License/terms: Open-source terms in upstream repository.
Kaikki Latin Dictionary (Wiktextract)
Language coverage: Latin (lat; source code la)
How it is used: Primary open lexical gloss coverage for Latin.
License/terms: Derived from Wiktionary under Wiktionary terms.
Kaikki Classical Syriac Dictionary (Wiktextract)
Language coverage: Syriac (syr; source code syc)
How it is used: Primary open lexical gloss coverage for Classical Syriac.
License/terms: Derived from Wiktionary under Wiktionary terms.
Systems Theology Early Church Originals Corpus
Language coverage: Latin and Syriac token extraction
How it is used: Corpus-derived lexical reinforcement from local original text witnesses.
License/terms: Composite according to each underlying upstream source license.
Local dataset
How We Merge Sources
We merge language-specific records into one canonical dictionary model while preserving provenance for each source witness and sense snapshot.
- Normalize headwords, lemmas, transliteration, and Strong's identifiers where available.
- Merge records by language and lexical identity with deterministic source-priority rules.
- Keep per-source evidence in sourceRecords and senseRecords, not just a final flattened gloss.
- Publish canonical JSONL under data/our_dictionaries for analysis and DynamoDB seeding.
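The merge steps above can be sketched roughly as follows. This is a minimal illustration, not our production pipeline: the source identifiers in SOURCE_PRIORITY, the record fields (language, lemma, source, gloss), and the function names are all assumptions made for the example. Only the sourceRecords field name comes from the description above.

```python
import json
import unicodedata
from collections import defaultdict

# Hypothetical priority order: earlier entries win when sources disagree.
SOURCE_PRIORITY = ["stepbible_tbesg", "macula_greek", "morphgnt_sblgnt"]


def normalize_headword(text: str) -> str:
    """NFC-normalize and trim so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text.strip())


def source_rank(record: dict) -> int:
    """Rank a record by its source; unknown sources sort last."""
    source = record["source"]
    if source in SOURCE_PRIORITY:
        return SOURCE_PRIORITY.index(source)
    return len(SOURCE_PRIORITY)


def merge_records(records: list[dict]) -> list[dict]:
    """Group by (language, normalized lemma) and pick the gloss by priority."""
    groups = defaultdict(list)
    for record in records:
        key = (record["language"], normalize_headword(record["lemma"]))
        groups[key].append(record)

    merged = []
    for (language, lemma), group in sorted(groups.items()):
        # Tie-break on source name so the result never depends on input order.
        ordered = sorted(group, key=lambda r: (source_rank(r), r["source"]))
        merged.append({
            "language": language,
            "lemma": lemma,
            "gloss": ordered[0]["gloss"],
            # Keep every witness, not just the winning gloss.
            "sourceRecords": [
                {"source": r["source"], "gloss": r["gloss"]} for r in ordered
            ],
        })
    return merged


def to_jsonl(entries: list[dict]) -> str:
    """Serialize one entry per line, keeping non-ASCII headwords readable."""
    return "\n".join(
        json.dumps(entry, ensure_ascii=False, sort_keys=True) for entry in entries
    )
```

Sorting on both the priority rank and the source name is what makes the rule deterministic in this sketch: two runs over the same inputs in a different order produce the same canonical entry.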
Why Spanish and Indonesian Needed AI Generation
We did not find a complete open-source dictionary set that covers all of our source languages directly into Spanish and Indonesian at the depth needed for this project.
Our March 1, 2026 audit of Kaikki/Wiktionary extracts showed only partial ancient-language coverage. The Spanish extract contained a limited number of ancient-language entries (grc 258, hbo 14, la 6,923, syc 2), and the Indonesian extract contained far fewer (grc 1, hbo 0, la 10, syc 0).
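A coverage count of this kind can be reproduced with a short script. The sketch below assumes the extract is JSONL with one entry per line and a lang_code field on each entry, which is how Wiktextract-style data is typically laid out. The function name is hypothetical, and this is not necessarily the exact script used for the audit.

```python
import json

# Language codes as they appear in the extracts (Latin is "la", Syriac is "syc").
ANCIENT_CODES = ("grc", "hbo", "la", "syc")


def count_ancient_entries(lines) -> dict:
    """Count entries per ancient-language code in an iterable of JSONL lines."""
    counts = {code: 0 for code in ANCIENT_CODES}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the audit
        if isinstance(record, dict) and record.get("lang_code") in counts:
            counts[record["lang_code"]] += 1
    return counts
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, so a large extract does not need to be loaded into memory.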
High-Trust AI Translation Process
- Build a deterministic translation key from source language + original headword + English gloss.
- Deduplicate and cache keys so repeated glosses are translated once and reused.
- Send both the original lexical signal and the English sense anchor to the model.
- Require strict JSON output keyed to each requested item, with validation and retry logic.
- Write only missing target-language gloss fields by default, preserving existing verified data.
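The process above can be sketched as follows. This is an illustration under stated assumptions, not our actual implementation: the field names (language, headword, gloss_en), the function names, and the call_model callable are hypothetical. Here call_model stands in for whatever model client is configured; it receives a dict of pending items keyed by translation key and returns the model's raw text reply.

```python
import hashlib
import json


def translation_key(source_lang: str, headword: str, english_gloss: str) -> str:
    """Deterministic key from source language + headword + English gloss."""
    # A unit-separator character keeps the three parts from running together.
    payload = "\x1f".join([source_lang, headword.strip(), english_gloss.strip()])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def parse_model_reply(raw: str, requested_keys: set) -> dict:
    """Accept only a JSON object mapping requested keys to non-empty strings."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {
        key: value.strip()
        for key, value in data.items()
        if key in requested_keys and isinstance(value, str) and value.strip()
    }


def translate_missing(entries, target_field, call_model, cache, max_attempts=3):
    """Fill only empty target fields; unresolved entries stay empty."""
    pending = {}
    for entry in entries:
        if entry.get(target_field):
            continue  # preserve existing verified data
        key = translation_key(entry["language"], entry["headword"], entry["gloss_en"])
        if key not in cache:
            # Duplicate glosses collapse onto one key, so each is sent once.
            pending[key] = {
                "language": entry["language"],
                "headword": entry["headword"],
                "gloss_en": entry["gloss_en"],
            }

    for _ in range(max_attempts):
        if not pending:
            break
        resolved = parse_model_reply(call_model(pending), set(pending))
        cache.update(resolved)
        for key in resolved:
            del pending[key]  # retry only what is still unresolved

    for entry in entries:
        if entry.get(target_field):
            continue
        key = translation_key(entry["language"], entry["headword"], entry["gloss_en"])
        if key in cache:
            entry[target_field] = cache[key]
    return entries
```

In this sketch an item that never validates within max_attempts is simply left without a target-language gloss, and the cache dict can be persisted between runs so repeated terms reuse the same translation.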
Why This Improves Trust
- We run translation with the highest-quality GPT-5-class model configured in our environment.
- Pairing the original word data with the English gloss reduces drift compared with translating directly from the source language into Spanish or Indonesian.
- Deterministic caching keeps repeated terms consistent across the dataset.
- Unresolved entries stay empty instead of being silently fabricated.
- Upstream source attribution stays attached to each dictionary record.
Important Caveat
AI-generated target-language glosses are high-trust research translations, not infallible final editions. For publication-critical contexts, they should be reviewed against source-language and English witnesses.