Dictionary Sources and Translation Method

This page explains where our dictionary data comes from, how we merge multiple open-source English datasets, and how we generate Spanish and Indonesian glossary entries when full source-language dictionaries do not exist.

Research status as of March 1, 2026.

Open-Source English Dictionary Base

Several open datasets provide English glosses for our source languages (Greek, Hebrew, Latin, and Syriac). We ingest and normalize them to form the core dictionary layer.

How We Merge Sources

We merge language-specific records into one canonical dictionary model while preserving provenance for each source witness and sense snapshot.

  • Normalize headwords, lemmas, transliteration, and Strong's identifiers where available.
  • Merge records by language and lexical identity with deterministic source-priority rules.
  • Keep per-source evidence in sourceRecords and senseRecords, not just a final flattened gloss.
  • Publish canonical JSONL under data/our_dictionaries for analysis and DynamoDB seeding.
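
The merge steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production pipeline: the record fields (language, headword, gloss, source), the SOURCE_PRIORITY names, and the helper functions are all hypothetical, and only sourceRecords is taken from the description above.

```python
import json
import unicodedata
from collections import defaultdict

# Hypothetical priority order: earlier names win when glosses conflict.
SOURCE_PRIORITY = ["source_a", "source_b", "source_c"]


def normalize_headword(text: str) -> str:
    """Trim whitespace and apply Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text.strip())


def merge_records(records):
    """Group records by (language, headword) and pick the gloss by priority."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["language"], normalize_headword(rec["headword"]))
        groups[key].append(rec)

    rank = {name: i for i, name in enumerate(SOURCE_PRIORITY)}
    merged = []
    for (language, headword), recs in sorted(groups.items()):
        # Stable sort: unknown sources rank last, ties keep input order.
        ordered = sorted(recs, key=lambda r: rank.get(r["source"], len(rank)))
        merged.append({
            "language": language,
            "headword": headword,
            "gloss": ordered[0]["gloss"],
            # Keep every source witness, not just the winning gloss.
            "sourceRecords": [
                {"source": r["source"], "gloss": r["gloss"]} for r in ordered
            ],
        })
    return merged


def to_jsonl(entries) -> str:
    """Serialize one entry per line, keeping non-ASCII headwords readable."""
    return "\n".join(
        json.dumps(e, ensure_ascii=False, sort_keys=True) for e in entries
    )
```

Sorting the groups and using a stable sort within each group means the same input always produces the same output, which is the property the source-priority rule depends on.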

Why Spanish and Indonesian Needed AI Generation

We did not find a complete open-source dictionary set that translates all of our source languages directly into Spanish and Indonesian at the depth this project needs.

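Our March 1, 2026 audit of Kaikki/Wiktionary extracts showed only partial ancient-language coverage. Counts below are listed by language code (grc = Ancient Greek, hbo = Ancient Hebrew, la = Latin, syc = Classical Syriac). The Spanish extract included a limited number of ancient-language entries (grc 258, hbo 14, la 6,923, syc 2), and the Indonesian extract was much smaller (grc 1, hbo 0, la 10, syc 0).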
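
A coverage count of this kind can be approximated with a short script. The sketch below assumes each extract is JSONL with one entry per line and a lang_code field on each entry; the function name and the in-memory input are illustrative, and this is not the exact audit script we ran.

```python
import json
from collections import Counter

# Language codes of interest, as used in the audit figures.
ANCIENT_CODES = ("grc", "hbo", "la", "syc")


def count_by_language(lines, codes=ANCIENT_CODES):
    """Count JSONL entries per language code, reporting zero for absent codes."""
    counts = Counter({code: 0 for code in codes})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        code = entry.get("lang_code")  # assumed field name
        if code in counts:
            counts[code] += 1
    return dict(counts)
```

To run it against a real extract, pass an open file handle as lines, since iterating a text file yields one line at a time.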

High-Trust AI Translation Process

  • Build a deterministic translation key from source language + original headword + English gloss.
  • Deduplicate and cache keys so repeated glosses are translated once and reused.
  • Send both the original lexical signal and the English sense anchor to the model.
  • Require strict JSON output keyed to each requested item, with validation and retry logic.
  • Write only missing target-language gloss fields by default, preserving existing verified data.
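
The steps above can be sketched as follows. Everything named here is an assumption made for illustration: call_model is a hypothetical callable that takes a mapping of keys to items and returns the model's raw text, the field names are placeholders, and the SHA-256 key is one reasonable way to get a deterministic key rather than a description of our exact implementation.

```python
import hashlib
import json


def translation_key(language: str, headword: str, english_gloss: str) -> str:
    """Hash the three inputs into a stable key; the separator avoids ambiguity."""
    payload = "\x1f".join([language, headword, english_gloss])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def parse_response(raw: str, requested_keys):
    """Accept only a JSON object with a non-empty string for every key."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    result = {}
    for key in requested_keys:
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
        result[key] = value.strip()
    return result


def entry_key(entry) -> str:
    return translation_key(entry["language"], entry["headword"], entry["gloss"])


def translate_missing(entries, target_field, call_model, cache, max_attempts=3):
    """Fill only empty target fields, translating each unique key once."""
    pending = {}
    for entry in entries:
        if entry.get(target_field):
            continue  # preserve existing verified data
        key = entry_key(entry)
        if key not in cache:
            # Send both the original lexical signal and the English anchor.
            pending[key] = {
                "language": entry["language"],
                "headword": entry["headword"],
                "gloss": entry["gloss"],
            }
    if pending:
        for _ in range(max_attempts):
            parsed = parse_response(call_model(pending), list(pending))
            if parsed is not None:
                cache.update(parsed)
                break  # stop retrying once a response validates
    for entry in entries:
        if not entry.get(target_field) and entry_key(entry) in cache:
            entry[target_field] = cache[entry_key(entry)]
    return entries
```

If every attempt fails validation, nothing is written to the cache and the target field stays empty, so a failed batch cannot leak an unvalidated gloss into the dataset.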

Why This Improves Trust

  • We run translation with the highest-quality GPT-5-class model configured in our environment.
  • Supplying the original word data together with the English gloss reduces semantic drift compared with translating directly from the source language into Spanish or Indonesian.
  • Deterministic caching keeps repeated terms consistent across the dataset.
  • Unresolved entries stay empty instead of being silently fabricated.
  • Upstream source attribution stays attached to each dictionary record.

Important Caveat

AI-generated target-language glosses are high-trust research translations, not infallible final editions. For publication-critical contexts, they should be reviewed against source-language and English witnesses.
