Wiktra: transliteration tool using Wiktionary transliteration modules. Version 2 (fork)
Wiktra is a versatile Unicode transliteration tool that brings the linguistic precision of Wiktionary’s community-curated transliteration modules to your command line and Python projects. It allows you to convert text from one writing system (script) to another with a high degree of accuracy.
Project Locations:
At its core, Wiktra transliterates text. This means it converts characters or words from one script (e.g., Cyrillic, Arabic, Devanagari) into another (e.g., Latin script). Unlike simple character-by-character replacement, Wiktra utilizes sophisticated rule-based transliteration modules written in Lua, developed and maintained by linguists and contributors on Wiktionary. These modules understand the nuances of how languages are written, leading to more accurate and contextually appropriate results.
Wiktra provides:
- wiktrapy: a command-line tool
- wiktra: a Python module
Wiktra 1.0 was originally developed by Khuyagbaatar Batsuren. Wiktra 2 was significantly rewritten by Adam Twardoch.
Wiktra is designed for a diverse range of users:
Run wiktrapy --stats for a current list of supported scripts and orthographies.
Wiktra requires Python 3.9+ and Lua (LuaJIT is recommended for performance with the lupa bridge).
General Installation (using pip):
The primary way to install Wiktra is via pip:
python3 -m pip install wiktra
This will attempt to install Wiktra and its Python dependencies, including lupa, which bridges Python and Lua. The lupa installation might require Lua development headers to be present on your system.
macOS:
For macOS, a convenience script install-mac.sh (available in the source repository) can help install prerequisites like Lua via Homebrew:
./install-mac.sh
# If installing from a local clone after running the script:
python3 -m pip install --upgrade .
# Or to get the latest from PyPI:
python3 -m pip install --upgrade wiktra
Linux (Debian/Ubuntu Example):
You’ll need to install Python 3, pip, and Lua development files.
sudo apt update
sudo apt install python3 python3-pip liblua5.1-0-dev luajit
# For lupa, LuaJIT (libluajit-5.1-dev) is often preferred over standard Lua dev packages.
# Depending on your distribution and lupa version, you might need different Lua versions like lua5.3-dev etc.
python3 -m pip install wiktra
Windows:
Installation on Windows can be more complex due to lupa compilation.
- lupa typically requires a C compiler (such as the Microsoft C++ Build Tools, often installed with Visual Studio) and Lua (e.g., compiled from source, or installed as Lua/LuaJIT through a package manager such as Scoop or Chocolatey).
- Check whether prebuilt lupa wheels are available on PyPI for your Python version and architecture. If not, you need to set up the build environment manually.
- Once lupa can be installed (i.e., its prerequisites are met), Wiktra can be installed via pip:
pip install wiktra
Note: The original README mentioned that version 2 had not been working well on Ubuntu and Windows 10 at one point. While efforts are made to ensure cross-platform compatibility, installing lupa correctly is often the main hurdle. Refer to the lupa documentation for specific guidance on its installation.
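Because the Python version and lupa are the two stated prerequisites, a quick pre-flight check can save time before debugging further. The following is only a minimal sketch: the function name check_wiktra_prerequisites is hypothetical and not part of Wiktra, and it only checks that lupa can be located, not that it was built against a working Lua.

```python
import importlib.util
import sys


def check_wiktra_prerequisites(min_python=(3, 9)):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            "Python %d.%d+ is required, found %d.%d"
            % (min_python + tuple(sys.version_info[:2]))
        )
    # find_spec only checks that the package can be located; it does not import it.
    if importlib.util.find_spec("lupa") is None:
        problems.append("lupa is not installed (try: python3 -m pip install lupa)")
    return problems


if __name__ == "__main__":
    issues = check_wiktra_prerequisites()
    print("\n".join(issues) if issues else "Python version and lupa look fine.")
```

Running this inside the virtual environment you intend to use for Wiktra shows whether that specific environment has lupa.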
Troubleshooting Installation:
- LuaError: module 'wikt.mw' not found or similar Lua errors: This typically means the Lua runtime cannot find the Wiktionary modules. Wiktra attempts to set the LUA_PATH environment variable correctly at runtime. If issues persist, there may be a problem with how lupa locates Lua files, or the installation may be incomplete.
- lupa installation issues: These are common. Ensure you have a C compiler and the correct Lua (or LuaJIT) development libraries (headers) installed. Consult lupa’s documentation and open issues for platform-specific advice. Using virtual environments (e.g., venv) is highly recommended.
Wiktra offers two main ways to perform transliterations:
Command-line tool (wiktrapy):
The wiktrapy tool is perfect for quick transliterations or use in shell scripts.
Basic syntax:
wiktrapy [options] -t "Your text here"
# or pipe text into it
echo "Your text here" | wiktrapy [options]
Examples:
wiktrapy -t "Привет"
# Expected Output: Privet
echo "नमस्ते" | wiktrapy
# Expected Output: namaste
wiktrapy -t "Привет" -l ru -s Cyrl
# Expected Output: Privet
Here, -l ru specifies Russian and -s Cyrl specifies Cyrillic script.
# This example assumes a module exists for English (Latn) to Cyrillic (Cyrl)
# wiktrapy -t "Hello" -l en -s Latn -o Cyrl
The default output script is Latn (Latin).
wiktrapy --stats
wiktrapy -h
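If a script needs to call the command-line tool rather than import the Python module, the invocation can be wrapped in a small helper. This is a sketch under assumptions: the helper names build_wiktrapy_command and run_wiktrapy are hypothetical, and only the flags shown in the examples (-t, -l, -s, -o) are used.

```python
import shutil
import subprocess


def build_wiktrapy_command(text, lang=None, script=None, out_script=None):
    """Build the argument list for one wiktrapy call."""
    cmd = ["wiktrapy", "-t", text]
    if lang:
        cmd += ["-l", lang]
    if script:
        cmd += ["-s", script]
    if out_script:
        cmd += ["-o", out_script]
    return cmd


def run_wiktrapy(text, **options):
    """Run wiktrapy and return its standard output without the trailing newline."""
    if shutil.which("wiktrapy") is None:
        raise FileNotFoundError("wiktrapy is not on PATH; is Wiktra installed?")
    result = subprocess.run(
        build_wiktrapy_command(text, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with non-ASCII input. Each call starts a new process, so for many strings the Python module is likely the better choice.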
Python module (wiktra):
For more programmatic control, use the wiktra Python module. The recommended way is to use the Transliterator class.
Example (New API - Recommended):
from wiktra.Wiktra import Transliterator
# Create a Transliterator instance
# This is best done once if you're doing multiple transliterations
tr = Transliterator()
# Transliterate text with automatic language/script detection
# (will try to guess input script and use 'und' - undefined language for that script)
text_cyrillic = "Привет мир"
latin_text = tr.tr(text_cyrillic)
print(f"'{text_cyrillic}' -> '{latin_text}'")
# Expected Output: 'Привет мир' -> 'Privet mir'
text_devanagari = "नमस्ते दुनिया"
latin_text_dev = tr.tr(text_devanagari)
print(f"'{text_devanagari}' -> '{latin_text_dev}'")
# Expected Output: 'नमस्ते दुनिया' -> 'namaste duniyaa'
# Explicitly specify language, input script, and output script
text_russian = "Русский текст"
# lang='ru' (Russian), sc='Cyrl' (Cyrillic), to_sc='Latn' (Latin)
transliterated_explicit = tr.tr(text_russian, lang='ru', sc='Cyrl', to_sc='Latn', explicit=True)
print(f"'{text_russian}' (explicit) -> '{transliterated_explicit}'")
# Expected Output: 'Русский текст' (explicit) -> 'Russkij tekst'
# Using the class instance is more efficient for multiple transliterations
# as the Lua runtime and modules are initialized only once.
- With explicit=True, you must provide lang (input language code, e.g., ISO 639) and sc (input script code, e.g., ISO 15924). to_sc (output script code) defaults to Latn if not specified.
- With explicit=False (the default), Wiktra attempts to guess the input script if sc is not provided. It then typically assumes an “undefined” (und) language for that script, unless lang is also provided.
Legacy Function (translite):
A legacy translite function is also available, primarily for compatibility with older versions of Wiktra or specific use cases that relied on its distinct language code mapping.
from wiktra.Wiktra import translite as tr_legacy
# Example for Mongolian (Cyrillic) using its legacy code 'mon'
mongolian_text = "монгол бичлэг"
transliterated_mongolian = tr_legacy(mongolian_text, 'mon')
print(f"'{mongolian_text}' (legacy) -> '{transliterated_mongolian}'")
# Expected Output: 'монгол бичлэг' (legacy) -> 'mongol bichleg'
It is generally recommended to use the new Transliterator.tr() method for its more standardized approach to language/script codes and broader capabilities.
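Since a single Transliterator instance is meant to be reused, batch work can go through one small helper. The sketch below is illustrative only: transliterate_many is a hypothetical name, and the keyword options are simply forwarded to whatever callable is used (by default Transliterator().tr, with the parameters shown in the example above).

```python
def transliterate_many(texts, tr=None, **options):
    """Transliterate an iterable of strings with a single transliteration callable."""
    if tr is None:
        # Imported lazily so the helper can be exercised without Wiktra installed.
        from wiktra.Wiktra import Transliterator
        tr = Transliterator().tr
    return [tr(text, **options) for text in texts]
```

For example, transliterate_many(["Привет", "мир"], lang="ru", sc="Cyrl", explicit=True) would create the Lua runtime once and reuse it for both strings. Accepting the callable as a parameter also makes it easy to substitute a stub in tests.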
Wiktra can update its local collection of Wiktionary transliteration modules using the wiktrapy_update command:
wiktrapy_update -h # For options
wiktrapy_update
This helps keep your transliterations aligned with the latest rules from Wiktionary.
This section provides a deeper insight into Wiktra’s architecture, its core components, and guidelines for coding and contributing.
Wiktra’s ability to perform complex transliterations stems from its use of Lua modules sourced directly from Wiktionary, executed within a Python environment.
1. Python-Lua Integration via lupa:
The core of Wiktra’s cross-language functionality is the lupa library. lupa provides a bridge between Python and Lua (it is designed primarily for LuaJIT but also works with standard Lua), allowing Python code to load and run Lua code and receive the results.
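To make the bridge concrete, here is a minimal sketch of calling a Lua function from Python with lupa. It is not taken from Wiktra's code; lua_upper is a hypothetical helper, and the import is guarded so the snippet degrades gracefully when lupa is not installed.

```python
try:
    from lupa import LuaRuntime
except ImportError:  # lupa is optional for this sketch
    LuaRuntime = None


def lua_upper(text):
    """Upper-case ASCII text inside Lua; return None when lupa is unavailable."""
    if LuaRuntime is None:
        return None
    lua = LuaRuntime(unpack_returned_tuples=True)
    # lua.eval returns a Lua function object that is callable from Python.
    upper = lua.eval("function(s) return string.upper(s) end")
    return upper(text)


print(lua_upper("hello"))
```

The same pattern (evaluate Lua code once, then call the resulting function from Python) is what makes it possible to run Lua transliteration modules from a Python program.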
2. The Transliterator Class (wiktra/Wiktra.py):
This is the central class orchestrating transliteration.
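As rough orientation, the flow this class is described as performing (guess the input script, look up a module, build the Lua call) can be sketched in plain Python. Everything here is a simplified assumption rather than Wiktra's implementation: SCRIPT_RANGES is a tiny stand-in for a real Unicode script lookup, MOD_MAP is a hypothetical excerpt shaped like script -> language -> output script -> module name, and the helper names are invented for illustration.

```python
from collections import Counter

# Hypothetical, heavily reduced stand-ins for real data.
SCRIPT_RANGES = {
    "Cyrl": (0x0400, 0x04FF),
    "Arab": (0x0600, 0x06FF),
    "Deva": (0x0900, 0x097F),
}
MOD_MAP = {
    "Cyrl": {"ru": {"Latn": "ru-translit"}},
    "Arab": {"ar": {"Latn": "ar-translit"}},
}


def guess_script(text):
    """Return the most frequent known script among the characters, or None."""
    counts = Counter()
    for ch in text:
        for script, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= ord(ch) <= hi:
                counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None


def find_module(lang, sc, to_sc="Latn"):
    """Look up the module name; raises KeyError if the combination is unknown."""
    return MOD_MAP[sc][lang][to_sc]


def lua_quote(value):
    """Quote a Python string as a Lua double-quoted string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def build_lua_call(text, lang, sc, to_sc="Latn"):
    """Build the Lua expression that would invoke the module's tr function."""
    module = find_module(lang, sc, to_sc)
    return 'require("wikt.translit.%s").tr(%s, %s, %s)' % (
        module, lua_quote(text), lua_quote(lang), lua_quote(sc))
```

Escaping the text before embedding it in Lua source matters because input may contain quotes or backslashes. The real class may pass arguments differently.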
- Initialization (__init__): When a Transliterator object is created, it:
  - Creates a LuaRuntime instance from lupa.
  - Loads the module mappings from wiktra/wikt/data/data.json. This JSON file is key to identifying the correct Lua module for a given transliteration request.
  - Sets the LUA_PATH environment variable to ensure Lua can locate the necessary modules within the wiktra/wikt/ directory structure.
- Transliteration (tr): This is the primary method for the new API.
  - If explicit=True, it directly uses the provided lang (language), sc (input script), and to_sc (output script).
  - If explicit=False (default), it calls auto_script_lang to deduce the input script and language if they are not fully specified.
- auto_script_lang Method: This helper method determines the script of the input text using fontTools.unicodedata.script() if no script is provided. It then uses langcodes.closest_match() to find the best matching language/script combination available in Wiktra’s supported list (derived from data.json) against the (partially) specified or detected input.
- Lua execution: The method looks up the name of the Lua module (e.g., ru-translit, ar-translit) in the mod_map (loaded from data.json). It then constructs and executes a Lua command like require("wikt.translit.MODULE_NAME").tr("text_to_transliterate", "lang_code", "script_code"). The result from the Lua function is then returned to Python.
- Legacy method (tr_legacy): This method supports the older translite function’s interface. It uses an internal, hardcoded lang_map (defined in Wiktra.py) to map legacy language codes to Wiktionary module names and script codes before invoking the Lua modules.
3. Wiktionary Lua Modules (wiktra/wikt/):
This directory and its subdirectories contain the Lua code and data sourced from Wiktionary.
- Transliteration modules (wiktra/wikt/translit/): These are individual Lua files (e.g., ru-translit.lua, ar-translit.lua) containing the transliteration rules for specific languages or scripts. Each module typically exposes a tr(text, lang_code, script_code) function that Wiktra calls.
- MediaWiki environment modules (wiktra/wikt/mw/, wiktra/wikt/mw-*.lua, etc.): These are supporting Lua modules, also from Wiktionary, providing common functionality (Unicode string manipulation via mw.ustring, message handling, site utilities) that the transliteration modules often depend on. mwInit.lua likely initializes this MediaWiki-like Lua environment for the modules.
- Data files (wiktra/wikt/data/):
  - data.json: The primary mapping used by Transliterator to find the correct Lua module for a new API request (maps script -> language -> output_script -> module_info).
  - data.yaml: A human-readable YAML version of the module mappings, also listing the Wiktionary transliteration modules used. Useful for reference and potentially for generating data.json.
  - Other data files (e.g., in translit/data/ or language-specific data modules) are used directly by the Lua modules themselves.
  - make-data-lang.lua appears to be a script used in the process of generating or updating language data mappings.
4. CLI Entry Point (wiktra/__main__.py):
This script powers the wiktrapy command-line tool.
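The general shape of such an entry point can be sketched with argparse. This is an illustrative sketch, not the contents of wiktra/__main__.py: the short flags match those used in this README, while the long option names, the prog name, and the order in which input sources are tried are assumptions.

```python
import argparse
import sys


def build_parser():
    """Create a parser with the short flags mentioned in this README."""
    parser = argparse.ArgumentParser(prog="wiktrapy-sketch")
    parser.add_argument("-t", "--text", help="text to transliterate")
    parser.add_argument("-i", "--input", help="read text from this file")
    parser.add_argument("-l", "--lang", help="input language code")
    parser.add_argument("-s", "--script", help="input script code")
    parser.add_argument("-o", "--out-script", default="Latn", help="output script code")
    parser.add_argument("--stats", action="store_true", help="list supported scripts")
    return parser


def read_input(args, stdin=sys.stdin):
    """Resolve the text source: -t first, then -i, then standard input."""
    if args.text is not None:
        return args.text
    if args.input is not None:
        with open(args.input, encoding="utf-8") as handle:
            return handle.read()
    return stdin.read()
```

Keeping input resolution in its own function makes the fallback to standard input easy to test without spawning a process.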
- It uses Python’s argparse module to define and parse command-line arguments (e.g., -t for text, -l for language, -s for script).
- It reads the input text from the -t argument, an input file specified by -i, or directly from stdin if no text source is given.
- It instantiates the Transliterator class from wiktra.Wiktra.
- It calls the tr() method of the Transliterator instance with the processed arguments and prints the transliterated result to standard output.
- It supports the --stats option to display a list of supported scripts and orthographies by querying the Transliterator instance.
5. Module Update Mechanism (wiktra/update.py and wiktrapy_update):
Wiktra includes a built-in mechanism to update its local cache of Wiktionary Lua modules and associated data.
The wiktrapy_update console script (which calls main in wiktra.update) manages this process (see update.py). This functionality is crucial for keeping Wiktra’s transliteration capabilities current with the ongoing improvements made by the Wiktionary community.
We welcome contributions to Wiktra! Here are some guidelines:
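Coding Style:
Dependencies:
- Python dependencies are listed in requirements.txt and declared in setup.py.
- A virtual environment (e.g., python -m venv .venv) is strongly recommended for development.
- The Lua modules sourced from Wiktionary live in the wiktra/wikt/ directory.
Testing:
- Wiktionary transliteration modules often have test-case pages (e.g., Module:xx-translit/testcases). These can serve as excellent references for expected behavior.
Contribution Process:
- Use a descriptive branch name (e.g., feature/add-georgian-translit or fix/unicode-error-arabic).
Reporting Issues:
- Include your Wiktra version (wiktrapy -V).
License: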
Wiktra is distributed under the GPLv2 (GNU General Public License version 2). All contributions to the project are also expected to be made under this license.
This README aims to be a comprehensive guide for both users and developers of Wiktra. For further details, exploring the source code and the linked Wiktionary resources is encouraged.