wiktra2

Wiktra: transliteration tool using Wiktionary transliteration modules. Version 2 (fork)

Wiktra - Transliteration tool using Wiktionary transliteration modules

Wiktra is a Unicode transliteration tool, written in Python. It’s available as the wiktrapy CLI app and the wiktra Python 3 module.

Internally, it uses transliteration modules from Wiktionary, which are written in Lua by Wiktionary's linguists and developers. As a result, Wiktra aims to offer high-quality rule-based transliteration.

Wiktra 1.0 was originally developed by Khuyagbaatar Batsuren. Wiktra 2 was rewritten by Adam Twardoch.

Wiktra 2 supports 514 languages in 102 scripts with the new API (nearly all of the languages supported by Wiktionary, except Korean, Japanese and Thai), and 181 languages with their 60 orthographies in the legacy API.

Installation

Version 2

Version 2 has been tested on macOS 11. It does not currently work on Ubuntu or Windows 10; we are testing those installations and fixing the bugs.

First installation

Download and unzip the current repo content. Then, in Terminal, cd to the main folder and run:

./install-mac.sh
python3 -m pip install --upgrade .

This will install brew if needed, then install lua, luarocks, lua-format, luajit and python3. Finally, it installs some Python dependencies, such as lupa and pywikiapi.
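After the installer finishes, it can be handy to confirm that the tools it is described as installing are actually reachable. The following is a minimal sketch, not part of Wiktra: the helper name find_missing_tools is hypothetical, and the executable names are assumed to match the package names listed above.

```python
import shutil

# Executable names are assumed to match the tools listed above;
# adjust them if your system installs them under different names.
REQUIRED_TOOLS = ["lua", "luarocks", "lua-format", "luajit", "python3"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All expected tools were found on PATH.")
```

If anything is reported as missing, rerunning ./install-mac.sh or installing that tool manually would be the natural next step.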

Updates

python3 -m pip install --upgrade git+https://github.com/kbatsuren/wiktra/

Other systems, version 1

These instructions come from the original version 1. The version 2 instructions above may well work instead.

Although you may prefer another Python version, Python 3.5 is recommended because the module uses lupa-1.8. Lupa lets Python run code written in Lua, the language in which most of the transliteration modules are written.

The following commands prepare a Python environment via Anaconda, which fixes the Python version and the module dependencies:

First:

$ pip install lupa
$ conda create -n scr2scr_env python=3.5

Second:

$ conda activate scr2scr_env

Start your Python (3.5.x):

$ python

Troubleshooting

This should no longer be an issue with version 2.

If you get LuaError: module 'wikt.mw' not found, try:

Usage

Command-line, version 2

$ wiktrapy -h

usage: wiktrapy [-h] [-t TEXT] [-i FILE] [-l LANG] [-s SCRIPT] [-o SCRIPT] [-x] [--stats] [-v] [-V]

optional arguments:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT
  -i FILE, --input FILE
  -l LANG, --lang LANG  Input language as ISO 639-2 code
  -s SCRIPT, --script SCRIPT
                        Input script as ISO 15924 code
  -o SCRIPT, --to-script SCRIPT
                        Output script as ISO 15924 code
  -x, --explicit        Explicit language/script, no fuzzy matching
  --stats               List supported scripts and orthographies
  -v, --verbose         -v show progress, -vv show debug
  -V, --version         show version and exit

Example:

$ wiktrapy -t "Привет" -l ru -s Cyrl
Privet

Or from stdin / via piping:

$ echo Привет | wiktrapy
Privet
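The same command line can be driven from a script. The sketch below only uses the options shown in the help text above (-t, -l, -s, -o, -x); the helper names build_wiktrapy_command and run_wiktrapy are hypothetical and not part of Wiktra.

```python
import shutil
import subprocess


def build_wiktrapy_command(text, lang=None, script=None, to_script=None, explicit=False):
    """Build an argument list for wiktrapy from the options in its help text."""
    cmd = ["wiktrapy", "-t", text]
    if lang:
        cmd += ["-l", lang]
    if script:
        cmd += ["-s", script]
    if to_script:
        cmd += ["-o", to_script]
    if explicit:
        cmd.append("-x")
    return cmd


def run_wiktrapy(text, **options):
    """Run wiktrapy and return its output, or None if it is not on PATH."""
    if shutil.which("wiktrapy") is None:
        return None
    result = subprocess.run(
        build_wiktrapy_command(text, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Show the command that corresponds to the first example above.
print(build_wiktrapy_command("Привет", lang="ru", script="Cyrl"))
```

Calling run_wiktrapy("Привет", lang="ru", script="Cyrl") should then behave like the first shell example, provided wiktrapy is installed. Passing the arguments as a list avoids shell quoting problems with non-ASCII input.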

Python, version 2 new API

from wiktra.Wiktra import Transliterator
tr = Transliterator()

print(tr.tr("Привет", lang='ru', sc='Cyrl', to_sc='Latn', explicit=True))

or less efficiently:

from wiktra import tr
print(tr("Привет"))

Use wiktrapy --stats to list all supported script and language codes, or see the data.yaml. The YAML file also lists the Wiktionary transliteration modules used.
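When many strings need to be transliterated, creating one Transliterator and reusing it is presumably what makes the first form above the more efficient one. The following is a minimal sketch under that assumption: transliterate_all is a hypothetical helper, and the import is guarded so the snippet still runs where wiktra is not installed.

```python
def transliterate_all(texts, translit, **options):
    """Apply one transliteration callable to every string in `texts`."""
    return [translit(text, **options) for text in texts]


try:
    from wiktra.Wiktra import Transliterator
except ImportError:
    Transliterator = None  # wiktra is not installed in this environment

if Transliterator is not None:
    tr = Transliterator()  # created once, then reused for every word
    words = ["Привет", "мир"]
    print(transliterate_all(words, tr.tr, lang="ru", sc="Cyrl", to_sc="Latn"))
```

Because the helper accepts any callable, the legacy translite function could be passed in the same way.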

Python, legacy translite function

from wiktra.Wiktra import translite as tr

With the function translite, you need to provide the text and the lang code (see table below for reference):

#mongolian script
tr('монгол бичлэг', 'mon')
> mongol bichleg

#devanagari script
tr('हिंदी लिपि', 'hin')
> hindee lipi

Supported scripts and languages

CLI tool and new Python API

Legacy translite function

No.  Language  ISO 639-3 code in use  Wiktionary code  Supporting script  Examples
1 Abaza abq abq Cyrillic  
2 Abkhazian abk ab Cyrillic  
3 Adyghe ady ady Cyrillic  
4 Ahom aho aho Ahom tests
5 Ainu ain ain Kana tests
6 Altai, Southern alt altai Cyrillic  
7 Amharic amh am Ethiopic tests
8 Ancient Greek grc-c grc Cypriot script  
9 Ancient Greek grc grc Greek (Polytonic) tests
10 Ancient Macedonian xmk xmk Greek (Polytonic) tests
11 Arabic ara ar Arab tests
12 Ardhamagadhi Prakrit pka pka Brahmi  
13 Armenian arm armn Armn  
14 Ashokan Prakrit inc-ash inc-ash Brahmi  
15 Ashokan Prakrit inc-ash-k inc-ash Kharoshthi script tests
16 Assamese asm as as-Beng  
17 Avaric ava av Cyrillic  
18 Avestan ave avst Avst  
19 Awadhi awa awa Devanagari  
20 Bactrian xbc xbc Grek  
21 Bagheli bfy bfy Devanagari  
22 Bashkir bak ba Cyrillic  
23 Bats bbl bbl Georgian script tests
24 Belarusian bel be Cyrillic  
25 Bengali ben bn Beng tests
26 Berber* ber ber Tfng  
27 Bhadrawahi bhd bhd Devanagari  
28 Bhojpuri bho bho Devanagari  
29 Bilaspuri kfs kfs Devanagari  
30 Blin byn byn Ethiopic tests
31 Braj bra bra Devanagari  
32 Budukh bud bdk Cyrillic  
33 Bulgarian bul bg Cyrillic  
34 Bundeli bns bns Devanagari  
35 Burmese mya my Burmese  
36 Buryat bua bua Cyrillic  
37 Canadian syllabics cans cans Canadian syllabics  
38 Cappadocian Greek cpg cpg Greek (Polytonic) tests
39 Chaha sem-cha sem-cha Ethiopic tests
40 Chambeali cdh cdh Devanagari  
41 Chechen che ce Cyrillic  
42 Cherokee chr cher Cher  
43 Churahi cdj cdj Devanagari  
44 Church Slavic chu cv Cyrillic  
45 Coptic cop copt Copt  
46 Cree cre cr Cans  
47 Dargwa dar dar Cyrillic  
48 Dhivehi div dv Thaa  
49 Dogri doi-d doi Devanagari  
50 Dolgan dlg dlg Cyrillic  
51 Doteli dty dty Devanagari  
52 Dungan dng dng Cyrillic  
53 Eastern Mari chm chm Cyrillic  
54 Erzya myv myv Cyrillic  
55 Even eve eve Cyrillic  
56 Evenki evn evn Cyrillic  
57 Gaddi gbk gbk Devanagari  
58 Gandhari pgd-k pgd Kharoshthi script tests
59 Garhwali gbm gbm Devanagari  
60 Ge’ez gez gez Ethiopic tests
61 Georgian geo geor Georgian script tests
62 Gothic got goth Gothic script  
63 Gujarati guj gu Gujarati tests
64 Hadrami xhd xhd Old South Arabian script  
65 Harami xha xha Old South Arabian script  
66 Harari har har Ethiopic tests
67 Haryanvi bgc bgc Devanagari  
68 Hebrew heb he Hebrew  
69 Hindi hin hi Devanagari tests
70 Ingush inh inh Cyrillic  
71 Inuktitut iku iu Cans  
72 Javanese jav jv Javanese  
73 Kabardian kbd kbd Cyrillic  
74 Kachchi kfr kfr Gujarati tests
75 Kalmyk xal xal Cyrillic tests
76 Kangri xnr xnr Devanagari  
77 Kannada kan kn Knda  
78 Karachay-Balkar krc krc Cyrillic  
79 Karaim kdr kdr Cyrillic  
80 Karakalpak kaa kaa Cyrillic  
81 Kashmiri kas ks Kashmiri Arabic  
81 Kashmiri kas-d ks Kashmiri Devanagari  
82 Kazakh kaz kk Cyrillic  
83 Khakas kjh kjh Cyrillic  
84 Khanty kca kca Cyrillic  
85 Khinalugh kjj kjj Cyrillic  
86 Khmer khm km Khmer script  
87 Khotanese kho kho Brahmi  
88 Kildin Sami sjd sjd Cyrillic  
89 Kipchak qwm qwm Armn  
90 Komi-Permyak koi koi Cyrillic  
91 Komi-Zyrian kpv kpv Cyrillic  
92 Konkani kok kok Devanagari  
93 Korean kor ko Kore tests
94 Kullu Pahari kfx kfx Devanagari  
95 Kumyk kum kum Cyrillic  
96 Kyrgyz kir ky Cyrillic  
97 Lak lbe lbe Cyrillic  
98 Lao lao lo Laoo  
99 Latin-to-Tamil tam en-ta Latn  
100 Laz lzz lzz Georgian script tests
101 Lepcha lep lep Lepcha  
102 Lezgi lez lez Cyrillic  
103 Limbu lif lif Limbu  
104 Lü khb khb New Tai Lue  
105 Lycian xlc xlc Lycian  
106 Lydian xld xld Lydi  
107 Macedonian mkd mk Cyrillic  
108 Magadhi Prakrit inc-mgd inc-mgd Brahmi  
109 Maharastri Prakrit pmh pmh Brahmi  
110 Mahasu Pahari bfz bfz Devanagari  
111 Malayalam mal ml Mlym  
112 Mandeali mjl mjl Devanagari  
113 Mansi mns mns Cyrillic  
114 Marathi mar mr Devanagari  
115 Marwari mwr mwr Devanagari  
116 Mewari mtr mtr Devanagari  
117 Middle Assamese inc-mas inc-mas Assamese  
118 Middle Persian pal-m pal Manichaean script  
119 Middle Persian pal-p pal Phli  
120 Minaean inm inm Old South Arabian script  
121 Mingrelian xmf xmf Georgian script tests
122 Moksha mdf mdf Cyrillic  
123 Mongolian mon mon Cyrillic  
124 Mundari unr unr Devanagari  
125 Mycenaean Greek gmy gmy Linear B script  
126 Naskapi nsk nsk Canadian syllabics  
127 Nepali nep ne Devanagari  
128 Newar new new Devanagari  
129 Nivkh niv niv Cyrillic  
130 Nogai nog nog Cyrillic  
131 Northern Kurdish kmr kmr Cyrillic  
132 Northern Yukaghir ykg ykg Cyrillic  
133 Old Church Slavonic chu-old-c cu Old Cyrillic alphabets  
134 Old Church Slavonic chu-old-g cu Glagolitic alphabets  
135 Old East Slavic orv orv Old Cyrillic alphabets  
136 Old Georgian oge oge Georgian script tests
137 Old Hindi inc-ohi inc-ohi Devanagari tests
138 Old Italic Old Italic script      
139 Old Marathi omr omr Devanagari  
140 Old Novgorodian zle-ono-c zle-ono Old Cyrillic alphabets  
141 Old Novgorodian zle-ono-g zle-ono Glagolitic alphabets  
142 Old Ossetic oos oos Greek (Polytonic) tests
143 Old Persian peo peo Old Persian  
144 Old Tamil oty oty Brahmi  
145 Oriya ori or Oriya  
146 Ossetian oss os Cyrillic  
147 Paeonian ine-pae ine-pae Greek (Polytonic) tests
148 Paisaci Prakrit inc-psc inc-psc Brahmi  
149 Palya Bareli bpx bpx Devanagari  
150 Pangwali pgg pgg Devanagari  
151 Parthian xpr xpr Manichaean script  
152 Parthian xpr xpr Parthian  
153 Persian fas fa fa-Arab tests
154 Phrygian xpg xpg Greek (Polytonic) tests
155 Pontic Greek pnt pnt Greek (Polytonic) tests
156 Punjabi pan pal Guru tests
157 Qatabanian xqt xqt Old South Arabian script  
158 Russian rus ru Cyrillic tests
159 Rusyn rue rue Cyrillic  
160 Sabaean xsa xsa Old South Arabian script  
161 Sambalpuri spv spv Oriya  
162 Sanskrit san sa Devanagari  
163 Santali sat sat Ol Chiki  
164 Sauraseni Prakrit psu psu Brahmi  
165 Sichuan Yi iii ii Yi script  
166 Sinhalese sin si Sinh  
167 Sogdian sog sog Manichaean script  
168 Tajik tgk tg Cyrillic  
169 Takka Apabhramsa inc-tak inc-tak Devanagari  
170 Tamil tam ta Tamil tests
171 Tatar tat tt Cyrillic  
172 Telugu tel te Telu tests
173 Thai* tha th Thai tests
174 Thracian txh txh Greek tests
175 Tibetan bod bo Tibetan tests
176 Tigre tig tig Ethiopic tests
177 Tigrinya tir ti Ethiopic tests
178 Tuvan tyv tyv Cyrillic  
179 Udi udi udi Cyrillic  
180 Udi udi udi Georgian script tests
181 Udmurt udm udm Cyrillic  
182 Ukrainian ukr uk Cyrillic tests
183 Urdu urd ur Urdu Arabic  
184 Uyghur uig ug Uyghur Arabic tests
185 Vaghri vgr vgr Gujarati tests
186 Vracada Apabhramsa inc-vra inc-vra Devanagari  
187 Wakhi wbl wbl Cyrillic  
188 Yagnobi yai yai Cyrillic  
189 Yakut sah sah Cyrillic  
190 Modern Greek (new) ell el Greek tests

Updating

This tool can update its stored Wiktionary modules. See wiktrapy_update -h for details.

License

This tool is available under the GPLv2 license.