any_ascii
cargo install any_ascii

Unicode to ASCII transliteration - C Go Java JS Julia PHP Python Ruby Rust Shell .NET

by anyascii

Version: 0.1.7
License: ISC

AnyAscii

Unicode to ASCII transliteration

Web Demo

Description

Converts Unicode text to a reasonable representation using only ASCII.

For most characters in Unicode, AnyAscii provides an ASCII-only replacement string. Text is converted character by character, without considering context. The mappings for each script are based on popular existing romanization schemes. Symbolic characters are converted based on their meaning or appearance. All ASCII characters in the input are left unchanged; every other character is replaced with printable ASCII characters. Unknown characters are removed.
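The rules above can be sketched in a few lines of Python. This is only a toy illustration: `TOY_TABLE` and `toy_transliterate` are hypothetical names, and the handful of mappings is hand-picked for the example rather than taken from AnyAscii's actual data.

```python
# Toy illustration of context-free, character-by-character conversion.
# TOY_TABLE is a hypothetical, hand-picked mapping, not AnyAscii's real table.
TOY_TABLE = {
    "é": "e",
    "ç": "c",
    "ô": "o",
    "ö": "o",
    "ß": "ss",
    "æ": "ae",
    "ø": "o",
    "☆": "*",
}


def toy_transliterate(text: str) -> str:
    out = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII characters pass through unchanged.
            out.append(ch)
        else:
            # Known characters get their replacement string;
            # unknown characters are removed (empty replacement).
            out.append(TOY_TABLE.get(ch, ""))
    return "".join(out)


print(toy_transliterate("René François Lacôte"))  # Rene Francois Lacote
print(toy_transliterate("Blöße"))                 # Blosse
```

Because each character is looked up on its own, one input character may expand to several output characters (`ß` becomes `ss`), and the result never depends on neighboring characters.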

Examples

Representative examples for different languages comparing the AnyAscii output to the conventional romanization.

Language (Script) | Input | Output | Conventional
French (Latin) | René François Lacôte | Rene Francois Lacote | Rene Francois Lacote
German (Latin) | Blöße | Blosse | Bloesse
Vietnamese (Latin) | Trần Hưng Đạo | Tran Hung Dao | Tran Hung Dao
Norwegian (Latin) | Nærøy | Naeroy | Naroy
Ancient Greek (Greek) | Φειδιππίδης | Feidippidis | Pheidippides
Modern Greek (Greek) | Δημήτρης Φωτόπουλος | Dimitris Fotopoylos | Dimitris Fotopoulos
Russian (Cyrillic) | Борис Николаевич Ельцин | Boris Nikolaevich El'tsin | Boris Nikolayevich Yeltsin
Ukrainian (Cyrillic) | Володимир Горбулін | Volodimir Gorbulin | Volodymyr Horbulin
Bulgarian (Cyrillic) | Търговище | T'rgovishche | Targovishte
Mandarin Chinese (Han) | 深圳 | ShenZhen | Shenzhen
Cantonese Chinese (Han) | 深水埗 | ShenShuiBu | Sham Shui Po
Korean (Hangul) | 화성시 | HwaSeongSi | Hwaseong-si
Korean (Han) | 華城市 | HuaChengShi | Hwaseong-si
Japanese (Hiragana) | さいたま | saitama | Saitama
Japanese (Han) | 埼玉県 | QiYuXian | Saitama-ken
Amharic (Ethiopic) | ደብረ ዘይት | debre zeyt | Debre Zeyit
Tigrinya (Ethiopic) | ደቀምሓረ | dek'emhare | Dekemhare
Arabic | دمنهور | dmnhwr | Damanhur
Armenian | Աբովյան | Abovyan | Abovyan
Georgian | სამტრედია | samt'redia | Samtredia
Hebrew | אברהם הלוי פרנקל | 'vrhm hlvy frnkl | Abraham Halevi Fraenkel
Unified English Braille (Braille) | ⠠⠎⠁⠽⠀⠭⠀⠁⠛ | +say x ag | Say it again
Bengali | ময়মনসিংহ | mymnsimh | Mymensingh
Burmese (Myanmar) | ထန်တလန် | thntln | Thantlang
Gujarati | પોરબંદર | porbmdr | Porbandar
Hindi (Devanagari) | महासमुंद | mhasmumd | Mahasamund
Kannada | ಬೆಂಗಳೂರು | bemgluru | Bengaluru
Khmer | សៀមរាប | siemrab | Siem Reap
Lao | ສະຫວັນນະເຂດ | sahvannaekhd | Savannakhet
Malayalam | കളമശ്ശേരി | klmsseri | Kalamassery
Odia | ଗଜପତି | gjpti | Gajapati
Punjabi (Gurmukhi) | ਜਲੰਧਰ | jlmdhr | Jalandhar
Sinhala | රත්නපුර | rtnpur | Ratnapura
Tamil | கன்னியாகுமரி | knniyakumri | Kanniyakumari
Telugu | శ్రీకాకుళం | srikakulm | Srikakulam
Thai | สงขลา | sngkhla | Songkhla

Symbols | Input | Output
Emojis | 😎 👑 🍎 | :sunglasses: :crown: :apple:
Misc. | ☆ ♯ ♰ ⚄ ⛌ | * # + 5 X
Letterlike | № ℳ ⅋ ⅍ | No M & A/S

Background

Unicode is the foundation for text in all modern software: it’s how all mobile phones, desktops, and other computers represent the text of every language. People are using Unicode every time they type a key on their phone or desktop computer, and every time they look at a web page or text in an application. *

Unicode is the universal character set, a global standard to support all the world's languages. It contains 140,000+ characters used by 150+ scripts, along with emojis and various symbols. It is typically encoded into bytes using UTF-8.

ASCII is the most compatible character set, first standardized in 1963. It is a subset of Unicode and UTF-8 consisting of 128 characters encoded in 7 bits. The printable characters are English letters, digits, and punctuation; the remainder are control characters. The characters found on a standard US keyboard correspond to the printable ASCII characters.
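A short Python check makes the numbers concrete. The helper `is_ascii` is a hypothetical name introduced for this example, and the space character is counted here as printable.

```python
def is_ascii(text: str) -> bool:
    """True if every character fits in the 7-bit ASCII range (0..127)."""
    return all(ord(ch) < 128 for ch in text)


# Printable ASCII runs from space (0x20) through tilde (0x7E).
printable = [chr(cp) for cp in range(0x20, 0x7F)]
# Control characters are 0x00..0x1F plus DEL (0x7F).
control = [chr(cp) for cp in range(0x00, 0x20)] + [chr(0x7F)]

print(len(printable), len(control))  # 95 33

# ASCII characters encode to the same single byte in UTF-8;
# other characters need two or more bytes.
print("A".encode("utf-8"))  # b'A'
print("é".encode("utf-8"))  # b'\xc3\xa9'

print(is_ascii("hello"), is_ascii("héllo"))  # True False
```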

... expressed only in the original non-control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control. *

A language is written using characters from a specific script. A script can be alphabetic, logographic, syllabic, or something else. Some languages use multiple scripts: Japanese uses Kanji, Hiragana, and Katakana. Some scripts are used by multiple languages: Han characters are used in Chinese, Japanese, and Korean. The script used by English and ASCII is known as the Latin script.
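As a rough illustration of text mixing scripts, the sketch below uses the first word of each character's Unicode name as a hint. This is an approximation chosen for the example: Python's standard `unicodedata` module does not expose the Unicode Script property directly, and `script_hint` is a hypothetical helper name.

```python
import unicodedata


def script_hint(ch: str) -> str:
    """Return the first word of the character's Unicode name, e.g. 'HIRAGANA'.

    This is only a hint, not the official Unicode Script property.
    """
    return unicodedata.name(ch, "UNKNOWN").split()[0]


# One Latin letter, one Hiragana, one Han character, one Hangul syllable.
for ch in "aさ埼화":
    print(ch, script_hint(ch))
```

Han characters are reported with the prefix `CJK`, which reflects that the same characters are shared across Chinese, Japanese, and Korean.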

When converting text between languages there are multiple properties that can be preserved:

  • Meaning: Translation
  • Appearance: Preserving the visual appearance of characters when converting between scripts is rarely possible and requires readers to have knowledge of the source language.
  • Sound: Transcription uses the spelling and pronunciation rules of the target language to produce text that will be pronounced as accurately as possible to the original.
  • Spelling: Transliteration converts each character individually using predictable rules. A reversible transliteration allows for reconstruction of the original text by using unique mappings for each character.
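The reversibility property in the last item can be sketched with a tiny table. The five-letter Cyrillic mapping below is a hypothetical example, not a real romanization standard, and it assumes every replacement is a single character so that decoding is unambiguous.

```python
# Hypothetical one-to-one table: each source character has a unique replacement.
FORWARD = {"б": "b", "о": "o", "р": "r", "и": "i", "с": "s"}
REVERSE = {latin: cyrillic for cyrillic, latin in FORWARD.items()}

# If two source characters shared a replacement, the reverse table would
# shrink and the original text could no longer be reconstructed.
assert len(REVERSE) == len(FORWARD)


def to_latin(text: str) -> str:
    return "".join(FORWARD[ch] for ch in text)


def from_latin(text: str) -> str:
    return "".join(REVERSE[ch] for ch in text)


word = "борис"
latin = to_latin(word)
print(latin)                      # boris
print(from_latin(latin) == word)  # True
```

A lossy scheme gives up this guarantee: once both `ö` and `o` map to `o`, the output no longer records which one was in the input.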

Romanization is the conversion into the Latin script using transliteration or transcription or a mix of both. Romanization is most commonly used when representing the names of people and places.

South Korea's Ministry of Culture & Tourism: Clear to anyone, Romanization is for foreigners. Geographical names are Romanized to help foreigners find the place they intend to go to and help them remember cities, villages and mountains they visited and climbed. But it is Koreans who make up the Roman transcription of their proper names to print on their business cards and draw up maps for international tourists. Sometimes, they write the lyrics of a Korean song in Roman letters to help foreigners join in a singing session or write part of a public address (in Korean) in Roman letters for a visiting foreign VIP. In this sense, it is for both foreigners and the local public. The Romanization system must not be a code only for the native English-speaking community here but an important tool for international communication between Korean society, foreign residents in the country and the entire external world. If any method causes much confusion because it is unable to properly reflect the original sound to the extent that different words are transcribed into the same Roman characters too frequently, it definitely is not a good system. *

Details

Comprehensive: Supports as many Unicode characters as possible. The benefits of providing full support even for rare or historic characters outweigh the small overhead of including them.

Simple: Easy to use, understand, and update. Able to be implemented with consistent behavior across multiple different programming languages. Has benefits for performance and data size.

Useful: Provides reasonable approximations of the spelling or pronunciation. Based on popular romanization systems in general use.

Implementations

AnyAscii is implemented in 9 different programming languages.

Go

https://pkg.go.dev/github.com/anyascii/go

import "github.com/anyascii/go"

s := anyascii.Transliterate("άνθρωποι")
// anthropoi

Go 1.10+ compatible

Java

https://jitpack.io/#com.anyascii/anyascii

String s = AnyAscii.transliterate("άνθρωποι");
// anthropoi

Java 6+ compatible

JavaScript

https://npmjs.com/package/any-ascii

// Node.js
const anyAscii = require('any-ascii');

const s = anyAscii('άνθρωποι');
// anthropoi

npm install any-ascii

PHP

https://packagist.org/packages/anyascii/anyascii

$s = AnyAscii::transliterate('άνθρωποι');
// anthropoi

PHP 5.3+ compatible

composer require anyascii/anyascii

Python

https://pypi.org/project/anyascii

from anyascii import anyascii

s = anyascii('άνθρωποι')
assert s == 'anthropoi'

Python 3.3+ compatible

pip install anyascii

Ruby

https://rubygems.org/gems/any_ascii

require 'any_ascii'

s = AnyAscii.transliterate('άνθρωποι')
# anthropoi

Ruby 2.0+ compatible

gem install any_ascii

Rust

https://crates.io/crates/any_ascii

use any_ascii::any_ascii;

let s = any_ascii("άνθρωποι");
// anthropoi

Rust 1.36+ compatible

Install:

# Cargo.toml
[dependencies]
any_ascii = "*"

Install executable: cargo install any_ascii

$ anyascii άνθρωποι
anthropoi

$ echo άνθρωποι | anyascii
anthropoi

Shell

https://raw.githubusercontent.com/anyascii/anyascii/master/sh/anyascii

$ anyascii άνθρωποι
anthropoi

$ echo άνθρωποι | anyascii
anthropoi

POSIX-compliant

.NET

https://nuget.org/packages/AnyAscii

// C#
using AnyAscii;

string s = "άνθρωποι".Transliterate();
// anthropoi

Unidecode

AnyAscii is an alternative to (and inspired by) Unidecode and its many ports. Unidecode was created in 2001 and only supports characters in the BMP. AnyAscii gives better results, supports more than twice as many characters, and often has a smaller file size.

Compare table.tsv and unidecode/table.tsv for a complete comparison between AnyAscii and Unidecode. Note that the Unidecode output has been modified slightly and that unknown characters are replaced by "[?] " while they are removed by AnyAscii.
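The two differences mentioned here, the handling of unknown characters and the BMP limit, can be illustrated without either library. The sketch below is hypothetical: `convert`, `in_bmp`, and the one-entry table are made up for the example and do not reproduce the behavior of AnyAscii or Unidecode beyond the placeholder policy described above.

```python
def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF


def convert(text: str, table: dict, unknown: str = "") -> str:
    """Replace non-ASCII characters via `table`; use `unknown` when there is no entry."""
    return "".join(
        ch if ord(ch) < 128 else table.get(ch, unknown)
        for ch in text
    )


table = {"é": "e"}  # hypothetical one-entry table
text = "é😎"

print(repr(convert(text, table)))                  # 'e'      (unknown removed)
print(repr(convert(text, table, unknown="[?] ")))  # 'e[?] '  (unknown marked)
print(in_bmp("é"), in_bmp("😎"))                   # True False
```

The emoji sits outside the BMP (U+1F60E), so a BMP-only table cannot contain an entry for it regardless of which placeholder policy is used.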

See Also

ALA-LC: Romanization Tables
BGN/PCGN: Guidance on Romanization Systems
Discord: Emojis
ISO: Transliteration Standards
KNAB: Romanization Systems
Thomas T. Pedersen: Transliteration of Non-Roman Scripts
UNGEGN: Working Group on Romanization Systems
Unicode Technical Site
Wikipedia: Romanization
Wiktionary: Romanization
