TugaPhone — Dialect-aware Portuguese Phonemizer

TugaPhone is a Python library that phonemizes arbitrary Portuguese text across major Lusophone dialects (pt-PT, pt-BR, pt-AO, pt-MZ, pt-TL). It uses a curated phonetic lexicon plus a rule-based fallback to deliver plausible phoneme transcriptions while preserving dialectal variation.

Choveu muito ontem à noite.
pt-PT-x-porto → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈuoj·tɨ 
pt-PT → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈoj·tɨ 
pt-BR → ˈʃɔ·vew mwˈĩ·tʊ õ·ˈtẽ ˈa nˈoj·tʃɪ 
pt-AO → ˈʃɔ·vew mˈũjn·tʊ õ·ˈtẽ ˈa nˈoj·tɨ 
pt-MZ → ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈɔj·tɨ 
pt-TL → ˈʃɔ·vew mˈuj·tʊ õ·ˈtẽ ˈa nˈojtʰ
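
The sample transcriptions above appear to use `·` as a syllable separator and `ˈ` as the primary stress marker. As a minimal sketch under that assumption, the snippet below splits such a string back into words and syllables and reports which syllable carries the stress. `split_transcription` is a hypothetical helper written for this README, not part of the TugaPhone API.

```python
from typing import List, Tuple

SYLLABLE_SEP = "·"  # syllable boundary marker seen in the sample output
STRESS_MARK = "ˈ"   # IPA primary stress marker


def split_transcription(ipa: str) -> List[Tuple[List[str], int]]:
    """Return (syllables, stressed_index) for each space-separated word.

    stressed_index is -1 when a word carries no stress marker.
    """
    words = []
    for word in ipa.split():
        syllables = word.split(SYLLABLE_SEP)
        # First syllable that contains the stress marker, if any.
        stressed = next(
            (i for i, syl in enumerate(syllables) if STRESS_MARK in syl), -1
        )
        words.append((syllables, stressed))
    return words


# The pt-PT line from the example above.
for syllables, stressed in split_transcription("ˈʃɔ·vew mˈũj·tu õ·ˈtẽ ˈa nˈoj·tɨ"):
    print(syllables, stressed)
```

This kind of post-processing can be handy when feeding the output to a TTS front end that expects stress and syllable information as separate fields.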

🚀 Features

Multi-dialect support: European Portuguese (pt-PT), Brazilian Portuguese (pt-BR), Angolan (pt-AO), Mozambican (pt-MZ), and Timorese (pt-TL)
Regional accent modeling: Additional micro-dialects like Porto, Minho, Braga, Trás-os-Montes, and more
Hybrid approach: Combines a curated phonetic lexicon (Portuguese Phonetic Lexicon) with rule-based G2P fallback
Context-aware: Takes part-of-speech tags into account for homograph disambiguation
Number normalization: Automatically converts digits to their Portuguese spoken forms with proper gender agreement
Syllabification: Rule-based syllable boundary detection (~99.6% accuracy on benchmark)
Stress detection: Automatic stress placement following Portuguese phonological rules
IPA output: Full International Phonetic Alphabet transcription with stress markers and syllable boundaries
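
To make the stress detection feature more concrete, here is a minimal sketch of the textbook orthographic stress rules for Portuguese: an acute or circumflex accent marks the stressed syllable, a tilde does so when no other accent is present, unaccented words ending in -a, -e, -o (optionally followed by -s), -am, -em or -ens are stressed on the penultimate syllable, and all other words on the final one. This is not TugaPhone's implementation. `default_stress_index` is a hypothetical name, and the function assumes the word has already been split into syllables.

```python
import unicodedata
from typing import List

# Vowels with an acute or circumflex accent always mark the stressed syllable.
ACCENTED = set("áéíóúâêô")
# A tilde marks stress only when no acute/circumflex is present (e.g. "órfão").
TILDE = set("ãõ")
# Unaccented words with these endings are stressed on the penultimate syllable.
PENULT_ENDINGS = ("a", "e", "o", "as", "es", "os", "am", "em", "ens")


def default_stress_index(syllables: List[str]) -> int:
    """Return the index of the stressed syllable under the orthographic rules."""
    # NFC makes sure accented vowels are single precomposed characters.
    lowered = [unicodedata.normalize("NFC", s.lower()) for s in syllables]
    for marks in (ACCENTED, TILDE):
        for i, syl in enumerate(lowered):
            if any(ch in marks for ch in syl):
                return i
    if len(lowered) == 1:
        return 0
    if "".join(lowered).endswith(PENULT_ENDINGS):
        return len(lowered) - 2
    return len(lowered) - 1


print(default_stress_index(["ga", "to"]))                     # 0
print(default_stress_index(["fa", "lar"]))                    # 1
print(default_stress_index(["fri", "go", "rí", "fi", "co"]))  # 2
```

The sketch ignores unstressed clitics and other exceptions, which is one reason a real phonemizer pairs rules like these with a lexicon.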

📦 Installation

pip install tugaphone

🧰 Usage

Companion libraries

The following libraries are dependencies of TugaPhone and may also be useful on their own:

Tugalex - Lexicon of words and exceptions
TugaTagger - Portuguese part-of-speech tagger
silabificador - Portuguese syllabification

Basic Phonemization

from tugaphone import TugaPhonemizer

ph = TugaPhonemizer()

sentences = [
    "O gato dorme.",
    "Tu falas português muito bem.",
    "O comboio chegou à estação.",
    "A menina comeu o pão todo.",
    "Vou pôr a manteiga no frigorífico."
]

for s in sentences:
    print(f"Sentence: {s}")
    for code in ["pt-PT", "pt-BR", "pt-AO", "pt-MZ", "pt-TL"]:
        phones = ph.phonemize_sentence(s, code)
        print(f"  {code} → {phones}")
    print("-----")

Regional Dialects

from tugaphone import TugaPhonemizer
from tugaphone.regional import PortoDialect, MinhoDialect, BragaDialect

ph = TugaPhonemizer()

sentence = "O Porto é uma cidade bonita."

# Standard European Portuguese
print(f"pt-PT: {ph.phonemize_sentence(sentence, 'pt-PT')}")

# Porto accent (rising diphthongs, rhotic realization)
print(f"Porto: {ph.phonemize_sentence(sentence, regional_dialect=PortoDialect)}")

# Minho accent (resistance to vowel reduction, open vowels)
print(f"Minho: {ph.phonemize_sentence(sentence, regional_dialect=MinhoDialect)}")

Number Normalization

from tugaphone.number_utils import normalize_numbers

# Automatic gender agreement
print(normalize_numbers("vou comprar 1 casa"))    # uma casa
print(normalize_numbers("vou comprar 2 casas"))   # duas casas
print(normalize_numbers("vou adotar 1 cão"))      # um cão
print(normalize_numbers("vou adotar 2 cães"))     # dois cães

# Ordinals
print(normalize_numbers("1º lugar"))              # primeiro lugar
print(normalize_numbers("1ª vez"))                # primeira vez

# Large numbers with scale differences
print(normalize_numbers("897654356789098", "pt-PT"))  # long-scale (biliões)
print(normalize_numbers("897654356789098", "pt-BR"))  # short-scale (trilhões)
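
The scale difference in the last two calls comes down to which power of ten each scale word names: in the long scale used in Portugal, "bilião" is 10^12, while in the short scale used in Brazil, "bilhão" is 10^9 and "trilhão" is 10^12. The sketch below only illustrates that arithmetic. It is not how `normalize_numbers` is implemented, and `leading_scale` and the two tables are hypothetical names.

```python
from typing import Dict, Tuple

# Plural scale words keyed by power of ten.
# Long scale (pt-PT): 10^9 has no word of its own and is read as "mil milhões".
LONG_SCALE_PT: Dict[int, str] = {6: "milhões", 12: "biliões", 18: "triliões"}
# Short scale (pt-BR): a new word every three orders of magnitude.
SHORT_SCALE_BR: Dict[int, str] = {
    6: "milhões", 9: "bilhões", 12: "trilhões", 15: "quatrilhões",
}


def leading_scale(n: int, scale: Dict[int, str]) -> Tuple[int, str]:
    """Return (count, scale_word) for the largest scale word that fits n."""
    for power in sorted(scale, reverse=True):
        unit = 10 ** power
        if n >= unit:
            return n // unit, scale[power]
    return n, ""


n = 897654356789098
print(leading_scale(n, LONG_SCALE_PT))   # (897, 'biliões')
print(leading_scale(n, SHORT_SCALE_BR))  # (897, 'trilhões')
```

The same digits therefore start with "897 biliões" in pt-PT and "897 trilhões" in pt-BR, which is why the dialect code has to be passed to the normalizer.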

Advanced: Tokenization and Features

from tugaphone.tokenizer import Sentence
from tugaphone.dialects import EuropeanPortuguese

sentence = Sentence("O cão comeu o pão.", dialect=EuropeanPortuguese())

print(f"IPA: {sentence.ipa}")

# Access word-level details
for word in sentence.words:
    print(f"\nWord: {word.surface}")
    print(f"  Syllables: {'.'.join(word.syllables)}")
    print(f"  Stress: syllable {word.stressed_syllable_idx}")
    print(f"  IPA: {word.ipa}")
    
    # Access grapheme-level details
    for grapheme in word.graphemes:
        if grapheme.is_diphthong:
            print(f"  Diphthong: {grapheme.surface} → {grapheme.ipa}")

📖 Documentation

Supported Dialects

| Dialect Code | Region | Characteristics |
|---|---|---|
| `pt-PT` | European Portuguese (Lisbon) | Heavy vowel reduction, fricative palatalization, uvular /r/ |
| `pt-BR` | Brazilian Portuguese (Rio) | Less vowel reduction, t/d palatalization, l-vocalization |
| `pt-AO` | Angolan Portuguese (Luanda) | Moderate vowel reduction, alveolar trill /r/, Bantu substrate |
| `pt-MZ` | Mozambican Portuguese (Maputo) | Similar to European with regional variation, Bantu influence |
| `pt-TL` | Timorese Portuguese (Dili) | Conservative pronunciation, Tetum substrate influence |

Regional Accents (Experimental)

TugaPhone includes experimental support for sub-regional Portuguese accents:

PortoDialect: Rising diphthongs (o → uo), rhotic realization
MinhoDialect: Reduced vowel centralization, open vowel preference
BragaDialect: Palatal epenthesis (abelha → abeilha)
TrasMontanoDialect: Palatal affrication, s-voicing, final nasal denasalization
FafeDialect: Nasal diphthongization (gente → geinte)

Note: These are based on documented phonological features but should be considered approximate. Real-world variation is more complex.
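
One simple way to picture features like the Braga and Fafe examples above is as ordered rewrite rules. The sketch below is purely illustrative: it works on spelling rather than phonemes, the regex patterns are generalized from the single examples given in the list (an assumption, not documented behavior), and `apply_rules`, `BRAGA_RULES` and `FAFE_RULES` are hypothetical names that do not exist in TugaPhone.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (regex pattern, replacement)

# Toy rules generalized from one example each; real accents are more complex.
BRAGA_RULES: List[Rule] = [(r"e(?=lh)", "ei")]      # abelha -> abeilha
FAFE_RULES: List[Rule] = [(r"en(?=[tdcg])", "ein")]  # gente -> geinte


def apply_rules(word: str, rules: List[Rule]) -> str:
    """Apply each rewrite rule in order and return the transformed word."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


print(apply_rules("abelha", BRAGA_RULES))  # abeilha
print(apply_rules("gente", FAFE_RULES))    # geinte
```

Keeping each accent as a small rule list makes it easy for native speakers to review and correct individual rules.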

Part-of-Speech Tagging

TugaPhone uses POS tags to disambiguate homographs:

from tugaphone import TugaPhonemizer

ph = TugaPhonemizer(postag_engine="spacy")  # or "brill", "auto"

# "para" has different pronunciations as preposition vs. verb
print(ph.phonemize_sentence("Vou para casa."))      # preposition
print(ph.phonemize_sentence("Ele para o carro."))   # verb

Supported engines:

spacy: Requires spacy and Portuguese model (most accurate)
brill: Requires brill-postaggers (lighter, faster)
lexicon: Uses built-in lexicon lookup (limited coverage)
auto: Falls back through available engines
dummy: Simple rule-based fallback (no dependencies)
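
The `auto` behavior described above can be pictured as trying each engine in preference order and keeping the first one whose dependency is importable. The following is a sketch of that idea only, not TugaPhone's code: `pick_engine` is a hypothetical helper, and the preference order and module names passed to it are assumptions supplied by the caller.

```python
from importlib.util import find_spec
from typing import Iterable, Tuple


def pick_engine(
    candidates: Iterable[Tuple[str, str]], fallback: str = "dummy"
) -> str:
    """Return the first engine whose top-level module can be imported.

    candidates is an ordered iterable of (engine_name, module_name) pairs.
    """
    for engine, module in candidates:
        try:
            if find_spec(module) is not None:
                return engine
        except (ImportError, ValueError):
            # Treat a module that cannot be inspected as unavailable.
            continue
    return fallback


# "spacy" is the real import name of spaCy; it is chosen only if installed.
print(pick_engine([("spacy", "spacy")]))
```

Checking with `find_spec` avoids importing a heavy dependency just to find out whether it is present.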

🏗️ Architecture

TugaPhone uses a hierarchical tokenization model:

Sentence → Words → Graphemes → Characters

Each level applies context-sensitive phonological rules:

Character level: Vowel quality, consonant allophones
Grapheme level: Digraphs (ch, nh), diphthongs (ai, ou)
Word level: Stress assignment, syllabification
Sentence level: Prosodic boundaries (future: liaison, phrasal stress)
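
As a rough picture of the hierarchy, the sketch below models Sentence → Words → Graphemes → Characters with plain dataclasses. The `Toy*` classes are hypothetical and much simpler than the real `tugaphone.tokenizer` classes. Grapheme splitting here is a naive greedy match over the digraphs and diphthongs named above, with no phonological context.

```python
from dataclasses import dataclass
from typing import List

DIGRAPHS = ("ch", "nh")
DIPHTHONGS = ("ai", "ou")
PUNCTUATION = ".,;:!?"


@dataclass
class ToyGrapheme:
    surface: str

    @property
    def characters(self) -> List[str]:
        return list(self.surface)

    @property
    def is_digraph(self) -> bool:
        return self.surface.lower() in DIGRAPHS

    @property
    def is_diphthong(self) -> bool:
        return self.surface.lower() in DIPHTHONGS


@dataclass
class ToyWord:
    surface: str

    @property
    def graphemes(self) -> List[ToyGrapheme]:
        out, i = [], 0
        while i < len(self.surface):
            pair = self.surface[i:i + 2]
            # Greedy: take two characters when they form a known unit.
            step = 2 if pair.lower() in DIGRAPHS + DIPHTHONGS else 1
            out.append(ToyGrapheme(self.surface[i:i + step]))
            i += step
        return out


@dataclass
class ToySentence:
    surface: str

    @property
    def words(self) -> List[ToyWord]:
        stripped = (w.strip(PUNCTUATION) for w in self.surface.split())
        return [ToyWord(w) for w in stripped if w]


for word in ToySentence("O pai achou ouro.").words:
    print(word.surface, [g.surface for g in word.graphemes])
```

Each level only needs to know about the level directly below it, which is what lets context-sensitive rules be attached per level.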

The phonemization process:

Normalize text (numbers → words)
POS tagging (for homograph disambiguation)
Lexicon lookup (for known words)
Rule-based G2P fallback (for unknown words)
Dialect-specific transformations (regional accents)
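
The five steps above can be sketched as a single function that takes each stage as a callable. This is a structural illustration under stated assumptions, not the library's code: `run_pipeline` is a hypothetical name, keying the lexicon by (word, POS tag) is an assumption, and the stub stages return placeholder strings rather than real transcriptions.

```python
from typing import Callable, Dict, List, Tuple

Lexicon = Dict[Tuple[str, str], str]  # (lowercased word, POS tag) -> phones


def run_pipeline(
    text: str,
    normalize: Callable[[str], str],
    tag: Callable[[List[str]], List[Tuple[str, str]]],
    lexicon: Lexicon,
    g2p: Callable[[str], str],
    dialect_rules: Callable[[str], str],
) -> List[str]:
    tokens = normalize(text).split()      # 1. normalize text
    tagged = tag(tokens)                  # 2. POS tagging
    output = []
    for word, pos in tagged:
        phones = lexicon.get((word.lower(), pos))  # 3. lexicon lookup
        if phones is None:
            phones = g2p(word)            # 4. rule-based fallback
        output.append(dialect_rules(phones))       # 5. dialect-specific pass
    return output


# Stub stages with placeholder output, just to show the control flow.
result = run_pipeline(
    "2 gatos",
    normalize=lambda t: t.replace("2", "dois"),
    tag=lambda tokens: [(tok, "X") for tok in tokens],
    lexicon={("dois", "X"): "<lex:dois>"},
    g2p=lambda w: f"<g2p:{w}>",
    dialect_rules=lambda phones: phones,
)
print(result)  # ['<lex:dois>', '<g2p:gatos>']
```

Passing the stages in as callables keeps the ordering explicit and makes it easy to swap a single stage, for example a different POS tagger.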

⚠️ Limitations & Future Work

Current Limitations

Lexicon coverage: Many words (especially names, foreign words, neologisms) rely solely on rule-based fallback
Sparse coverage: African and Timorese dialects have less lexicon data than European/Brazilian
Lexical variation: Dialect-specific vocabulary (e.g., "trem" vs "comboio") is not handled; text is assumed orthographically consistent
Regional accents: Sub-regional dialects are experimental and approximate
Prosody: Sentence-level features (liaison, phrasal stress, intonation) are simplified
Homograph disambiguation: Limited to POS-based rules; doesn't handle semantic context

🤝 Contributing

Contributions are welcome! Areas where help is especially needed:

Lexicon expansion: Especially for pt-AO, pt-MZ, pt-TL
Regional accent validation: Native speaker verification of dialectal features
Test cases: Edge cases, challenging words, dialectal examples
Documentation: Usage examples, linguistic explanations

📄 License

This project is licensed under the Apache License 2.0. See LICENSE for details.
