Sifr treats every boundary between bytes and text as an explicit, typed decision. There is no locale-derived default encoding, no implicit coercion from bytes to str, and no process-global locale mutation. You name the encoding, you get a typed error if the bytes don’t conform, and you can choose an error-recovery handler if you need one. This design makes text bugs visible at compile time rather than hiding them as runtime surprises.
Encoding: sifr.encoding
sifr.encoding is the byte/text conversion surface. Use it to encode a str to bytes or decode bytes to str with an explicitly named codec.
Codec Descriptors
Build a codec descriptor by calling one of the constructor functions:
```python
from sifr.encoding import ascii, latin1, decode, encode

data: bytes = encode("café", latin1())
text: str = decode(b"caf\xe9", latin1())
print(data)  # b'caf\xe9'
print(text)  # café
```
Available Tier 0 descriptors (always supported):
| Function | Encoding |
|---|---|
| utf8() | UTF-8 |
| utf8_sig() | UTF-8 with BOM |
| ascii() | 7-bit ASCII |
| latin1() | Latin-1 / ISO-8859-1 |
| utf16_le() | UTF-16 little-endian |
| utf16_be() | UTF-16 big-endian |
Tier 1 Windows-125x encodings (e.g., windows1252()) are available through encoding_rs. Tier 2 CJK and UTF-32 are deferred to a future release.
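To make the differences between the Tier 0 encodings concrete, the following sketch prints the byte layout of a single character under several of them. It is plain CPython standard-library code, not the Sifr API, and it assumes that the Sifr descriptors correspond to the CPython codec names shown in the comments, which the text above does not state.

```python
# CPython illustration only; the codec names below are CPython's, not Sifr's.
ch = "é"  # U+00E9

layouts = {
    "utf-8": ch.encode("utf-8"),          # assumed counterpart of utf8()
    "utf-8-sig": ch.encode("utf-8-sig"),  # assumed counterpart of utf8_sig()
    "latin-1": ch.encode("latin-1"),      # assumed counterpart of latin1()
    "utf-16-le": ch.encode("utf-16-le"),  # assumed counterpart of utf16_le()
    "utf-16-be": ch.encode("utf-16-be"),  # assumed counterpart of utf16_be()
}

for codec, raw in layouts.items():
    print(f"{codec:10} {raw!r}")
```

The same character occupies one byte in Latin-1, two in UTF-8 and UTF-16, and five in UTF-8 with a BOM, which is why the choice of codec cannot safely be left to a default.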
Error Handlers
By default, decode raises DecodeError on an invalid byte sequence. Supply an error handler to recover instead:
```python
from sifr.encoding import decode, ascii, replace_decode_handler, DecodeError

try:
    # Replace invalid bytes with U+FFFD (replacement character)
    recovered: str = decode(b"\xffA", ascii(), replace_decode_handler())
    print(recovered)  # "\ufffdA"
except DecodeError as e:
    print(e.message)
```
Error handlers are typed values — you pass them explicitly. Dynamic handler registration by name (as in CPython’s codecs.register_error) is unsupported.
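For comparison, here is a rough CPython analogue of strict versus replacing decode behavior. It uses CPython's string-named errors argument, which is exactly the kind of by-name handler lookup that Sifr replaces with typed handler values, so treat it as an illustration of the semantics rather than of the Sifr API.

```python
# CPython illustration only; Sifr passes a typed handler value instead of a name.
raw = b"\xffA"

# Strict decoding (CPython's default) rejects the non-ASCII byte 0xFF.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print(f"invalid byte at offset {exc.start}")

# The "replace" handler substitutes U+FFFD for each undecodable byte.
recovered = raw.decode("ascii", errors="replace")
print(recovered == "\ufffdA")
```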
Typed Errors
encode raises EncodeError when a character cannot be represented in the target encoding. decode raises DecodeError on invalid byte sequences. Both expose a .message field:
```python
from sifr.encoding import encode, ascii, EncodeError

try:
    _: bytes = encode("café", ascii())  # é is not ASCII
except EncodeError as e:
    print(e.message)
```
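As a point of reference, CPython's equivalent encode failure carries structured fields that locate the offending character. The sketch below shows those fields. Whether Sifr's EncodeError exposes anything beyond .message is not stated here, so the extra attributes are CPython's only.

```python
# CPython illustration only; the attributes used here belong to UnicodeEncodeError.
text = "café"

try:
    text.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    # exc.object is the input string; exc.start and exc.end bound the bad span.
    failure = (exc.encoding, exc.object[exc.start:exc.end], exc.start)

print(failure)
```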
Text File I/O: sifr.io
Pass a codec descriptor to open_text whenever you open a text file. The encoding= parameter is required — omitting it raises SIFR-IO-0801:
```python
from sifr.encoding import latin1
from sifr.io import TextFileHandle, open_text
from sifr.os import remove_file

path: str = "/tmp/sifr_demo.txt"
try:
    writer: TextFileHandle = open_text(path, "w", encoding=latin1())
    writer.write("café")
    writer.close()
    reader: TextFileHandle = open_text(path, "r", encoding=latin1())
    text: str = reader.read()
    reader.close()
    remove_file(path)
    print(text)  # café
except IOError as e:
    print(e.message)
```
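The same round trip can be sketched in CPython to show what actually lands on disk. This is standard-library code, not Sifr. The temporary file path is generated by tempfile, and the final binary read is added only to expose the stored bytes.

```python
# CPython illustration only: an explicit-encoding round trip through a temp file.
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="latin-1") as writer:
        writer.write("café")
    with open(path, "r", encoding="latin-1") as reader:
        text = reader.read()
    # Read the raw bytes back to see the single-byte Latin-1 form of é.
    with open(path, "rb") as raw_reader:
        raw = raw_reader.read()
finally:
    os.remove(path)

print(text)
print(raw)
```

Reading the file back with a different codec than the one used to write it would either fail or silently produce different text, which is the failure mode a required encoding= parameter is meant to prevent.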
See the I/O & Files page for full filesystem coverage.
Unicode: sifr.unicode
sifr.unicode provides normalization, scalar properties, and text segmentation using Unicode 17.0.0 data tables compiled into the Sifr runtime.
Normalization
normalize accepts a normalization form constant and a str, and returns the normalized form:
```python
from sifr.unicode import NFC, NFD, NFKC, NFKD, normalize

composed: str = normalize(NFC, "e\u0301")  # é (single code point)
decomposed: str = normalize(NFD, "é")      # e + combining acute
```
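The difference between canonical forms (NFC, NFD) and compatibility forms (NFKC, NFKD) is easiest to see side by side. The sketch below uses CPython's unicodedata module as a stand-in. Its Unicode data version may differ from the 17.0.0 tables in Sifr, but the characters used here have long-stable mappings.

```python
# CPython illustration only; unicodedata stands in for sifr.unicode here.
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")    # one code point
decomposed = unicodedata.normalize("NFD", "\u00e9")   # base letter + combining mark

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
canonical = unicodedata.normalize("NFC", ligature)       # unchanged
compatibility = unicodedata.normalize("NFKC", ligature)  # folded to "fi"

print(len(composed), len(decomposed))
print(canonical == ligature, compatibility)
```

Canonical forms preserve the distinction between the ligature and the two letters, while compatibility forms fold it away, so the compatibility forms suit search and comparison better than storage.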
Scalar Properties
name returns the Unicode name of a scalar; category returns the two-letter general category:
```python
from sifr.unicode import name, category

snowman: str = name("\u2603")  # SNOWMAN
kind: str = category("A")      # Lu (uppercase letter)
```
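For a broader look at general categories, this CPython sketch lists the name and category of a few characters from different classes. It relies on CPython's unicodedata module rather than sifr.unicode, and the sample characters are chosen arbitrarily for illustration.

```python
# CPython illustration only; unicodedata stands in for sifr.unicode here.
import unicodedata

samples = ["\u2603", "A", "\u0301", "7", " "]
properties = {
    ch: (unicodedata.name(ch), unicodedata.category(ch)) for ch in samples
}

for ch, (ch_name, ch_category) in properties.items():
    print(f"U+{ord(ch):04X} {ch_category} {ch_name}")
```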
Grapheme and Word Segmentation
graphemes splits a string into user-perceived grapheme clusters. words splits into word tokens, filtering punctuation and whitespace:
```python
from sifr.unicode import graphemes, words

clusters: list[str] = graphemes("a\u0301b")  # ["á", "b"]
tokens: list[str] = words("Hi, κόσμε!")      # ["Hi", "κόσμε"]
print(len(clusters))  # 2
print(len(tokens))    # 2
```
Unicode 17.0.0 data covers normalization, names, scalar properties, numeric values, case folding, grapheme boundaries, and word boundaries. Sentence boundaries and streaming segmentation cursors are deferred to a future release.
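To show why segmentation needs Unicode data at all, here is a deliberately simplified CPython sketch. The functions simple_clusters and simple_words are hypothetical helpers, not Sifr APIs, and they do not implement the full UAX #29 rules: the first only attaches combining marks to the preceding character, and the second only collects runs of word characters.

```python
# Simplified illustration; NOT a UAX #29 implementation and not the Sifr API.
import re
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Attach each combining mark to the character that precedes it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def simple_words(text: str) -> list[str]:
    """Collect runs of word characters, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)


clusters = simple_clusters("a\u0301b")
tokens = simple_words("Hi, \u03ba\u03cc\u03c3\u03bc\u03b5!")
print(len("a\u0301b"), len(clusters))
print(tokens)
```

The input has three code points but two user-perceived characters. A real segmenter also has to handle cases this sketch ignores, such as Hangul syllable sequences, emoji joined by zero-width joiners, and regional-indicator flag pairs.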
sifr.unicodedata is not a production API in this release. Do not import it. Use sifr.unicode instead.
Locale and I18n: sifr.i18n
sifr.i18n provides locale-aware number formatting, plural rules, and translation bundles. All state is scoped to explicit objects — there is no global locale, no locale.setlocale, and no gettext.install.
Locale Identifiers
Create a LocaleId from a BCP 47 tag:
```python
from sifr.i18n import LocaleId

locale = LocaleId("en-US")
fr_locale = LocaleId("fr")
```
host_locale() returns the host’s current locale as a read-only LocaleId. It does not enable implicit text encodings: you still have to name a codec at every byte/text boundary.
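A BCP 47 tag is a hyphen-separated sequence of subtags. The sketch below splits out the three most common ones (language, script, region) in plain CPython. The ParsedTag type and split_tag function are hypothetical and say nothing about how LocaleId is implemented. Variants, extensions, and registry validation are ignored.

```python
# Simplified illustration of BCP 47 subtags; not the Sifr LocaleId implementation.
from typing import NamedTuple, Optional


class ParsedTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def split_tag(tag: str) -> ParsedTag:
    """Extract language, optional script, and optional region from a tag."""
    parts = tag.split("-")
    language = parts[0].lower()
    script: Optional[str] = None
    region: Optional[str] = None
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None:
            script = part.title()   # scripts are conventionally title case
        elif is_region and region is None:
            region = part.upper()   # regions are conventionally upper case
    return ParsedTag(language, script, region)


print(split_tag("en-US"))
print(split_tag("zh-Hant-TW"))
```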
Number Formatting
NumberFormatter formats a numeric string according to locale conventions:
```python
from sifr.i18n import LocaleId, NumberFormatter

formatter: NumberFormatter = NumberFormatter(LocaleId("en-US"))
try:
    formatted: str = formatter.format("12345")
    print(formatted)  # 12,345
except Error as e:
    print(e.message)
```
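The grouping step behind an output such as 12,345 can be sketched in a few lines of CPython. The group_digits helper is hypothetical. It takes the separator and group size as explicit parameters, whereas a real formatter would look them up from locale data, and it handles only non-negative integer strings.

```python
# Simplified illustration of digit grouping; not the Sifr NumberFormatter.
def group_digits(digits: str, separator: str = ",", size: int = 3) -> str:
    """Insert a separator between groups of digits, counting from the right."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected a non-negative integer string, got {digits!r}")
    groups: list[str] = []
    while len(digits) > size:
        groups.append(digits[-size:])
        digits = digits[:-size]
    groups.append(digits)
    return separator.join(reversed(groups))


print(group_digits("12345"))
print(group_digits("1234567", separator="."))
```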
Plural Rules
PluralRules selects the grammatical plural category for a given quantity:
```python
from sifr.i18n import LocaleId, PluralRules, PLURAL_CARDINAL

rules: PluralRules = PluralRules(LocaleId("en"), PLURAL_CARDINAL)
try:
    one_cat: str = rules.category("1")  # one
    two_cat: str = rules.category("2")  # other
    print(one_cat, two_cat)
except Error as e:
    print(e.message)
```
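Plural categories come from per-language rules over a quantity's operands. As a hedged CPython sketch, the two hypothetical functions below hand-code the CLDR cardinal rules for English (on a decimal string, where visible fraction digits matter) and for Polish (on non-negative integers only). They illustrate the idea and are not how PluralRules is implemented.

```python
# Hand-written illustration of two CLDR cardinal rules; not the Sifr PluralRules.
def english_cardinal(quantity: str) -> str:
    """English: 'one' when the integer part is 1 with no visible fraction digits."""
    integer_part, _, fraction = quantity.partition(".")
    i = abs(int(integer_part))
    v = len(fraction)  # number of visible fraction digits
    return "one" if i == 1 and v == 0 else "other"


def polish_cardinal(n: int) -> str:
    """Polish, for non-negative integers: 'one', 'few', or 'many'."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


print(english_cardinal("1"), english_cardinal("2"), english_cardinal("1.0"))
print([polish_cardinal(n) for n in (1, 2, 5, 12, 22)])
```

Because "1" and "1.0" fall into different English categories, a plural API needs the quantity's written form and not only its numeric value.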
Translation Bundles
Load translation catalogs from .mo file bytes and compose them into a Translator with an explicit fallback chain:
```python
from sifr.i18n import Bundle, Translator, bundle_from_mo_bytes, translator

# primary_catalog_bytes and fallback_catalog_bytes are assumed to be bytes
# values holding the raw contents of two .mo files, loaded elsewhere.
try:
    primary: Bundle = bundle_from_mo_bytes(primary_catalog_bytes)
    fallback: Bundle = bundle_from_mo_bytes(fallback_catalog_bytes)
    tx: Translator = translator(primary).with_fallback(fallback)
    hello: str = tx.translate("hello")
    files: str = tx.translate_plural("file", "files", 2)
    backup: str = tx.translate("backup")  # falls through to fallback bundle
    print(hello)   # bonjour
    print(files)   # fichiers
    print(backup)  # secours
except Error as e:
    print(e.message)
```
The .mo format is a compatibility backend behind the native Bundle / Translator API. Catalog parsing uses the encoding substrate for declared charsets and rejects unsupported plural expressions with CatalogError.
Build Translator chains in order of preference. with_fallback chains are evaluated left-to-right: the primary bundle is tried first, then each fallback in the order you added it.
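The left-to-right lookup order can be modeled with plain dictionaries. ChainTranslator below is a hypothetical CPython stand-in, not the Sifr Translator. In particular, returning the message id unchanged when no catalog has it is the gettext convention and only an assumption about Sifr's behavior.

```python
# Dictionary-based model of an ordered fallback chain; not the Sifr Translator.
class ChainTranslator:
    def __init__(self, primary: dict[str, str]) -> None:
        self._catalogs: list[dict[str, str]] = [primary]

    def with_fallback(self, catalog: dict[str, str]) -> "ChainTranslator":
        """Return a new translator with one more catalog at the end of the chain."""
        chained = ChainTranslator(self._catalogs[0])
        chained._catalogs = self._catalogs + [catalog]
        return chained

    def translate(self, msgid: str) -> str:
        # Try the primary catalog first, then each fallback in insertion order.
        for catalog in self._catalogs:
            if msgid in catalog:
                return catalog[msgid]
        return msgid  # assumption: untranslated ids pass through, as in gettext


tx = ChainTranslator({"hello": "bonjour"}).with_fallback(
    {"hello": "salut", "backup": "secours"}
)
print(tx.translate("hello"))
print(tx.translate("backup"))
print(tx.translate("missing"))
```

The primary entry for "hello" shadows the fallback entry, while "backup" is found only after the primary lookup misses.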
Full Text and I18n Demo
The following is the complete text/i18n demo from the Sifr repository:
```python
from sifr.encoding import ascii, decode, encode, latin1, replace_decode_handler
from sifr.encoding import DecodeError, EncodeError
from sifr.i18n import (
    Bundle,
    LocaleId,
    NumberFormatter,
    Translator,
    bundle_from_mo_bytes,
    translator,
)
from sifr.io import TextFileHandle, open_text
from sifr.os import remove_file
from sifr.unicode import NFC, category, graphemes, normalize, words
from sifr.unicode import UnicodeDataError


def collect_demo_checks() -> list[bool]:
    checks: list[bool] = []

    # Non-UTF-8 byte/text boundaries are explicit.
    try:
        latin_bytes: bytes = encode("caf\u00e9", latin1())
        checks.append(latin_bytes == b"caf\xe9")
    except EncodeError as e:
        _ = e.message
        checks.append(False)
    try:
        latin_text: str = decode(b"caf\xe9", latin1())
        checks.append(latin_text == "caf\u00e9")
    except DecodeError as e:
        _ = e.message
        checks.append(False)
    try:
        recovered_text: str = decode(b"\xffA", ascii(), replace_decode_handler())
        checks.append(recovered_text == "\ufffdA")
    except DecodeError as e:
        _ = e.message
        checks.append(False)

    # Text I/O uses the same explicit encoding substrate.
    path: str = "/tmp/sifr_text_i18n_demo_latin1.txt"
    try:
        writer: TextFileHandle = open_text(path, "w", encoding=latin1())
        _written: None = writer.write("caf\u00e9")
        writer.close()
        reader: TextFileHandle = open_text(path, "r", encoding=latin1())
        text: str = reader.read()
        reader.close()
        _removed: None = remove_file(path)
        checks.append(text == "caf\u00e9")
    except IOError as e:
        _ = e.message
        checks.append(False)

    # Unicode core and segmentation APIs use Unicode 17.0.0 data.
    try:
        composed: str = normalize(NFC, "e\u0301")
        clusters: list[str] = graphemes("a\u0301b")
        tokens: list[str] = words("Hi, \u03ba\u03cc\u03c3\u03bc\u03b5!")
        letter_category: str = category("A")
        checks.append(composed == "\u00e9")
        checks.append(letter_category == "Lu")
        checks.append(len(clusters) == 2)
        checks.append(len(tokens) == 2)
    except UnicodeDataError as e:
        _ = e.message
        checks.append(False)

    # Locale-sensitive formatting is object-scoped.
    try:
        formatter: NumberFormatter = NumberFormatter(LocaleId("en-US"))
        formatted: str = formatter.format("12345")
        checks.append(len(formatted) > 0)
    except Error as e:
        _ = e.message
        checks.append(False)

    # Translation bundles use explicit fallbacks and plural lookup.
    # primary_catalog_bytes() and fallback_catalog_bytes() are not defined in
    # this listing; they are assumed to return the raw .mo catalog bytes.
    try:
        primary: Bundle = bundle_from_mo_bytes(primary_catalog_bytes())
        fallback: Bundle = bundle_from_mo_bytes(fallback_catalog_bytes())
        tx: Translator = translator(primary).with_fallback(fallback)
        hello: str = tx.translate("hello")
        files: str = tx.translate_plural("file", "files", 2)
        backup: str = tx.translate("backup")
        checks.append(hello == "bonjour")
        checks.append(files == "fichiers")
        checks.append(backup == "secours")
    except Error as e:
        _ = e.message
        checks.append(False)

    return checks
```
What Is Not in This Release
The following CPython-shaped names are intentionally absent and raise diagnostics if imported:
| Rejected import | Use instead |
|---|---|
| codecs, sifr.codecs | sifr.encoding |
| encodings, sifr.encodings | sifr.encoding |
| unicodedata, sifr.unicodedata | sifr.unicode |
| locale, sifr.locale | sifr.i18n |
| gettext, sifr.gettext | sifr.i18n |
These adapters may be reviewed for future phases, but any future compatibility wrappers must wrap the native Sifr substrate without process-global mutation.