Encoding Surprises: When pip Assumes Latin-1 Instead of UTF-8
Hardcoded Latin-1 encoding in HTTP auth headers causes UnicodeEncodeError for non-Latin usernames. The fix switches to UTF-8, which handles the full Unicode range.

The bottom line: if your code calls .encode('latin-1') on user-supplied strings, it will raise UnicodeEncodeError for any user whose input contains CJK, Arabic, Cyrillic, or emoji characters.
The Problem
pypa/pip issue #13922 exposes an edge case in how pip encodes credentials when it builds HTTP Basic auth headers. The fix is only 11 lines, but the pattern behind it applies across projects.
PR: https://github.com/pypa/pip/pull/14104
Status: Submitted (awaiting review)
Hardcoded character encodings are a ticking time bomb. When code assumes latin-1 for string encoding, it works for English, German, and most Western European users — but breaks for anyone with Chinese, Japanese, Korean, Arabic, or emoji in their input.
import base64

# Before: Latin-1 breaks non-Latin characters
def basic_auth_header(username, password):
    # Raises UnicodeEncodeError if the credentials contain characters outside Latin-1
    raw = f'{username}:{password}'.encode('latin-1')
    return 'Basic ' + base64.b64encode(raw).decode()

# After: UTF-8 handles the full Unicode range
def basic_auth_header(username, password):
    raw = f'{username}:{password}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode()
How to Reproduce
To trigger this bug yourself:
- Set up credentials with a username containing non-ASCII characters (e.g., 用户名)
- Configure a private PyPI index requiring Basic HTTP auth
- Run pip install with that index and observe UnicodeEncodeError
- Check the stack trace for .encode('latin-1') in the auth header construction
The error path is: username input → HTTP Basic auth header construction → str.encode('latin-1') → crash.
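The same error path can be exercised without pip or a private index. The snippet below is a minimal sketch, not pip's actual code: the helper names `basic_auth_header` and `try_header`, the `encoding` parameter, and the password `secret` are all made up for illustration.

```python
import base64


def basic_auth_header(username: str, password: str, encoding: str) -> str:
    """Build a Basic auth header value, encoding the credentials with `encoding`."""
    raw = f"{username}:{password}".encode(encoding)
    return "Basic " + base64.b64encode(raw).decode("ascii")


def try_header(username: str, encoding: str) -> str:
    """Return the header, or the exception name if the credentials cannot be encoded."""
    try:
        return basic_auth_header(username, "secret", encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


print(try_header("alice", "latin-1"))    # ASCII-only username: a header is produced
print(try_header("用户名", "latin-1"))   # prints "UnicodeEncodeError"
print(try_header("用户名", "utf-8"))     # a header is produced
```

Only the middle call fails, which is why the bug stays invisible as long as every test username is ASCII.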
Why Latin-1 Fails
Latin-1 (ISO 8859-1) encodes exactly 256 characters: the first 128 match ASCII, and the next 128 cover control codes and Western European accented characters. Any code point above U+00FF (č, 东, 😊, etc.) raises UnicodeEncodeError.
UTF-8 encodes the full Unicode range (1,112,064 valid code points) while remaining backward-compatible with ASCII for the first 128 characters. Unless a protocol or file format explicitly requires Latin-1, there is no good reason to choose it over UTF-8 for user-provided strings in modern Python.
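The U+00FF boundary is easy to check directly. This is a small sketch; the helper `fits_latin1` and the sample characters are chosen here for illustration.

```python
def fits_latin1(text: str) -> bool:
    """True if every character in `text` can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# Characters on both sides of the U+00FF boundary.
for ch in ["a", "é", "ÿ", "č", "东", "😊"]:
    utf8_len = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} latin-1 ok: {fits_latin1(ch)}, UTF-8 bytes: {utf8_len}")

# For pure ASCII text, UTF-8 and ASCII produce identical bytes.
print("pip".encode("utf-8") == "pip".encode("ascii"))
```

Everything up to ÿ (U+00FF) encodes as Latin-1; č (U+010D) is the first sample that does not, while UTF-8 encodes all six using one to four bytes each.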
Where This Pattern Hides
The hardcoded Latin-1 pattern appears in:
- HTTP auth helpers — constructing Basic auth headers from user credentials
- Legacy protocol parsers — RFC 2047/2231 parameter encoding
- Binary protocol framing — encoding length-prefixed string fields
- CSV/write paths — explicit encoding choice that never sees non-ASCII test data
Each of these shares the same root cause: the developer tested with ASCII-only data and never hit the failing path.
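One way to find these call sites before a user does is to scan source code for literal Latin-1 codec names. The following is a rough sketch using the standard `ast` and `codecs` modules; the function names are hypothetical, and it only catches `.encode(...)` calls with a string literal, not codec names passed through variables or other APIs.

```python
import ast
import codecs


def is_latin1(name: str) -> bool:
    """True if `name` is any alias of the Latin-1 codec (latin-1, latin1, iso-8859-1, ...)."""
    try:
        return codecs.lookup(name).name == "iso8859-1"
    except LookupError:
        return False


def find_latin1_encodes(source: str) -> list[int]:
    """Return line numbers of `.encode(...)` calls that pass a literal Latin-1 codec name."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "encode"
        ):
            continue
        # The codec may be the first positional argument or the `encoding=` keyword.
        candidates = list(node.args[:1])
        candidates += [kw.value for kw in node.keywords if kw.arg == "encoding"]
        for arg in candidates:
            if isinstance(arg, ast.Constant) and isinstance(arg.value, str) and is_latin1(arg.value):
                hits.append(node.lineno)
    return sorted(hits)


sample = (
    "a = s.encode('latin-1')\n"
    "b = s.encode('utf-8')\n"
    "c = s.encode(encoding='ISO-8859-1')\n"
)
print(find_latin1_encodes(sample))  # [1, 3]
```

Normalizing through `codecs.lookup` means spelling variants of the codec name are treated the same, which a plain text search for `latin-1` would miss.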
Defensive Encoding Checklist
When handling user-supplied strings that need encoding:
- Never hardcode latin-1 for user input; use utf-8 everywhere
- Use errors='replace' as a safety net: s.encode('utf-8', errors='replace') (with UTF-8 this only affects lone surrogates, which become ?)
- Test with non-ASCII data: add at least one CJK or emoji character per string parameter
- Check for .encode() in code review: any explicit encoding argument should justify why UTF-8 won’t work
- Use locale.getencoding() (Python 3.11+) for locale-aware encoding instead of hardcoding
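The testing item in the checklist could look like the sketch below. It assumes a UTF-8 version of the header helper; `decode_basic_auth`, `check_round_trip`, and the sample credentials are invented here purely to check that non-ASCII values survive a round trip.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build a Basic auth header value from UTF-8 encoded credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header: str) -> tuple[str, str]:
    """Inverse of basic_auth_header, used only to verify the round trip."""
    scheme, _, payload = header.partition(" ")
    if scheme != "Basic":
        raise ValueError(f"unexpected auth scheme: {scheme!r}")
    username, _, password = base64.b64decode(payload).decode("utf-8").partition(":")
    return username, password


# One sample per script family, so no code path is exercised with ASCII alone.
USERNAMES = ["alice", "Zoë", "пользователь", "مستخدم", "用户名", "user😊"]


def check_round_trip() -> None:
    for name in USERNAMES:
        header = basic_auth_header(name, "pässwörd")
        assert decode_basic_auth(header) == (name, "pässwörd"), name


check_round_trip()

# With UTF-8, errors='replace' only matters for lone surrogates, which become '?'.
print("\ud800".encode("utf-8", errors="replace"))  # b'?'
```

Note that `errors='replace'` silently alters the data it cannot encode, so for credentials an explicit failure may be preferable to a quietly wrong value.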
Key Takeaway
Never hardcode Latin-1 for user-provided strings. Always use UTF-8 — it’s backward-compatible with ASCII and handles the full Unicode range. The error won’t appear in testing with English data, which is exactly what makes it dangerous: it passes CI and breaks for international users in production.
Discovered while fixing pypa/pip#13922. View the fix post for the specific diff.