Encoding Surprises: When pip Assumes Latin-1 Instead of UTF-8

Hardcoded Latin-1 encoding in HTTP auth headers causes UnicodeEncodeError for non-Latin usernames. The fix switches to UTF-8, which handles the full Unicode range.

The bottom line: if your code calls .encode('latin-1') on user-supplied strings, it will raise UnicodeEncodeError for any user whose input contains CJK, Arabic, Cyrillic, or emoji characters.


The Problem

pypa/pip issue #13922 exposes an edge case in how pip encodes credentials when it builds HTTP auth headers. The fix is only 11 lines, but the pattern behind it applies across projects.

PR: https://github.com/pypa/pip/pull/14104
Status: Submitted (awaiting review)

Hardcoded character encodings are a ticking time bomb. When code assumes latin-1 for string encoding, it works for English, German, and most Western European users — but breaks for anyone with Chinese, Japanese, Korean, Arabic, or emoji in their input.

import base64

# Before: Latin-1 cannot encode any character above U+00FF
def basic_auth_header_latin1(username, password):
    # Raises UnicodeEncodeError if either value has a character outside Latin-1
    raw = f'{username}:{password}'.encode('latin-1')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')

# After: UTF-8 handles the full Unicode range
def basic_auth_header(username, password):
    raw = f'{username}:{password}'.encode('utf-8')
    # Base64 output is pure ASCII, so decoding it as ASCII is always safe
    return 'Basic ' + base64.b64encode(raw).decode('ascii')

How to Reproduce

To trigger this bug yourself:

  1. Pick a username containing characters outside Latin-1 (e.g., 用户名)
  2. Configure a private PyPI index that requires Basic HTTP auth with that username
  3. Run pip install against that index and observe the UnicodeEncodeError
  4. Check the stack trace for .encode('latin-1') in the auth header construction

The error path is: username input → HTTP Basic auth header construction → str.encode('latin-1') → crash.
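The same error path can be exercised without pip or a private index. The sketch below is a minimal stand-in, assuming a hypothetical helper named latin1_basic_auth that mirrors the "Before" pattern; it is not pip's actual code, and try_header is likewise an illustrative name.

```python
import base64


def latin1_basic_auth(username: str, password: str) -> str:
    """Hypothetical helper that hardcodes Latin-1 (illustration only)."""
    raw = f"{username}:{password}".encode("latin-1")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def try_header(username: str, password: str) -> str:
    """Return the header, or a short description of the encoding failure."""
    try:
        return latin1_basic_auth(username, password)
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded
        return f"UnicodeEncodeError at position {exc.start}: {exc.reason}"


ascii_result = try_header("alice", "s3cret")   # ASCII-only input succeeds
cjk_result = try_header("用户名", "s3cret")     # fails on the first CJK character

print(ascii_result)
print(cjk_result)
```

The ASCII credentials produce a normal header, while the CJK username fails at position 0, which is why the bug stays invisible until a real user with such a name appears.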

Why Latin-1 Fails

Latin-1 (ISO 8859-1) covers exactly 256 code points, U+0000 through U+00FF: the first 128 match ASCII, and the upper 128 hold control codes plus Western European letters and symbols. Encoding any code point above U+00FF (č, 东, 😊, etc.) raises UnicodeEncodeError.
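The U+00FF boundary is easy to probe directly. The following sketch uses a hypothetical helper, fits_latin1, and a handful of sample characters chosen for illustration; escape sequences are used so the exact code points are unambiguous.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character in text can be encoded as Latin-1."""
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return False
    return True


# Sample characters on both sides of the U+00FF boundary
SAMPLES = {
    "A (U+0041)": "\u0041",
    "é (U+00E9)": "\u00e9",
    "ÿ (U+00FF)": "\u00ff",
    "č (U+010D)": "\u010d",
    "东 (U+4E1C)": "\u4e1c",
    "😊 (U+1F60A)": "\U0001f60a",
}

results = {label: fits_latin1(ch) for label, ch in SAMPLES.items()}

for label, ok in results.items():
    print(f"{label}: {'encodes' if ok else 'UnicodeEncodeError'}")
```

Everything up to and including ÿ encodes; č, 东, and 😊 do not. Note that č is a Latin-script letter, so "non-Latin" is a slightly misleading shorthand: what matters is the code point, not the script.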

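UTF-8 encodes the full Unicode range (1,112,064 scalar values) while remaining backward-compatible with ASCII for the first 128 characters. Unless a protocol or the receiving server explicitly requires Latin-1, there is rarely a good reason to choose it over UTF-8 for user-provided strings in modern Python. Keep in mind that the two encodings produce different bytes for U+0080 through U+00FF, so the receiving side must decode as UTF-8 as well.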
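A short check makes the UTF-8 properties concrete. This is an illustrative sketch with sample characters picked for this example; it shows the variable byte width, the ASCII compatibility, and the fact that Latin-1 and UTF-8 disagree on the bytes for a character such as é.

```python
# One sample character for each UTF-8 byte width
SAMPLES = {
    "A (U+0041)": "\u0041",
    "é (U+00E9)": "\u00e9",
    "东 (U+4E1C)": "\u4e1c",
    "😊 (U+1F60A)": "\U0001f60a",
}

# Number of bytes each character occupies in UTF-8
utf8_lengths = {label: len(ch.encode("utf-8")) for label, ch in SAMPLES.items()}

# ASCII-only text encodes to identical bytes under ASCII and UTF-8
ascii_compatible = "pip install".encode("utf-8") == "pip install".encode("ascii")

# The same character yields different bytes under Latin-1 and UTF-8
e_acute_latin1 = "\u00e9".encode("latin-1")   # single byte 0xE9
e_acute_utf8 = "\u00e9".encode("utf-8")       # two bytes 0xC3 0xA9

print(utf8_lengths)
print(ascii_compatible, e_acute_latin1, e_acute_utf8)
```

The last comparison is the practical caveat of any Latin-1 to UTF-8 switch: users whose credentials contain characters in the U+0080 to U+00FF range will see different bytes on the wire after the change.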

Where This Pattern Hides

The hardcoded Latin-1 pattern appears in:

  • HTTP auth helpers: constructing Basic auth headers from user credentials
  • Legacy protocol parsers: RFC 2047/2231 parameter encoding
  • Binary protocol framing: encoding length-prefixed string fields
  • CSV and file write paths: an explicit encoding choice that never sees non-ASCII test data

Each of these shares the same root cause: the developer tested with ASCII-only data and never hit the failing path.
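One way to find these spots before a user does is a crude text search over the source. The sketch below is a rough heuristic, not a real linter: find_latin1_encodes and the regular expression are hypothetical names written for this example, and the pattern only catches literal encoding names passed directly to .encode().

```python
import re

# Matches .encode('latin-1'), .encode("latin1"), .encode(encoding='ISO-8859-1'), etc.
LATIN1_ENCODE = re.compile(
    r"\.encode\(\s*(?:encoding\s*=\s*)?['\"](?:latin[-_]?1|iso[-_]?8859[-_]?1)['\"]",
    re.IGNORECASE,
)


def find_latin1_encodes(source: str) -> list:
    """Return (line_number, stripped_line) pairs that hardcode Latin-1."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if LATIN1_ENCODE.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Small made-up snippet to scan
SAMPLE_SOURCE = "\n".join([
    'raw = s.encode("latin-1")',
    'ok = s.encode("utf-8")',
    "x = s.encode(encoding='ISO-8859-1')",
])

hits = find_latin1_encodes(SAMPLE_SOURCE)
for lineno, line in hits:
    print(f"line {lineno}: {line}")
```

A search like this will miss encodings passed through variables or constants, so it complements review and tests rather than replacing them.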

Defensive Encoding Checklist

When handling user-supplied strings that need encoding:

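  1. Never hardcode latin-1 for user input; default to utf-8 unless a protocol mandates otherwise
  2. Use errors='replace' as a safety net only where lossy output is acceptable, such as logging: s.encode('utf-8', errors='replace'). Do not use it on credentials, because a replaced character silently changes the value being sent
  3. Test with non-ASCII data: add at least one CJK or emoji character per string parameter
  4. Check for .encode() in code review: any explicit non-UTF-8 encoding argument should justify why UTF-8 won’t work
  5. Use locale.getencoding() (Python 3.11+) when the encoding should follow the user's locale, such as local files or terminal output, instead of hardcoding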
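Item 3 of the checklist can be turned into a small round-trip check. The sketch below assumes two hypothetical helpers, utf8_basic_auth and decode_basic_auth, written for this example; the sample credentials are made up, and the decoder stands in for a server that also decodes as UTF-8.

```python
import base64


def utf8_basic_auth(username: str, password: str) -> str:
    """Build a Basic auth header using UTF-8 (illustrative helper)."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header: str) -> tuple:
    """Reverse utf8_basic_auth, assuming the peer also decodes as UTF-8."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic":
        raise ValueError(f"unexpected auth scheme: {scheme!r}")
    # Split on the first colon only, so passwords may contain colons
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


# At least one non-ASCII sample per string parameter
SAMPLES = [
    ("alice", "s3cret"),
    ("用户名", "密码"),
    ("jos\u00e9", "p\u00e4ssw\u00f6rt"),
    ("user\U0001f60a", "p:w"),
]

round_trips = [decode_basic_auth(utf8_basic_auth(u, p)) == (u, p) for u, p in SAMPLES]
print(round_trips)
```

Dropping a list like SAMPLES into an existing test suite is usually enough to make a hardcoded Latin-1 path fail in CI instead of in production.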

Key Takeaway

Never hardcode Latin-1 for user-provided strings. Always use UTF-8 — it’s backward-compatible with ASCII and handles the full Unicode range. The error won’t appear in testing with English data, which is exactly what makes it dangerous: it passes CI and breaks for international users in production.


Discovered while fixing pypa/pip#13922. See the fix post for the specific diff.