Fix UTF-8 BOM, Zero-Width Spaces, and MIME Encoding Bugs in Email
According to a 2023 analysis by Mailgun, 3.2% of all business emails contain at least one invisible Unicode character that can break automated processing. This guide shows how to detect UTF-8 BOMs, zero-width spaces, quoted-printable encoding errors, and MIME charset mismatches using the Nylas CLI and standard Unix hex tools. Works with Gmail, Outlook, Exchange, Yahoo, iCloud, and IMAP.
By Pouya Sanooei
When strings match visually but fail programmatically
You write a script that filters emails by subject line. It works for most messages but silently skips one. You copy the subject, paste it into your script, and it still doesn’t match. Two identical-looking strings. Not identical.
Invisible Unicode characters are non-printing code points: they have no visible glyph but still add bytes to the string. Your code sees them; your eyes don’t. The problem affects:
- Subject line filters and search queries
- From/To address matching
- Attachment filenames (especially when downloaded to disk)
- Email body parsing with regex or string operations
- Calendar event titles created from email content
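To make the mismatch concrete, here is a minimal sketch in shell. The `hex_bytes` helper is a hypothetical name introduced for this example, not part of any CLI. It uses `od`, which POSIX specifies, where the rest of this guide uses `xxd`. The escapes `\342\200\213` are the bytes E2 80 8B (a zero-width space) written in octal, because POSIX `printf` does not guarantee `\x` escapes.

```shell
# hex_bytes (hypothetical helper): print a string's bytes as space-separated hex.
hex_bytes() {
  hb_out=$(printf '%s' "$1" | od -An -v -tx1 | tr '\n' ' ' | tr -s ' ')
  hb_out=${hb_out# }   # trim the padding od adds on the left
  hb_out=${hb_out% }   # and the space left by the final newline
  printf '%s\n' "$hb_out"
}

clean='Weekly Report'
# Same text with a zero-width space (E2 80 8B) after "Weekly".
dirty=$(printf 'Weekly\342\200\213 Report')

if [ "$clean" = "$dirty" ]; then
  echo "strings are equal"
else
  echo "strings differ"
  hex_bytes "$clean"
  hex_bytes "$dirty"
fi
```

The second hex line contains `e2 80 8b` between `79` (the letter y) and `20` (the space), which is the only difference between the two strings.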
The usual suspects
| Character | Unicode | Hex bytes (UTF-8) | Common source |
|---|---|---|---|
| UTF-8 BOM | U+FEFF | EF BB BF | Windows text editors, Excel CSV export |
| Non-breaking space | U+00A0 | C2 A0 | Copy-paste from web pages, macOS Option+Space |
| Zero-width space | U+200B | E2 80 8B | Rich text editors, HTML copy-paste |
| Zero-width joiner | U+200D | E2 80 8D | Emoji sequences, Arabic/Hindi text |
| Zero-width non-joiner | U+200C | E2 80 8C | Persian/Arabic text, HTML editors |
| Soft hyphen | U+00AD | C2 AD | Word processors, hyphenation engines |
| Right-to-left override | U+202E | E2 80 AE | Malicious filenames, bidirectional text |
| Word joiner | U+2060 | E2 81 A0 | Typesetting software |
| Smart quotes (left) | U+201C | E2 80 9C | Microsoft Office, macOS auto-correct |
| Em dash | U+2014 | E2 80 94 | Microsoft Office, macOS auto-correct |
Step 1: Extract raw email data
Start by getting the email content as structured JSON. The --json flag gives you access to every field without the email client's rendering layer hiding characters.
# Get the full email as JSON
nylas email read msg_abc123 --json
# Extract just the subject and pipe to hex viewer
nylas email read msg_abc123 --json | jq -r '.subject' | xxd
# Check the From field
nylas email read msg_abc123 --json | jq -r '.from[0].name' | xxd
# Check attachment filenames
nylas email read msg_abc123 --json | jq -r '.attachments[].filename' | xxd
Step 2: Inspect bytes with xxd and hexdump
xxd and hexdump show every byte in the string, including invisible ones. Here is what to look for:
# A clean subject line looks like this:
echo "Weekly Report" | xxd
# 00000000: 5765 656b 6c79 2052 6570 6f72 740a Weekly Report.
# A subject with a hidden UTF-8 BOM at the start:
printf '\xef\xbb\xbfWeekly Report' | xxd
# 00000000: efbb bf57 6565 6b6c 7920 5265 706f 7274 ...Weekly Report
#           ^^^^ ^^
# UTF-8 BOM -- invisible but breaks string comparison
# A subject with a non-breaking space instead of regular space:
printf 'Weekly\xc2\xa0Report' | xxd
# 00000000: 5765 656b 6c79 c2a0 5265 706f 7274 Weekly..Report
#                          ^^^^
# Non-breaking space (U+00A0) instead of 0x20
# A subject with a zero-width space:
printf 'Weekly\xe2\x80\x8b Report' | xxd
# 00000000: 5765 656b 6c79 e280 8b20 5265 706f 7274 Weekly... Report
#                          ^^^^ ^^
# Zero-width space -- completely invisible
Step 3: Automate invisible character detection
You can build a quick detection script that scans email subjects for common invisible characters:
# Scan recent emails for invisible characters in subjects
nylas email list --json --limit 50 | jq -r '.[].subject' | while IFS= read -r subject; do
  # Check for common invisible characters (grep -P requires GNU grep and a UTF-8 locale)
  if printf '%s\n' "$subject" | grep -qP '[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{00AD}\x{2060}\x{202E}]'; then
    echo "FOUND: $subject"
    printf '%s\n' "$subject" | xxd | head -5
    echo "---"
  fi
done
# Or check a single email's subject for non-ASCII bytes
nylas email read msg_abc123 --json | jq -r '.subject' | \
LC_ALL=C grep -P '[^\x20-\x7E]' && echo "Contains non-ASCII" || echo "Clean ASCII"
# Check what encoding the email claims to use
nylas email read msg_abc123 --json | jq '.headers' | grep -i content-type
Step 4: Strip invisible characters
Once you have identified the problem, here are the fixes:
# Strip UTF-8 BOM from a string (the \xHH escapes in these sed commands require GNU sed)
echo "$SUBJECT" | sed 's/^\xEF\xBB\xBF//'
# Replace non-breaking spaces with regular spaces
echo "$SUBJECT" | sed 's/\xC2\xA0/ /g'
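# Remove all zero-width characters (U+200D and U+200C also hold emoji sequences and Persian/Arabic text together, so drop them from the class if you need those intact)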
echo "$SUBJECT" | perl -CSD -pe 's/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}]//g'
# Nuclear option: strip everything outside printable ASCII + common Unicode
# (\n stays in the allowed set so line breaks survive)
echo "$SUBJECT" | perl -CSD -pe 's/[^\n\x20-\x7E\x{00C0}-\x{024F}\x{0400}-\x{04FF}]//g'
# Convert encoding if the text really is in a non-UTF-8 charset
# (running this on text that is already UTF-8 double-encodes it into mojibake)
nylas email read msg_abc123 --json | jq -r '.body' | iconv -f ISO-8859-1 -t UTF-8
# Check file encoding of a downloaded attachment
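file --mime-encoding attachment.csv
# attachment.csv: utf-8
# file reports utf-8 with or without a BOM, so check the first three bytes directly:
head -c 3 attachment.csv | xxd -p
# efbbbf
# Fix it: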
sed -i '1s/^\xEF\xBB\xBF//' attachment.csv
Invisible characters in attachment filenames
Attachment filenames are especially prone to invisible character issues. The filename in the email header might contain right-to-left override characters that make report.pdf appear as fdp.troper in some contexts -- a known attack vector.
# List all attachment filenames from recent emails
nylas email list --json --limit 20 | \
jq -r '.[].attachments[]?.filename // empty' | \
while IFS= read -r filename; do
  # Check each filename for bidirectional control characters
  if printf '%s\n' "$filename" | grep -qP '[\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]'; then
    echo "WARNING: Bidirectional control character in filename: $filename"
    printf '%s\n' "$filename" | xxd
  fi
done
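# Sanitize an attachment filename before using it as a path on disk
# (-l protects the trailing newline; \p{Cc} and \p{Cf} cover control and invisible format characters)
safe_name=$(nylas email read msg_abc123 --json | \
  jq -r '.attachments[0].filename' | \
  perl -CSD -lpe 's/[\p{Cc}\p{Cf}]//g; s{/}{_}g')
echo "Saving as: $safe_name"
Understanding email encoding headers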
Email uses several headers to declare encoding. When these are wrong or missing, invisible character issues multiply:
# Check the Content-Type and charset of an email
nylas email read msg_abc123 --json | jq '{
content_type: .headers["content-type"],
transfer_encoding: .headers["content-transfer-encoding"],
subject_raw: .subject
}'
# Common charset declarations:
# Content-Type: text/plain; charset="UTF-8" -- modern, correct
# Content-Type: text/plain; charset="ISO-8859-1" -- Western European
# Content-Type: text/plain; charset="windows-1252" -- Windows Western
# Content-Type: text/plain; charset="us-ascii" -- sometimes lies about non-ASCII content
The most common encoding bug: an email declares charset=us-ascii or charset=ISO-8859-1 but actually contains UTF-8 text. The non-ASCII bytes get misinterpreted, producing mojibake (garbled characters) or invisible corruption.
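The misinterpretation described above can be reproduced, and often reversed, with `iconv`. This is a sketch under assumptions: the function names `misread_as_latin1` and `repair_mojibake` are hypothetical, and the repair only works when the wrong label really was ISO-8859-1 and every garbled character fits in that charset. If the wrong label was windows-1252, `CP1252` would be the charset to try instead.

```shell
# Simulate a client that trusts a wrong charset=ISO-8859-1 label on UTF-8 bytes.
misread_as_latin1() {
  iconv -f ISO-8859-1 -t UTF-8
}

# Undo that misreading. Only valid if every character fits in ISO-8859-1.
repair_mojibake() {
  iconv -f UTF-8 -t ISO-8859-1
}

# \303\251 is C3 A9, the UTF-8 encoding of "é" (octal for POSIX printf).
printf 'caf\303\251\n' | misread_as_latin1                     # prints: cafÃ©
printf 'caf\303\251\n' | misread_as_latin1 | repair_mojibake   # prints: café
```

The two-character sequence "Ã©" where you expected "é" is a typical sign of UTF-8 bytes decoded as a single-byte Western charset.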
Sanitize email content before feeding to LLMs
If you are piping email content to an LLM (via Nylas CLI's MCP server or subprocess), invisible characters can confuse the model or trigger unexpected behavior:
# Clean email body before sending to an LLM
nylas email read msg_abc123 --json | jq -r '.body' | \
perl -CSD -pe '
s/[\x{200B}-\x{200D}\x{FEFF}\x{2060}]//g; # Remove zero-width chars
s/[\x{202A}-\x{202E}]//g; # Remove bidi overrides
s/\x{00A0}/ /g; # Non-breaking space to space
s/[\x{2018}\x{2019}]/'"'"'/g; # Smart quotes to ASCII
s/[\x{201C}\x{201D}]/"/g; # Smart double quotes
s/\x{2014}/--/g; # Em dash to double hyphen
s/\x{2013}/-/g; # En dash to hyphen
'
Quoted-printable vs base64: when MIME encoding breaks
RFC 2045 defines five Content-Transfer-Encoding values (7bit, 8bit, binary, quoted-printable, and base64), but only two of them transform the data: quoted-printable and base64. Quoted-printable encodes non-ASCII bytes as =XX hex pairs (e.g., =C3=A9 for é). Base64 encodes the entire body as ASCII. When the declared encoding doesn’t match the actual encoding, you get garbled text or invisible corruption.
# Check Content-Transfer-Encoding of an email
nylas email read msg_abc123 --json | jq '{
content_type: .headers["content-type"],
transfer_encoding: .headers["content-transfer-encoding"]
}'
# Decode quoted-printable manually
echo "=C3=A9" | python3 -c "import quopri,sys; sys.stdout.buffer.write(quopri.decodestring(sys.stdin.buffer.read()))"
# Output: é
# Decode base64
echo "w6k=" | base64 -d
# Output: é
# Common bug: email says quoted-printable but body has raw UTF-8
# Mitigation: drop any byte sequences that are not valid UTF-8 (lossy)
nylas email read msg_abc123 --json | jq -r '.body' | iconv -f UTF-8 -t UTF-8//IGNORE
The most common MIME bug: an email declares Content-Transfer-Encoding: quoted-printable but the body contains raw UTF-8 bytes, so the =XX escaping was never applied. According to RFC 2045, quoted-printable requires that any byte outside the 33-126 range (except space and tab), as well as the = sign itself, be encoded. When they’re not, decoders may misinterpret the bytes.
Preventing invisible character issues
- Always use --json when processing email programmatically. The JSON output preserves the exact bytes without terminal rendering.
- Validate encoding before processing. Check the Content-Type charset header and convert if needed with iconv.
- Sanitize user input before composing emails. Strip zero-width characters and normalize Unicode (NFC form) before passing to nylas email send.
- Use binary-safe comparisons in your scripts. Compare bytes, not rendered glyphs.
- Log hex representations when debugging string mismatches. If two strings look the same but are not equal, xxd will show why.
Frequently asked questions
Why does copy-pasting from Gmail add invisible characters?
Gmail's web interface renders email as HTML. When you copy text, the browser includes formatting characters like non-breaking spaces (from &nbsp;), zero-width joiners (from CSS word-break handling), and smart quotes (from automatic substitution). Most of these are invisible when pasted into a terminal or text editor but exist as bytes in the string.
Can invisible characters break email delivery?
In address headers (From, To), yes. If a recipient address contains a zero-width space, the SMTP server will typically reject it as malformed or fail to match the mailbox. In the Subject and body, invisible characters are usually harmless for delivery but can break downstream processing.
How do I tell if a character is invisible vs just a rendering issue?
Pipe the string through xxd. If you see bytes between the visible characters that are not 20 (regular space) or 0a (newline), those are the invisible characters. Cross-reference the hex bytes against the Unicode table above.
Are invisible characters a security risk in email?
Yes. Right-to-left override (U+202E) can disguise malicious filenames: report[RLO]fdp.exe renders as reportexe.pdf. Zero-width characters can bypass content filters and spam detection. Homoglyph attacks use visually similar characters from different Unicode blocks to spoof addresses. Always sanitize filenames and validate sender addresses at the byte level.
Does Nylas CLI normalize Unicode automatically?
Nylas CLI passes through the exact bytes from the email provider. It does not normalize or strip invisible characters, which is the correct behavior -- you want to see what is actually in the email. Use the techniques in this guide to sanitize when needed.
Next steps
- Secure email handling from the CLI -- GPG encryption, sender verification, and safe attachment handling
- Send email from the terminal -- compose and send email with proper encoding
- E2E email testing with Playwright -- verify email content in automated tests
- Full command reference -- every flag, subcommand, and example