
Fix UTF-8 BOM, Zero-Width Spaces, and MIME Encoding Bugs in Email

According to a 2023 analysis by Mailgun, 3.2% of all business emails contain at least one invisible Unicode character that can break automated processing. This guide shows how to detect UTF-8 BOMs, zero-width spaces, quoted-printable encoding errors, and MIME charset mismatches using the Nylas CLI and standard Unix hex tools. Works with Gmail, Outlook, Exchange, Yahoo, iCloud, and IMAP.

By Pouya Sanooei

When strings match visually but fail programmatically

You write a script that filters emails by subject line. It works for most messages but silently skips one. You copy the subject, paste it into your script, and it still doesn’t match. Two identical-looking strings. Not identical.

Invisible Unicode characters are the usual cause. These non-printing code points have no visible glyph but add extra bytes. Your code sees them; your eyes don’t. The problem affects:

  • Subject line filters and search queries
  • From/To address matching
  • Attachment filenames (especially when downloaded to disk)
  • Email body parsing with regex or string operations
  • Calendar event titles created from email content

The usual suspects

Character               Unicode  Hex bytes (UTF-8)  Common source
UTF-8 BOM               U+FEFF   EF BB BF           Windows text editors, Excel CSV export
Non-breaking space      U+00A0   C2 A0              Copy-paste from web pages, macOS Option+Space
Zero-width space        U+200B   E2 80 8B           Rich text editors, HTML copy-paste
Zero-width joiner       U+200D   E2 80 8D           Emoji sequences, Arabic/Hindi text
Zero-width non-joiner   U+200C   E2 80 8C           Persian/Arabic text, HTML editors
Soft hyphen             U+00AD   C2 AD              Word processors, hyphenation engines
Right-to-left override  U+202E   E2 80 AE           Malicious filenames, bidirectional text
Word joiner             U+2060   E2 81 A0           Typesetting software
Smart quotes (left)     U+201C   E2 80 9C           Microsoft Office, macOS auto-correct
Em dash                 U+2014   E2 80 94           Microsoft Office, macOS auto-correct
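
To double-check a row in the table, or to look up a code point that is not listed, the UTF-8 bytes can be derived with plain shell arithmetic. The following is a minimal sketch: utf8_bytes is a hypothetical helper name, not part of the Nylas CLI, and it assumes its argument is a valid code point written as hex digits.

```shell
# utf8_bytes: print the UTF-8 encoding of a code point given as hex digits.
# Hypothetical helper for illustration; uses only POSIX shell arithmetic.
utf8_bytes() {
  cp=$((0x$1))
  if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
    printf '%02X\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02X %02X\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X\n' $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  else                                  # 4 bytes: 11110xxx followed by three 10xxxxxx
    printf '%02X %02X %02X %02X\n' $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_bytes 200B   # E2 80 8B
utf8_bytes FEFF   # EF BB BF
utf8_bytes 00A0   # C2 A0
```

The output uses the same uppercase, space-separated form as the table, so a value can be compared against it directly.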

Step 1: Extract raw email data

Start by getting the email content as structured JSON. The --json flag gives you access to every field without the email client's rendering layer hiding characters.

# Get the full email as JSON
nylas email read msg_abc123 --json

# Extract just the subject and pipe to hex viewer
nylas email read msg_abc123 --json | jq -r '.subject' | xxd

# Check the From field
nylas email read msg_abc123 --json | jq -r '.from[0].name' | xxd

# Check attachment filenames
nylas email read msg_abc123 --json | jq -r '.attachments[].filename' | xxd

Step 2: Inspect bytes with xxd and hexdump

xxd and hexdump show every byte in the string, including invisible ones. Here is what to look for:

# A clean subject line looks like this:
echo "Weekly Report" | xxd
# 00000000: 5765 656b 6c79 2052 6570 6f72 740a       Weekly Report.

# A subject with a hidden UTF-8 BOM at the start:
printf '\xef\xbb\xbfWeekly Report' | xxd
# 00000000: efbb bf57 6565 6b6c 7920 5265 706f 7274  ...Weekly Report
#           ^^^^^^
#           UTF-8 BOM -- invisible but breaks string comparison

# A subject with a non-breaking space instead of regular space:
printf 'Weekly\xc2\xa0Report' | xxd
# 00000000: 5765 656b 6c79 c2a0 5265 706f 7274       Weekly..Report
#                         ^^^^
#                         Non-breaking space (U+00A0) instead of 0x20

# A subject with a zero-width space:
printf 'Weekly\xe2\x80\x8b Report' | xxd
# 00000000: 5765 656b 6c79 e280 8b20 5265 706f 7274  Weekly... Report
#                         ^^^^^^^^
#                         Zero-width space -- completely invisible

Step 3: Automate invisible character detection

You can build a quick detection script that scans email subjects for common invisible characters:

# Scan recent emails for invisible characters in subjects
# (grep -P needs GNU grep and a UTF-8 locale; the BSD grep shipped with macOS has no -P)
nylas email list --json --limit 50 | jq -r '.[].subject' | while IFS= read -r subject; do
  # Check for common invisible characters
  if printf '%s\n' "$subject" | grep -qP '[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{00AD}\x{2060}\x{202E}]'; then
    echo "FOUND: $subject"
    printf '%s\n' "$subject" | xxd | head -5
    echo "---"
  fi
done

# Or check a single email's subject for non-ASCII bytes
nylas email read msg_abc123 --json | jq -r '.subject' | \
  LC_ALL=C grep -P '[^\x20-\x7E]' && echo "Contains non-ASCII" || echo "Clean ASCII"

# Check what encoding the email claims to use
nylas email read msg_abc123 --json | jq '.headers' | grep -i content-type
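
Where grep -P is not available, the same check can be sketched with byte patterns in a POSIX case statement. This is an illustrative alternative, not a Nylas CLI feature: find_invisibles is a hypothetical helper, and its list mirrors the seven characters scanned for above. The bytes are written as octal escapes because POSIX printf does not guarantee \x escapes.

```shell
# find_invisibles: print the code point of each known invisible character
# found in the string "$1"; returns 0 if any was found, 1 otherwise.
# Hypothetical helper for illustration.
find_invisibles() {
  fi_status=1
  for fi_entry in \
    'U+FEFF:\357\273\277' 'U+00AD:\302\255' \
    'U+200B:\342\200\213' 'U+200C:\342\200\214' 'U+200D:\342\200\215' \
    'U+202E:\342\200\256' 'U+2060:\342\201\240'
  do
    fi_seq=$(printf "${fi_entry#*:}")     # octal escapes -> raw UTF-8 bytes
    case $1 in
      *"$fi_seq"*) printf '%s\n' "${fi_entry%%:*}"; fi_status=0 ;;
    esac
  done
  return "$fi_status"
}

find_invisibles "$(printf 'Weekly\342\200\213 Report')"   # prints U+200B
```

Because UTF-8 is self-synchronizing, a byte-level substring match like this cannot produce a false hit in the middle of another character.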

Step 4: Strip invisible characters

Once you have identified the problem, here are the fixes:

# Strip UTF-8 BOM from a string (\xHH escapes are a GNU sed extension)
printf '%s\n' "$SUBJECT" | sed 's/^\xEF\xBB\xBF//'

# Replace non-breaking spaces with regular spaces (GNU sed)
printf '%s\n' "$SUBJECT" | sed 's/\xC2\xA0/ /g'

# Remove all zero-width characters
echo "$SUBJECT" | perl -CSD -pe 's/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}]//g'

# Nuclear option: keep only printable ASCII, accented Latin (U+00C0-U+024F),
# and Cyrillic (U+0400-U+04FF); \n is kept so line endings survive
echo "$SUBJECT" | perl -CSD -pe 's/[^\n\x20-\x7E\x{00C0}-\x{024F}\x{0400}-\x{04FF}]//g'

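# Convert encoding if the text really is in a non-UTF-8 charset.
# jq always emits UTF-8, so run iconv on raw bytes rather than on jq output
# (body.raw is a hypothetical file holding an undecoded ISO-8859-1 body)
iconv -f ISO-8859-1 -t UTF-8 body.raw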

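# Check whether a downloaded attachment starts with a BOM
file attachment.csv
# The description includes "(with BOM)" for such files; exact wording varies by version
head -c 3 attachment.csv | xxd
# The first three bytes are ef bb bf when a BOM is present
# Fix it (GNU sed; BSD/macOS sed needs -i '' and has no \x escapes):
sed -i '1s/^\xEF\xBB\xBF//' attachment.csv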
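
The sed commands above rely on \xHH escapes, which are a GNU extension. A more portable variant, sketched here under the assumption that only the characters from the table need handling, builds the byte sequences with printf octal escapes and runs sed in the C locale so it matches raw bytes. strip_invisibles is a hypothetical helper name.

```shell
# strip_invisibles: filter stdin, removing BOM and zero-width characters
# and turning non-breaking spaces into regular spaces.
# Hypothetical helper; works bytewise, so it does not depend on the locale.
strip_invisibles() {
  bom=$(printf '\357\273\277')      # U+FEFF
  nbsp=$(printf '\302\240')         # U+00A0
  zwsp=$(printf '\342\200\213')     # U+200B
  zwnj=$(printf '\342\200\214')     # U+200C
  zwj=$(printf '\342\200\215')      # U+200D
  wj=$(printf '\342\201\240')       # U+2060
  LC_ALL=C sed -e "s/$bom//g" -e "s/$nbsp/ /g" \
               -e "s/$zwsp//g" -e "s/$zwnj//g" -e "s/$zwj//g" -e "s/$wj//g"
}

printf '\357\273\277Weekly\302\240Re\342\200\213port\n' | strip_invisibles
# prints: Weekly Report
```

Removing U+200D and U+200C also alters emoji sequences and some Persian, Arabic, and Indic text, so a filter like this suits identifiers and subject matching better than message bodies meant for display.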

Invisible characters in attachment filenames

Attachment filenames are especially prone to invisible character issues. A filename in the email header might contain a right-to-left override (U+202E), which reverses the display order of the characters that follow it: a file stored as report[RLO]fdp.exe is displayed as reportexe.pdf -- a known attack vector.
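
To see why a right-to-left override is deceptive, it helps to compare what is stored with what is drawn. The sketch below builds such a name and simulates the display by reversing the part after the override; the variable names are illustrative, and the simulation assumes a bidi-aware renderer and an ASCII-only tail.

```shell
# Build the deceptive name: "report" + U+202E (RLO) + "fdp.exe"
rlo=$(printf '\342\200\256')
stored_name="report${rlo}fdp.exe"

# A bidi-aware display draws everything after the RLO right to left.
# Simulate that by reversing the tail (ASCII only, so rev is safe here).
shown_name="report$(printf '%s\n' 'fdp.exe' | rev)"

# The real extension comes from the stored bytes, not from what is drawn.
real_ext=${stored_name##*.}

printf 'displayed as: %s\n' "$shown_name"   # displayed as: reportexe.pdf
printf 'real extension: %s\n' "$real_ext"   # real extension: exe
```

Any check on the file type therefore has to read the stored bytes; the rendered name is not trustworthy.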

# List all attachment filenames from recent emails
nylas email list --json --limit 20 | \
  jq -r '.[].attachments[]?.filename // empty' | \
  while IFS= read -r filename; do
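    # Check each filename for bidirectional control characters
    # (marks U+200E/200F, embeddings and overrides U+202A-202E, isolates U+2066-2069)
    if printf '%s\n' "$filename" | grep -qP '[\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]'; then
      echo "WARNING: Bidirectional control character in filename: $filename"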
      echo "$filename" | xxd
    fi
  done

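# Sanitize an attachment filename before writing it to disk:
# strip control (Cc) and format (Cf) characters, which covers bidi
# overrides and zero-width characters, and replace path separators
safe_name=$(nylas email read msg_abc123 --json | \
  jq -r '.attachments[0].filename' | \
  perl -CSD -lpe 's/[\p{Cc}\p{Cf}]//g; s{[/\\]}{_}g')
printf '%s\n' "$safe_name"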

Understanding email encoding headers

Email uses several headers to declare encoding. When these are wrong or missing, invisible character issues multiply:

# Check the Content-Type and charset of an email
nylas email read msg_abc123 --json | jq '{
  content_type: .headers["content-type"],
  transfer_encoding: .headers["content-transfer-encoding"],
  subject_raw: .subject
}'

# Common charset declarations:
# Content-Type: text/plain; charset="UTF-8"        -- modern, correct
# Content-Type: text/plain; charset="ISO-8859-1"   -- Western European
# Content-Type: text/plain; charset="windows-1252" -- Windows Western
# Content-Type: text/plain; charset="us-ascii"     -- sometimes lies about non-ASCII content

The most common encoding bug: an email declares charset=us-ascii or charset=ISO-8859-1 but actually contains UTF-8 text. The non-ASCII bytes get misinterpreted, producing mojibake (garbled characters) or corruption that is not visible on screen. For example, the two UTF-8 bytes of é (C3 A9), read as ISO-8859-1, display as Ã©.
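
The misinterpretation can be reproduced at the byte level. The sketch below assumes iconv is installed; misdecode and undo_misdecode are hypothetical names for the two conversion directions, and the undo step only works while the garbled text still contains every original byte.

```shell
# misdecode: treat incoming UTF-8 bytes as if they were ISO-8859-1
misdecode() { iconv -f ISO-8859-1 -t UTF-8; }
# undo_misdecode: reverse that step to recover the original bytes
undo_misdecode() { iconv -f UTF-8 -t ISO-8859-1; }

# "café" in UTF-8 is 63 61 66 c3 a9
printf 'caf\303\251\n' | misdecode | od -An -tx1
# c3 a9 has become c3 83 c2 a9 (the characters Ã and ©)

printf 'caf\303\251\n' | misdecode | undo_misdecode | od -An -tx1
# back to 63 61 66 c3 a9 0a
```

Note the direction of the repair: text that already shows Ã© is fixed by converting from UTF-8 to ISO-8859-1, the opposite of what the declared charset suggests.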

Sanitize email content before feeding to LLMs

If you are piping email content to an LLM (via Nylas CLI's MCP server or subprocess), invisible characters can confuse the model or trigger unexpected behavior:

# Clean email body before sending to an LLM
nylas email read msg_abc123 --json | jq -r '.body' | \
  perl -CSD -pe '
    s/[\x{200B}-\x{200D}\x{FEFF}\x{2060}]//g;  # Remove zero-width chars
    s/[\x{202A}-\x{202E}]//g;                  # Remove bidi embeddings and overrides
    s/\x{00A0}/ /g;                            # Non-breaking space to space
    s/[\x{2018}\x{2019}]/\x27/g;               # Smart single quotes to ASCII apostrophe
    s/[\x{201C}\x{201D}]/"/g;                  # Smart double quotes to ASCII
    s/\x{2014}/--/g;                           # Em dash to double hyphen
    s/\x{2013}/-/g;                            # En dash to hyphen
  '

Quoted-printable vs base64: when MIME encoding breaks

RFC 2045 defines five Content-Transfer-Encoding values (7bit, 8bit, binary, quoted-printable, and base64), but only two of them actually transform the data: quoted-printable and base64. Quoted-printable encodes non-ASCII bytes as =XX hex pairs (e.g., =C3=A9 for é). Base64 encodes the entire body as ASCII. When the declared encoding doesn’t match the actual encoding, you get garbled text or invisible corruption.

# Check Content-Transfer-Encoding of an email
nylas email read msg_abc123 --json | jq '{
  content_type: .headers["content-type"],
  transfer_encoding: .headers["content-transfer-encoding"]
}'

# Decode quoted-printable manually
echo "=C3=A9" | python3 -c "import quopri,sys; sys.stdout.buffer.write(quopri.decodestring(sys.stdin.buffer.read()))"
# Output: é

# Decode base64
echo "w6k=" | base64 -d
# Output: é

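# Common bug: email says quoted-printable but body has raw UTF-8
# iconv cannot repair this; //IGNORE only drops byte sequences that are not
# valid UTF-8, which keeps downstream tools from choking on malformed input
nylas email read msg_abc123 --json | jq -r '.body' | iconv -f UTF-8 -t UTF-8//IGNORE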

The most common MIME bug: an email declares Content-Transfer-Encoding: quoted-printable but the body contains raw UTF-8 bytes that were never converted to =XX sequences. RFC 2045 allows only bytes 33-126 (other than the equals sign, 61) to appear literally, plus space and tab when they are not at the end of a line; every other byte must be encoded. When raw 8-bit bytes appear anyway, decoders may misinterpret them.
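
Since a correctly encoded quoted-printable body consists only of 7-bit bytes, the bug can be spotted by checking for any byte at or above 0x80. A minimal sketch follows; has_raw_8bit is a hypothetical helper name, and it reads the body from stdin.

```shell
# has_raw_8bit: succeed if stdin contains any byte >= 0x80.
# Hypothetical helper for illustration.
has_raw_8bit() {
  # Delete every 7-bit byte; whatever remains is raw 8-bit data.
  n=$(LC_ALL=C tr -d '\000-\177' | wc -c)
  [ "$((n))" -gt 0 ]
}

# Properly encoded quoted-printable text is 7-bit clean
printf 'caf=C3=A9\n' | has_raw_8bit && echo "raw 8-bit bytes" || echo "7-bit clean"
# Raw UTF-8 in a body that claims to be quoted-printable is not
printf 'caf\303\251\n' | has_raw_8bit && echo "raw 8-bit bytes" || echo "7-bit clean"
```

The first command prints "7-bit clean" and the second prints "raw 8-bit bytes". The check only makes sense on the undecoded body; once a client or API has decoded the message, the original transfer encoding is no longer observable.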

Preventing invisible character issues

  • Always use --json when processing email programmatically. The JSON output preserves the exact bytes without terminal rendering.
  • Validate encoding before processing. Check the Content-Type charset header and convert if needed with iconv.
  • Sanitize user input before composing emails. Strip zero-width characters and normalize Unicode (NFC form) before passing to nylas email send.
  • Use binary-safe comparisons in your scripts. Compare bytes, not rendered glyphs.
  • Log hex representations when debugging string mismatches. If two strings look the same but are not equal, xxd will show why.
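
The last two points can be combined into one small routine: compare the strings as bytes and, on a mismatch, log both as hex. This is a sketch under the assumption that od is available; hex and compare_strings are hypothetical helper names.

```shell
# hex: print the bytes of "$1" as space-separated lowercase hex.
# The unquoted $(...) is deliberate: word splitting collapses od's padding.
hex() { echo $(printf '%s' "$1" | od -An -tx1); }

# compare_strings: byte-wise comparison that explains a mismatch.
compare_strings() {
  if [ "$1" = "$2" ]; then
    echo "equal"
    return 0
  fi
  echo "differ"
  echo "  a: $(hex "$1")"
  echo "  b: $(hex "$2")"
  return 1
}

# Looks identical on screen, but the second string has a non-breaking space
compare_strings "Weekly Report" "$(printf 'Weekly\302\240Report')" || echo "mismatch logged"
```

In the hex lines, the regular space appears as 20 in the first string and as c2 a0 in the second, which pinpoints the offending character without relying on how the terminal renders it.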

Frequently asked questions

Why does copy-pasting from Gmail add invisible characters?

Gmail's web interface renders email as HTML. When you copy text, the browser can carry along characters from the underlying markup: non-breaking spaces (from &nbsp;), zero-width spaces or joiners embedded in the HTML, and smart quotes (from automatic substitution). These are invisible or easy to miss when pasted into a terminal or text editor, but they exist as bytes in the string.

Can invisible characters break email delivery?

In address headers (From, To), yes. If a recipient address contains a zero-width space, the receiving SMTP server will typically reject it as malformed or fail to find the mailbox. In the Subject and the body, invisible characters are usually harmless for delivery but can break downstream processing.

How do I tell if a character is invisible vs just a rendering issue?

Pipe the string through xxd. If you see bytes between the visible characters that are not 20 (regular space) or 0a (newline), those are the invisible characters. Cross-reference the hex bytes against the Unicode table above.

Are invisible characters a security risk in email?

Yes. Right-to-left override (U+202E) can disguise malicious filenames: report[RLO]fdp.exe renders as reportexe.pdf. Zero-width characters can bypass content filters and spam detection. Homoglyph attacks use visually similar characters from different Unicode blocks to spoof addresses. Always sanitize filenames and validate sender addresses at the byte level.

Does Nylas CLI normalize Unicode automatically?

Nylas CLI passes through the exact bytes from the email provider. It does not normalize or strip invisible characters, which is the correct behavior -- you want to see what is actually in the email. Use the techniques in this guide to sanitize when needed.


Next steps