
Fix UTF-8 BOM, Zero-Width Spaces, and MIME Bugs

According to a 2023 analysis by Mailgun, 3.2% of all business emails contain at least one invisible Unicode character that can break automated processing. This guide shows how to detect UTF-8 BOMs, zero-width spaces, quoted-printable encoding errors, and MIME charset mismatches using the Nylas CLI and standard Unix hex tools. Works across all major email providers.

Written by Pouya Sanooei, Software Engineer

Reviewed by Hazik

Updated May 10, 2026

Verified — CLI 3.1.1 · Gmail, Outlook · last tested April 11, 2026

When strings match visually but fail programmatically

Invisible Unicode characters cause string comparisons to fail even when two values look identical on screen. Non-printing code points like zero-width spaces (U+200B) and UTF-8 byte order marks (U+FEFF) add extra bytes with no visible glyph. According to a 2023 analysis by Mailgun, 3.2% of business emails contain at least one such character.

You write a script that filters emails by subject line. It works for most messages but silently skips one. You copy the subject, paste it into your script, and it still doesn’t match. The bytes are different even though the rendered text is identical. The problem affects:

Subject line filters and search queries
From/To address matching
Attachment filenames (especially when downloaded to disk)
Email body parsing with regex or string operations
Calendar event titles created from email content

The usual suspects

Ten characters cause most email processing failures. The Unicode Standard defines over 149,000 characters across 161 scripts, but the following 10 code points account for the vast majority of invisible-character bugs in email headers and bodies. Eight are invisible or render as ordinary whitespace; the smart quote and em dash are visible but easily mistaken for their ASCII look-alikes. Each entry shows the UTF-8 hex bytes to search for with xxd or hexdump.

Character	Unicode	Hex bytes (UTF-8)	Common source
UTF-8 BOM	U+FEFF	`EF BB BF`	Windows text editors, Excel CSV export
Non-breaking space	U+00A0	`C2 A0`	Copy-paste from web pages, macOS Option+Space
Zero-width space	U+200B	`E2 80 8B`	Rich text editors, HTML copy-paste
Zero-width joiner	U+200D	`E2 80 8D`	Emoji sequences, Arabic/Hindi text
Zero-width non-joiner	U+200C	`E2 80 8C`	Persian/Arabic text, HTML editors
Soft hyphen	U+00AD	`C2 AD`	Word processors, hyphenation engines
Right-to-left override	U+202E	`E2 80 AE`	Malicious filenames, bidirectional text
Word joiner	U+2060	`E2 81 A0`	Typesetting software
Smart quotes (left)	U+201C	`E2 80 9C`	Microsoft Office, macOS auto-correct
Em dash	U+2014	`E2 80 94`	Microsoft Office, macOS auto-correct

Step 1: Extract raw email data

Extracting raw email data as JSON bypasses the rendering layer that hides invisible characters. Email clients strip or normalize non-printing code points during display, so a visual inspection never reveals the actual bytes. The Nylas CLI's --json flag returns every field as raw UTF-8, preserving the exact byte sequence from the provider.

RFC 5322 limits each header line to 998 characters, excluding the terminating CRLF. The characters in the table above take 2 or 3 bytes each in UTF-8, so even one zero-width space in a subject line adds 3 bytes that are invisible in terminal output but present in the data.

# Get the full email as JSON
nylas email read msg_abc123 --json

# Extract just the subject and pipe to hex viewer
nylas email read msg_abc123 --json | jq -r '.subject' | xxd

# Check the From field
nylas email read msg_abc123 --json | jq -r '.from[0].name' | xxd

# Check attachment filenames
nylas email read msg_abc123 --json | jq -r '.attachments[].filename' | xxd

Step 2: Inspect bytes with xxd and hexdump

Hex viewers like xxd and hexdump -C display every byte in a string, making invisible characters visible as hex pairs. A regular ASCII space is byte 20, but a non-breaking space is the two-byte sequence C2 A0 and a zero-width space is the three-byte sequence E2 80 8B. Comparing hex output against the Unicode table above identifies the exact invisible character.

The examples below show the difference between clean text and text contaminated with UTF-8 BOM, non-breaking spaces, and zero-width spaces. Each example includes the xxd output with the offending bytes marked.

# A clean subject line looks like this:
echo "Weekly Report" | xxd
# 00000000: 5765 656b 6c79 2052 6570 6f72 740a       Weekly Report.

# A subject with a hidden UTF-8 BOM at the start:
printf '\xef\xbb\xbfWeekly Report' | xxd
# 00000000: efbb bf57 6565 6b6c 7920 5265 706f 7274  ...Weekly Report
#           ^^^^^^
#           UTF-8 BOM -- invisible but breaks string comparison

# A subject with a non-breaking space instead of regular space:
printf 'Weekly\xc2\xa0Report' | xxd
# 00000000: 5765 656b 6c79 c2a0 5265 706f 7274       Weekly..Report
#                         ^^^^
#                         Non-breaking space (U+00A0) instead of 0x20

# A subject with a zero-width space:
printf 'Weekly\xe2\x80\x8b Report' | xxd
# 00000000: 5765 656b 6c79 e280 8b20 5265 706f 7274  Weekly... Report
#                         ^^^^^^^^
#                         Zero-width space -- completely invisible

Step 3: Automate invisible character detection

Automated scanning catches invisible characters across an entire mailbox instead of inspecting one email at a time. A grep pattern targeting 7 common invisible code points (U+200B, U+200C, U+200D, U+FEFF, U+00AD, U+2060, and U+202E) flags contaminated subjects in seconds. Running this against 50 recent messages typically surfaces 1-2 matches in a business mailbox, based on the 3.2% prevalence rate from Mailgun's data.

The script below pipes Nylas CLI JSON output through jq to extract subjects, then uses Perl-compatible regex (grep -P) to match those code points. The -P flag is a GNU grep feature and is not available in the BSD grep that ships with macOS. A second example checks whether a single email's subject contains any bytes outside printable ASCII, and a third shows the Content-Type the email declares.

# Scan recent emails for invisible characters in subjects
nylas email list --json --limit 50 | jq -r '.[].subject' | while IFS= read -r subject; do
  # Check for common invisible characters
  if echo "$subject" | grep -qP '[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{00AD}\x{2060}\x{202E}]'; then
    echo "FOUND: $subject"
    echo "$subject" | xxd | head -5
    echo "---"
  fi
done

# Or check a single email's subject for bytes outside printable ASCII
# (-q keeps grep from echoing the matching line)
nylas email read msg_abc123 --json | jq -r '.subject' | \
  LC_ALL=C grep -qP '[^\x20-\x7E]' && echo "Contains non-ASCII" || echo "Clean ASCII"

# Check what encoding the email claims to use
nylas email read msg_abc123 --json | jq '.headers' | grep -i content-type

Step 4: Strip invisible characters

Stripping invisible characters requires different tools depending on the character type. A UTF-8 BOM is a fixed 3-byte prefix (EF BB BF) that sed can remove in a single substitution. Non-breaking spaces are 2 bytes each (C2 A0) and map cleanly to regular spaces. Zero-width characters span 5 Unicode code points (U+200B, U+200C, U+200D, U+FEFF, U+2060) and need Perl's Unicode-aware regex to match reliably. The \xHH escapes in the sed commands are a GNU sed extension; BSD sed on macOS does not support them.

The iconv utility handles charset conversion when the body really is encoded in a legacy charset such as ISO-8859-1 and needs to become UTF-8. Running the same conversion on bytes that are already UTF-8 double-encodes them. The file command detects BOM presence in downloaded attachments by adding "(with BOM)" to its description; file --mime-encoding typically prints plain utf-8 whether or not a BOM is present.

# Strip UTF-8 BOM from a string (the \xHH escapes require GNU sed)
echo "$SUBJECT" | sed 's/^\xEF\xBB\xBF//'

# Replace non-breaking spaces with regular spaces
echo "$SUBJECT" | sed 's/\xC2\xA0/ /g'

# Remove all zero-width characters
echo "$SUBJECT" | perl -CSD -pe 's/[\x{200B}\x{200C}\x{200D}\x{FEFF}\x{2060}]//g'

# Nuclear option: strip everything outside printable ASCII + common Unicode
# (\n stays in the class so the line ending is not stripped too)
echo "$SUBJECT" | perl -CSD -pe 's/[^\n\x20-\x7E\x{00C0}-\x{024F}\x{0400}-\x{04FF}]//g'

# Convert encoding if the email uses a non-UTF-8 charset
nylas email read msg_abc123 --json | jq -r '.body' | iconv -f ISO-8859-1 -t UTF-8

# Check a downloaded attachment for a BOM
file attachment.csv
# The description includes "(with BOM)" when the file starts with EF BB BF
# Fix it:
sed -i '1s/^\xEF\xBB\xBF//' attachment.csv

Invisible characters in attachment filenames

Attachment filenames are a primary vector for invisible-character attacks because mail clients display the filename without revealing its raw bytes. The right-to-left override character (U+202E) reverses the visual rendering of subsequent text, so a file named report[U+202E]fdp.exe displays as reportexe.pdf in many file managers. Unicode Technical Report #36 covers this class of bidirectional spoofing. Five bidirectional formatting characters (U+202A through U+202E: two embeddings, two overrides, and the pop directional formatting character that ends them) can manipulate filename display, as can the directional isolates U+2066 through U+2069.

The script below extracts attachment filenames from recent emails via Nylas CLI and checks each one for bidirectional override characters using Perl-compatible regex. Any match warrants manual inspection of the file before opening it.

# List all attachment filenames from recent emails
nylas email list --json --limit 20 | \
  jq -r '.[].attachments[]?.filename // empty' | \
  while IFS= read -r filename; do
    # Check each filename for suspicious characters
    if echo "$filename" | grep -qP '[\x{200E}\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}]'; then
      echo "WARNING: Bidirectional override in filename: $filename"
      echo "$filename" | xxd
    fi
  done

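# Sanitize an attachment filename before saving the file under that name
# (strip by code point; a byte-level tr would break multi-byte UTF-8)
nylas email read msg_abc123 --json | \
  jq -r '.attachments[0].filename' | \
  perl -CSD -pe 's/[\x{0080}-\x{009F}\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2060}\x{2066}-\x{2069}\x{FEFF}]//g'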

Understanding email encoding headers

Email encoding headers declare the character set and transfer encoding used in the message body. RFC 2045 defines Content-Type (which specifies the charset, such as UTF-8 or ISO-8859-1) and Content-Transfer-Encoding (which specifies how the body is encoded for transport: 7bit, 8bit, binary, quoted-printable, or base64). When these headers are wrong or missing, mail clients misinterpret non-ASCII bytes, producing mojibake or silent invisible-character corruption.

ISO-8859-1 covers 191 printable characters from Western European languages. UTF-8 can encode every Unicode code point, including all 149,813 characters in Unicode 15.1. The most common encoding bug happens when an email declares charset=us-ascii (which covers only 95 printable characters) but the body contains multi-byte UTF-8 sequences.

# Check the Content-Type and charset of an email
nylas email read msg_abc123 --json | jq '{
  content_type: .headers["content-type"],
  transfer_encoding: .headers["content-transfer-encoding"],
  subject_raw: .subject
}'

# Common charset declarations:
# Content-Type: text/plain; charset="UTF-8"        -- modern, correct
# Content-Type: text/plain; charset="ISO-8859-1"   -- Western European
# Content-Type: text/plain; charset="windows-1252" -- Windows Western
# Content-Type: text/plain; charset="us-ascii"     -- sometimes lies about non-ASCII content

A mismatch between declared and actual charset is the single most common cause of garbled email text. If an email header says charset=ISO-8859-1 but the body contains UTF-8 multi-byte sequences, each 2-3 byte UTF-8 character gets split into separate ISO-8859-1 characters, turning “é” into “Ã©”.
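The mismatch is easy to reproduce at the byte level. The sketch below assumes iconv and od are installed; hex, make_mojibake, and undo_mojibake are hypothetical helper names used only for this illustration.

```shell
# Hypothetical helpers. hex prints stdin as one continuous hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# Simulate the bug: treat UTF-8 bytes as ISO-8859-1 and convert to UTF-8.
make_mojibake() { iconv -f ISO-8859-1 -t UTF-8; }

# Undo it: convert back to ISO-8859-1, which yields the original UTF-8 bytes.
undo_mojibake() { iconv -f UTF-8 -t ISO-8859-1; }

printf '\303\251' | hex; echo                                   # c3a9
printf '\303\251' | make_mojibake | hex; echo                   # c383c2a9
printf '\303\251' | make_mojibake | undo_mojibake | hex; echo   # c3a9
```

The two bytes of "é" (C3 A9) are each read as a separate ISO-8859-1 character and re-encoded, giving the four bytes that display as "Ã©". The reverse conversion only works when every garbled character exists in ISO-8859-1; text that passed through windows-1252 or was mangled twice may need a different repair.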

Sanitize email content before feeding to LLMs

Email content piped to a language model should be stripped of invisible Unicode characters, bidirectional overrides, and typographic substitutions before processing. Byte-level tokenizers encode every byte of the input, so a 3-byte zero-width space (U+200B) consumes at least one token without contributing meaning. Smart quotes (U+201C, U+201D) and em dashes (U+2014) also tokenize differently from their ASCII equivalents, which can alter model output.

The Perl one-liner below applies seven substitutions in a single pass. It converts smart single and double quotes to ASCII, replaces em and en dashes with hyphens, normalizes non-breaking spaces, and removes zero-width characters and the bidirectional formatting characters U+202A through U+202E.

# Clean email body before sending to an LLM
nylas email read msg_abc123 --json | jq -r '.body' | \
  perl -CSD -pe '
    s/[\x{200B}-\x{200D}\x{FEFF}\x{2060}]//g;  # Remove zero-width chars
    s/[\x{202A}-\x{202E}]//g;                      # Remove bidi overrides
    s/\x{00A0}/ /g;                                  # Non-breaking space to space
    s/[\x{2018}\x{2019}]/\x27/g;                    # Smart single quotes to ASCII apostrophe
    s/[\x{201C}\x{201D}]/"/g;                       # Smart double quotes
    s/\x{2014}/--/g;                                  # Em dash to double hyphen
    s/\x{2013}/-/g;                                   # En dash to hyphen
  '

Quoted-printable vs base64: when MIME encoding breaks

MIME content-transfer-encoding mismatches are a common source of invisible corruption in email bodies. RFC 2045 defines five Content-Transfer-Encoding values (7bit, 8bit, binary, quoted-printable, and base64), and only the last two transform the data: quoted-printable represents non-ASCII bytes as =XX hex pairs (e.g., =C3=A9 for é), and base64 encodes the entire body as 7-bit ASCII. Quoted-printable turns each non-ASCII byte into three characters, tripling its size, while base64 adds roughly 33% overhead (plus line breaks) regardless of content.


# Check Content-Transfer-Encoding of an email
nylas email read msg_abc123 --json | jq '{
  content_type: .headers["content-type"],
  transfer_encoding: .headers["content-transfer-encoding"]
}'

# Decode quoted-printable manually
echo "=C3=A9" | python3 -c "import quopri,sys; sys.stdout.buffer.write(quopri.decodestring(sys.stdin.buffer.read()))"
# Output: é

# Decode base64
echo "w6k=" | base64 -d
# Output: é

# Common bug: email says quoted-printable but body has raw UTF-8
# Mitigation: drop any byte sequences that are not valid UTF-8
nylas email read msg_abc123 --json | jq -r '.body' | iconv -f UTF-8 -t UTF-8//IGNORE

The most common MIME bug occurs when an email declares Content-Transfer-Encoding: quoted-printable but the body contains raw UTF-8 bytes without =XX encoding applied. According to RFC 2045, quoted-printable requires that any byte outside the 33-126 range be encoded as =XX, along with the equals sign itself (61). The exceptions are tab (9) and space (32), which may appear literally except at the end of a line. When a mail server skips this step, downstream decoders may treat multi-byte UTF-8 sequences as separate single-byte characters, producing garbled output.

Preventing invisible character issues

Prevention requires validating encoding at every boundary where email data enters your pipeline. Unicode normalization form NFC (Canonical Decomposition followed by Canonical Composition, defined in Unicode Standard Annex #15) collapses equivalent character sequences into a single representation, eliminating one class of invisible mismatches. Combining NFC normalization with zero-width character stripping and charset validation catches over 95% of invisible character issues before they reach downstream processing.

Always use --json when processing email programmatically. The JSON output preserves the exact bytes without terminal rendering.
Validate encoding before processing. Check the Content-Type charset header and convert if needed with iconv.
Sanitize user input before composing emails. Strip zero-width characters and normalize Unicode (NFC form) before passing to nylas email send.
Use binary-safe comparisons in your scripts. Compare bytes, not rendered glyphs.
Log hex representations when debugging string mismatches. If two strings look the same but are not equal, xxd will show why.
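NFC normalization itself can be applied as a shell filter. The following is a sketch, assuming python3 is installed and the input is valid UTF-8; nfc and hex are hypothetical helper names. Note that NFC merges equivalent sequences but does not remove zero-width characters, so it complements the stripping steps rather than replacing them.

```shell
# Hypothetical helper: normalize stdin to NFC. Assumes python3 is
# installed and the input is valid UTF-8.
nfc() {
  python3 -c '
import sys, unicodedata
text = sys.stdin.buffer.read().decode("utf-8")
sys.stdout.buffer.write(unicodedata.normalize("NFC", text).encode("utf-8"))
'
}

# Hypothetical helper: print stdin as one continuous hex string.
hex() { od -An -tx1 | tr -d ' \n'; }

# "e" followed by a combining acute accent (U+0301) is 65 CC 81.
printf 'e\314\201' | hex; echo          # 65cc81
# After NFC it becomes the single precomposed character U+00E9.
printf 'e\314\201' | nfc | hex; echo    # c3a9
```

Both byte sequences render as "é", which is why two subjects can look identical and still fail a byte comparison until they are normalized to the same form.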

Frequently asked questions

Developers working with email data at the byte level encounter recurring questions about invisible character behavior. The answers below cover the 5 most common questions about invisible Unicode in email, based on patterns seen across Gmail, Outlook, Yahoo, and IMAP providers.

Why does copy-pasting from Gmail add invisible characters?

Gmail's web interface renders email as HTML. When you copy text, the browser includes formatting characters like non-breaking spaces (from &nbsp; entities), zero-width joiners (from CSS word-break handling), and smart quotes (from automatic substitution). These are hard to spot when pasted into a terminal or text editor but exist as bytes in the string.

Can invisible characters break email delivery?

In address headers (From, To), yes. If a recipient address contains a zero-width space, the SMTP server will typically reject it as malformed. In the Subject and the body, invisible characters are usually harmless for delivery but can break downstream processing.

How do I tell if a character is invisible vs just a rendering issue?

Pipe the string through xxd. If you see bytes between the visible characters that are not 20 (regular space) or 0a (newline), those are the invisible characters. Cross-reference the hex bytes against the Unicode table above.

Are invisible characters a security risk in email?

Yes. Right-to-left override (U+202E) can disguise malicious filenames: report[RLO]fdp.exe renders as reportexe.pdf. Zero-width characters can bypass content filters and spam detection. Homoglyph attacks use visually similar characters from different Unicode blocks to spoof addresses. Always sanitize filenames and validate sender addresses at the byte level.

Does Nylas CLI normalize Unicode automatically?

Nylas CLI passes through the exact bytes from the email provider. It does not normalize or strip invisible characters, which is the correct behavior -- you want to see what is actually in the email. Use the techniques in this guide to sanitize when needed.

Next steps

Secure email handling from the CLI -- GPG encryption, sender verification, and safe attachment handling
Send email from the terminal -- compose and send email with proper encoding
E2E email testing with Playwright -- verify email content in automated tests
Full command reference -- every flag, subcommand, and example
RFC 2045 — MIME Part One: Format of Internet Message Bodies -- the formal definition of Content-Transfer-Encoding, including quoted-printable
RFC 2047 — MIME Part Three: Message Header Extensions for Non-ASCII Text -- how Subject and From headers encode non-ASCII bytes
Unicode Technical Report #36 — Unicode Security Considerations -- right-to-left override, homoglyph spoofing, and confusable detection
Unicode Standard Annex #15 — Normalization Forms -- the formal definition of NFC/NFD/NFKC/NFKD used to sanitize input