The Most Mysterious Bug in My Career
A medical software support engineer traces a recurring XML validation error to an invisible control character injected when doctors copy text from PDFs in Microsoft Edge.
Background
Our team develops medical software for the Australian healthcare system — specifically, an application for electronic patient referrals. The system automatically extracts data from PMS (Practice Management Systems), including patient information, medical history, and medications. When a referral is sent, it gets converted into HL7, CDA, or PDF format depending on the recipient's requirements.
My role sits somewhere between developer and support — I maintain the system, analyze errors, trace bugs, and document issues for the development team.
The Bug
Every 2–4 weeks, the same error would surface: "Illegal Character entity: expansion character (code 0x2) not a valid XML character." The standard fix was straightforward: run an SQL script that deleted all occurrences of the \u0002 character from the database. Problem solved — until it appeared again a few weeks later.
For a while, this was treated as routine maintenance. But I started asking questions that nobody had bothered with:
- What exactly is the 0x2 character?
- Where in the data does it appear?
- Why does it keep coming back on a regular cycle?
The Investigation
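0x2 (U+0002) is the "Start of Text" (STX) control character, a relic of the teletype era and part of the ASCII control character set. It was used in serial communications to mark the beginning of a message body. Today it survives mainly in low-level framing protocols and should never appear in user-entered text. It is also outside the set of characters that XML 1.0 permits: of the control characters below 0x20, only tab, line feed, and carriage return are allowed, which is why XML processing rejected it.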
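To make the XML rule concrete, here is a minimal Python sketch of the character ranges that XML 1.0 allows (the Char production in the specification). The function name is hypothetical and chosen for illustration; it is not taken from the system described in this article.

```python
def is_valid_xml10_char(ch: str) -> bool:
    """Return True if a single character is allowed in an XML 1.0 document."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # most of the Basic Multilingual Plane
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, U+FFFE, U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


if __name__ == "__main__":
    for ch in ["-", "\n", "\x02"]:
        print(f"U+{ord(ch):04X}", is_valid_xml10_char(ch))
```

STX (0x2) falls below 0x20 and is not one of the three permitted whitespace controls, so the check returns False for it.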
I traced the character to the referral letter field — a free-text area that doctors fill in manually when writing referral notes. The field contained the doctor's clinical commentary, observations, and recommendations for the receiving specialist.
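A scan along the following lines is one way to pinpoint such characters inside a free-text field. It is a sketch rather than the query actually used in the investigation: the function name, the context width, and the sample string are all assumptions made for illustration.

```python
import unicodedata


def find_control_chars(text: str, context: int = 10):
    """List (index, code point, surrounding text) for unexpected control chars."""
    hits = []
    for i, ch in enumerate(text):
        # Category "Cc" covers the C0/C1 control characters; ordinary
        # whitespace controls (tab, LF, CR) are expected in free text.
        if unicodedata.category(ch) == "Cc" and ch not in "\t\n\r":
            snippet = text[max(0, i - context): i + context + 1]
            hits.append((i, f"U+{ord(ch):04X}", repr(snippet)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: an STX where a line-end hyphen used to be.
    sample = "a general sense of well\x02\nbeing was reported"
    for hit in find_control_chars(sample):
        print(hit)
```

Printing the snippet with `repr` matters here, because the character is invisible in most editors and database tools.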
Examining the affected records, I noticed a pattern: the problematic referrals all contained text with hard line breaks (hard wraps). The text looked as though the doctor had pressed Enter at the end of every line rather than letting the text wrap naturally. This was unusual for hand-typed content: people don't typically hit Enter every 70–80 characters when typing in a text box.
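The hard-wrap pattern can also be checked mechanically. The heuristic below is a rough sketch under my own assumptions: the function name and every threshold (minimum line count, the 50 to 90 character band, the 80% ratio) are hypothetical and would need tuning against real data.

```python
def looks_hard_wrapped(text: str, min_lines: int = 4,
                       min_len: int = 50, max_len: int = 90,
                       ratio: float = 0.8) -> bool:
    """Guess whether text was pasted with a line break after every visual line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return False
    # The final line of a paragraph is often short, so it is ignored.
    body = lines[:-1]
    in_band = sum(min_len <= len(ln) <= max_len for ln in body)
    return in_band / len(body) >= ratio


if __name__ == "__main__":
    pasted = "\n".join(["x" * 72] * 5)
    typed = "Short note.\n\nSee attached results."
    print(looks_hard_wrapped(pasted), looks_hard_wrapped(typed))
```

Hand-typed text tends to produce a few long lines or short paragraphs, while pasted PDF text produces many lines of nearly equal length, which is what the length band is meant to catch.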
The Key Discovery
The doctors weren't typing this text from scratch. They were copying it from PDF copies of previous referrals stored in their PMS. When a referral is sent, a PDF copy is saved in the patient's record. Some doctors, when writing a follow-up referral, would open the previous PDF, select the text, copy it, and paste it into the new referral form.
This explained the hard line breaks: a PDF stores text as it is positioned on the page and, unlike a word processor, does not reflow it. When you copy text from a PDF, you get the text as it was visually laid out, with a line break at the end of each displayed line.
But the crucial detail was even more specific. I tested copying text from PDFs in different applications and browsers. When text contained a hyphenated word at the end of a line (where the word was split across two lines with a hyphen), different applications handled the hyphen differently during copy operations.
In Microsoft Edge, copying a hyphen that appeared at the end of a line in a PDF — specifically a soft hyphen used for word-wrapping — resulted in the clipboard containing the 0x2 (STX) character instead of a hyphen. Other PDF readers and browsers handled this correctly, producing either a standard hyphen or omitting the character entirely.
The Chain of Events
The full sequence that triggered the bug was:
- A doctor creates a referral containing a word with a hyphen — for example, "well-being"
- The referral is sent and a PDF copy is saved in the PMS
- Weeks later, the same or another doctor opens that PDF in Microsoft Edge to reference the previous referral
- They select and copy the text from the PDF
- The hyphen at the end of a wrapped line is converted to the 0x2 character in the clipboard
- The doctor pastes this text into the new referral form
- The invisible 0x2 character is now in the database
- When the new referral is sent and the system tries to generate XML output, the parser rejects the illegal character
The 2–4 week recurrence pattern made sense too — it corresponded to the typical interval between follow-up referrals for the same patient.
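The last step of the chain is easy to reproduce in isolation. The sketch below uses Python's standard-library XML parser, not the XML stack of the system described here, so the error message differs from the one quoted earlier; the helper name and the sample element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text: str):
    """Return the parser's error message, or None if the document is well-formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


if __name__ == "__main__":
    clean = "<note>well-being</note>"
    dirty = "<note>well\x02being</note>"   # STX where the hyphen used to be
    print(parse_error(clean))
    print(parse_error(dirty))
```

The clean document parses, while the one containing 0x2 is rejected outright. A conforming XML 1.0 parser has no way to accept the character, which is why the error resurfaced every time a contaminated referral was sent.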
The Fix
The original SQL cleanup script simply deleted the 0x2 character. This technically resolved the XML error but introduced a subtle problem: words like "well-being" became "wellbeing" — the hyphen disappeared entirely. For medical text, where compound terms and hyphenated drug names are common, this was not ideal.
I pushed for a proper fix: replace the 0x2 character with a standard hyphen instead of deleting it. The development team updated the sanitization routine to perform this replacement both in the cleanup script and as an input filter when text is saved to the database.
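The difference between the two approaches comes down to one argument. This is a minimal sketch, assuming the text is available as a plain string at save time; the function names are hypothetical and are not the names used in the real sanitization routine.

```python
STX = "\u0002"  # the control character Edge put on the clipboard


def strip_stx(text: str) -> str:
    """Old behavior: delete the character, silently joining the word halves."""
    return text.replace(STX, "")


def sanitize_referral_text(text: str) -> str:
    """New behavior: restore the hyphen that the STX character replaced."""
    return text.replace(STX, "-")


if __name__ == "__main__":
    pasted = "general well\u0002being"
    print(strip_stx(pasted))
    print(sanitize_referral_text(pasted))
```

Running the same replacement both in the cleanup script and at save time means existing rows get repaired and new contaminated pastes never reach the database.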
After deploying the updated version, the recurring error disappeared completely. A bug that had persisted for months, dismissed as routine maintenance, turned out to be a fascinating chain of software behaviors: PDF text rendering, browser clipboard handling, character encoding edge cases, and XML parsing strictness — all converging on a single invisible character.