
Why Did LLMs Steal Our Em-Dashes?

The em-dash, or “—”, is a writing tool that allows for a clearer expression of complex thoughts, and AI seems to think so too. As articulate students, researchers, and other writers attempt to navigate this dashless existence, two questions arise. When are we getting them back? Will we ever?

Upon ChatGPT's release in 2022, I realized that I wrote like AI. My sentences were long, my writing patterns were predictable, and my use of em-dashes was frequent. Initially, I was not concerned: if models are being trained to write like me, I must be doing a good job, right?

Then, however, came the "AI detectors" used by teachers and reviewers. These AI detectors are machine learning tools designed to spot patterns associated with AI-generated text. They measure the predictability of a written work’s wording ("perplexity"), the variability of the sentence structure ("burstiness"), and other markers. Essentially, AIs are being used to spot content created by other AIs. At this point, I began to change my writing style. No more back-to-back 20+ word sentences. No more dash-filled phrases, semicolons, or groups of threes; I was not willing to risk being flagged.
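To make the two metrics above concrete, here is a minimal sketch in Python. It is not how any particular detector works: real tools score text with a neural language model, whereas this toy version assumes burstiness can be approximated by the variation in sentence length and perplexity by a simple word-count model. The function names (`burstiness`, `unigram_perplexity`) are illustrative only.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def sentence_lengths(text):
    """Split on ., !, ? and return the word count of each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Assumed proxy: standard deviation of sentence length divided by the mean."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return pstdev(lengths) / mean(lengths)


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`.

    Toy stand-in for a real language model: lower values mean the wording
    is more predictable given the reference text.
    """
    ref_words = reference.lower().split()
    counts = Counter(ref_words)
    vocab_size = len(counts) + 1  # one extra slot shared by all unseen words
    total = len(ref_words)
    words = text.lower().split()
    if not words:
        return float("inf")
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_prob / len(words))
```

Under these assumptions, a passage of evenly sized sentences scores a burstiness near zero, while a mix of very short and very long sentences scores higher. Likewise, wording that closely follows the reference text gets a lower perplexity than wording full of unseen words.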

Returning to the point of the em-dash (and other snubbed marks), there are two key reasons why they—along with words like "delve" and "underscores"—are so frequent in AI-generated writing: the training data, and the assessment of said data.

First, let's look at the data collection process for LLMs (large language models). Over 60% of the training data used in early models like GPT-3 came directly from web crawls, which collect publicly available text off the internet. After the data is collected, it is used to train models to predict language structures and patterns. Most LLMs are trained to predict the next token (a word or word fragment) in a sequence, internalizing patterns in grammar and style along the way. If a particular structure (like the em-dash) appears often enough and isn't adjusted later on, it can become a characteristic aspect of the model's output.
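The idea that a frequent pattern can end up as a default can be illustrated with a deliberately tiny example. The sketch below is a bigram counter, not a neural network, and the three-sentence corpus is invented for illustration. It only shows the principle: when the model is asked what comes next, the most common continuation in the training text wins.

```python
from collections import Counter, defaultdict

DASH = "\u2014"  # the em-dash character


def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    follows = defaultdict(Counter)
    tokens = corpus.split()
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    if token not in follows:
        return None
    return follows[token].most_common(1)[0][0]


# Hypothetical training text: two sentences use em-dashes, one uses commas.
corpus = (
    f"the result {DASH} as expected {DASH} was clear . "
    f"the result {DASH} again {DASH} was clear . "
    "the result , however , was mixed ."
)
model = train_bigram(corpus)
print(predict_next(model, "result"))
```

Because the em-dash follows "result" twice and the comma only once, the toy model picks the em-dash every time. Real LLMs sample from probabilities rather than always taking the top choice, but the same pull toward frequent patterns applies.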

As a result of this pattern-based learning—and the fact that these patterns aren't always corrected—models can take on specific stylistic habits that become hard-wired as their "instinct". As Medium writer Brent Csutoras demonstrated in his failed attempts to remove em-dashes from the results of ChatGPT, Claude, and other models, the em-dash has become embedded into the output style of today's LLMs.

To be clear, you are not imagining this em-dash overuse. According to Freeburg, an independent researcher, LLMs use em-dashes much more frequently than human writers, with GPT-4.1 having a 3.28x higher frequency in standard essays. Echoing Csutoras' conclusion, Freeburg found that em-dashes were almost entirely resistant to prompt manipulations and user restrictions.

Now, how did no one realize that AIs were learning to use em-dashes so frequently? Some journalists, including The Economist's Alex Hern, believe that the Africa-based regulation of chatbots' content is a key factor. African English uses words like "delve" much more frequently than the internet at large, which may affect the regulators' choices. However, the work of these moderators mostly ties to removing sexist, racist, and other harmful content, not directly altering the linguistic choices of the models.

Initially, I hypothesized that the explanation was tied to the datasets being used to train LLMs. However, I ran a small investigation comparing word frequencies in COCA (a text dataset of popular modern media; think Star Trek) and OpenWebText, a set which mimics AI training data. I found that while OpenWebText often "won out" in terms of frequency, the gap wasn't significant.

The em-dash frequency of OpenWebText was so high (1621.88 uses/million) that I had to remove it from this chart. I have no em-dash reference for COCA and am only drawing conclusions based on words.
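For readers who want to try a similar comparison, here is a rough sketch of how a "uses per million words" rate can be computed. It is not the script used for the investigation above; the tokenization rule and the function names are assumptions, and different word-counting rules will give slightly different rates.

```python
import re

EM_DASH = "\u2014"


def count_words(text):
    """Count runs of letters/digits, allowing internal apostrophes (e.g. "isn't")."""
    return len(re.findall(r"\w+(?:'\w+)*", text))


def mark_per_million(text, mark):
    """Occurrences of a punctuation mark per million words of text."""
    words = count_words(text)
    if words == 0:
        return 0.0
    return text.count(mark) / words * 1_000_000


def word_per_million(text, word):
    """Whole-word, case-insensitive occurrences of `word` per million words."""
    words = count_words(text)
    if words == 0:
        return 0.0
    hits = len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))
    return hits / words * 1_000_000


def frequency_ratio(rate_a, rate_b):
    """How many times more frequent rate_a is than rate_b."""
    return rate_a / rate_b
```

Running `mark_per_million(corpus_text, EM_DASH)` on two corpora and passing the results to `frequency_ratio` gives the kind of "x times as frequent" figure quoted for model output versus human writing.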

I then turned to another potential argument: implicit bias, or the internal perceptions and judgements of individuals. Before em-dashes rose back into fame, they were mostly used in prose and other writing spaces that encourage wide vocabularies and creative structuring. Many people didn't know what they were before LLMs began splicing them into their sentences, and given our more regular reliance on short-form content like texts and emails, they didn't need to. In contrast, LLM training involves longer-form content like essays and articles, where em-dashes are more common than in the average person's consumed media. Bias explains why em-dashes feel so out of place, but not why em-dashes are actually being used more frequently than normal.

The generally accepted hypothesis to explain this overuse ties back to LLMs’ training and reinforcement processes. As models learn to predict language patterns, they begin to use their learned patterns to do so. However, this isn’t the only factor determining which patterns get used more often. Models like Claude and ChatGPT have an additional goal with their responses: to provide users with clarity. Em-dashes, which allow for explanatory pauses and the breaking down of complex ideas, are an ideal tool for AIs. As such, LLMs are not only introduced to more em-dashes, but their training also reinforces their usage. This results in em-dashes appearing more frequently than in typical human writing.

So what does this mean in the long term? Personally, I believe that these models will soon reduce their use of em-dashes. Individuals are currently avoiding em-dashes and other AI "red flags", so their overall usage is decreasing. LLMs are trained to replicate the styles of human writers, and as LLMs get more frequent content updates, the decreased use of these writing tools should have an influence on their responses.

The only question is whether we, as writers, will ever go back. This fear of being "caught" has begun to overtake what writing once was: freedom of expression. There are now countless "AI spotting tricks", flagging everything from empty questions to the use of writing structures that people were once taught to use. To write "humanly", we have to write less creatively.


Lia Erisson is a second-year (U2) Computer Science & Economics student minoring in Physiology. She loves exploring the intersection of technology, wellbeing, and the human experience.

Part of the OSS mandate is to foster science communication and critical thinking in our students and the public. We hope you enjoy these pieces from our Student Contributors and welcome any feedback you may have!
