WordPress Breaks Copy-Paste

I’m obviously a fan of WordPress. I use it here and on several other sites, I have donated a good deal of my time to help it succeed, and I have contributed code and bug fixes in the past (though none recently). That said, the default ‘texturized’ output from WordPress gives me a few major headaches.

By default, WordPress will look through the content of your post and convert double and single quotation marks to “smart” quotation marks. Though it was a little unreliable at getting the quotes pointed the right way early on, it does a pretty darn good job now. However, my problem isn’t with the quotation marks pointing the wrong way; it’s with them being converted to smart quotes in the first place.

The main reason the smart quotes become a problem is that they break copy-paste. This isn’t entirely WordPress’s fault – special character handling on the web is notoriously bad. However, people copy-paste web content all the time, and (thanks to these smart quotes) if they copy content output by WordPress and paste it into something that doesn’t properly handle those special characters, there is a good chance the result is broken.

Here is what I consider a fairly common scenario: Joe and Cathy both have WordPress blogs. Joe posts an entry that strikes a chord with Cathy. Instead of adding a comment, Cathy decides to write her own entry and send a trackback. In her entry, she copies a section of Joe’s entry to use as a blockquote. Cathy pastes the text into the WordPress edit interface and posts her entry. Now here is where the fun begins.

  1. Cathy probably doesn’t realize it, but those smart quotes and other special characters she pasted into her blog entry have been turned into ‘?’s. [1]
  2. Cathy’s syndication feeds are now invalid because they contain an invalid character. She may not notice, but any of her readers who use a strict-parsing feed reader are no longer receiving her content. [2]
  3. If Cathy isn’t aware of the underlying problem, she really doesn’t know what needs fixing. It’s never good to put your users in a position of helplessness.
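
To make item 1 concrete, here is a minimal Python sketch of how curly quotes can end up as ‘?’s. It assumes the pasted text gets forced into ISO-8859-1, which has no curly quotes or ellipsis character. That destination encoding is my assumption for illustration; what actually happens depends on the browser and on how the blog and its database are configured.

```python
# Text as it might be copied from a texturized post:
# curly double quotes (U+201C, U+201D) and an ellipsis (U+2026).
copied = "\u201cNice post\u201d she said\u2026"

# Assumed destination encoding: ISO-8859-1 cannot represent any of the
# three special characters, so a lossy conversion substitutes '?' for each.
stored = copied.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

print(stored)  # prints: ?Nice post? she said?
```

The damage is silent and one-way: once the ‘?’ has been stored, there is no way to tell which character it used to be.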

This is also a problem with those handy little JavaScript bookmarklets for posting web entries. By default, WordPress uses the &raquo; entity (», like two little greater-than signs) as the separator between elements in the page title. [3] Most likely (I certainly haven’t seen every bookmarklet out there), your bookmarklet inserts the special character into your edit field rather than the entity representation. If you don’t correct this and post your entry, you get the same problems listed above. I run into this myself about 4-5 times a week with my “Around the web” posts.
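
One way to sidestep this is to turn every non-ASCII character back into a numeric character reference before the text is saved. The sketch below is in Python purely for illustration; the function name `to_ascii_entities` and the sample title are hypothetical, and this is not what WordPress or any particular bookmarklet does.

```python
def to_ascii_entities(text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    # 'xmlcharrefreplace' is a standard Python error handler for encode().
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical page title as a bookmarklet might paste it: a raw U+00BB.
title = "My Blog \u00bb Some Post"

print(to_ascii_entities(title))  # prints: My Blog &#187; Some Post
```

Numeric references such as `&#187;` are plain ASCII, so they survive whatever encoding the text passes through. Note that this only handles non-ASCII characters; it does not escape `&`, `<` or `>`.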

So what is the solution? I think there are a few different things that can be done:

  • WordPress should stop converting quotes to smart quotes, three periods to ellipses, etc. No need to contribute to the problem. You can disable this behavior in WordPress on a post-by-post basis using my WP Unformatted plugin, or on a global basis using this plugin.
  • WordPress needs to get smarter about character encoding. This is hard, and I haven’t done enough research in the area to know the right way to fix it, but it is a failing of WordPress as a publishing platform that you can so easily publish invalid content. You shouldn’t be able to break your RSS feeds (make them invalid) just by hitting the publish button. :)
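
Neither idea needs much code. The Python sketch below shows both under stated assumptions: `straighten` maps a handful of typographic characters back to plain ASCII, and `feed_is_well_formed` refuses a feed whose bytes don’t parse as XML in the encoding it declares. Both function names, the character list and the sample feeds are hypothetical illustrations; this is not how WordPress or the plugins mentioned above work.

```python
import xml.etree.ElementTree as ET

# Typographic characters mapped to plain ASCII (an illustrative,
# incomplete list chosen for this sketch).
STRAIGHTEN = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
})


def straighten(text: str) -> str:
    """Undo smart punctuation so the text survives copy-paste."""
    return text.translate(STRAIGHTEN)


def feed_is_well_formed(feed_bytes: bytes) -> bool:
    """Return True if the bytes parse as XML in their declared encoding."""
    try:
        ET.fromstring(feed_bytes)
    except ET.ParseError:
        return False
    return True


header = b'<?xml version="1.0" encoding="UTF-8"?>'
# Curly quotes correctly encoded as UTF-8: parses fine.
good = header + "<rss><title>\u201chi\u201d</title></rss>".encode("utf-8")
# Curly quotes as raw Windows-1252 bytes in a feed that claims UTF-8.
bad = header + b"<rss><title>\x93hi\x94</title></rss>"

print(straighten("\u201cIt\u2019s fine\u2026\u201d"))
print(feed_is_well_formed(good), feed_is_well_formed(bad))
```

A check like the second function could run before the publish button takes effect, so the author hears about the bad character instead of her readers.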

Other suggestions/ideas? Perhaps there are existing solutions to this entire mess that I’m unaware of?

  1. In most browsers – all handle it a little differently.
  2. WordPress should do a better job encoding special characters so that they don’t break syndication feeds, but again encoding on the web is a real mess.
  3. What you see in your browser titlebar.