I’m obviously a fan of WordPress. I use it here and on several sites and have donated a good deal of my time to help it succeed and have also contributed code and bug fixes in the past (though none recently). That said, the default ‘texturized’ output from WordPress gives me a few major headaches.
By default, WordPress will look through the content of your post and convert double and single quotation marks to “smart” quotation marks. Though it was a little unreliable at getting the quotes pointed the right way early on, it does a pretty darn good job now. However, my problem isn’t with the quotation marks pointing the wrong way; it’s with them being converted to smart quotes in the first place.
The main reason that the smart quotes become a problem is that they break copy-paste. This isn’t entirely WordPress’s fault – special character handling on the web is notoriously bad. However, people copy-paste web content all the time, and (thanks to these smart quotes) if they copy content output by WordPress and paste it into something that doesn’t properly handle those special characters, there is a good chance the result is broken.
Here is what I consider a fairly common scenario: Joe and Cathy both have WordPress blogs. Joe posts an entry that strikes a chord with Cathy. Instead of adding a comment, Cathy decides to write her own entry and send a trackback. In her entry, she copies a section of Joe’s entry to use as a blockquote. Cathy pastes the text into the WordPress edit interface and posts her entry. Now here is where the fun begins.
- Cathy probably doesn’t realize it, but those smart quotes and other special characters she pasted into her blog entry have been turned into ‘?’s.
- Cathy’s syndication feeds are now invalid because they contain an invalid character. She may not notice, but any of her readers that use a strict parsing feed reader are no longer receiving her content.
- If Cathy isn’t aware of the underlying problem, she really doesn’t know what needs fixing. It’s never good to put your users in a position of helplessness.
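To make the ‘?’ symptom concrete, here is a minimal Python sketch (not WordPress code) of what a lossy charset conversion does to pasted smart punctuation. The sample sentence and variable names are made up for illustration, and I am assuming the conversion target is iso-8859-1, which has no curly quotes or ellipsis character.

```python
# Hypothetical sample text containing curly quotes, a curly apostrophe
# and an ellipsis, as it might arrive from a copy-paste.
pasted = "Joe said \u201cit\u2019s broken\u201d \u2026"

# ISO-8859-1 cannot represent these characters, so a lossy conversion
# substitutes '?' for each one it cannot encode.
lossy = pasted.encode("iso-8859-1", errors="replace").decode("iso-8859-1")
print(lossy)

# UTF-8 can represent all of them, so the text survives a round trip.
round_trip = pasted.encode("utf-8").decode("utf-8")
print(round_trip == pasted)
```

The point is only that the damage happens at the moment of conversion; once the ‘?’ is stored, the original character cannot be recovered.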
This is also a problem with those handy little JavaScript bookmarklets for posting web entries. By default, WordPress uses the &raquo; entity (», like two little greater-than signs) as a separator between elements in the page title. Most likely (I certainly haven’t seen every bookmarklet out there), your bookmarklet inserts the special character into your edit field rather than the entity representation. If you don’t correct this and post your entry, you get the same problems listed above. I run into this myself about 4-5 times a week with my “Around the web” posts.
So what is the solution? I think there are a few different things that can be done:
- WordPress should stop converting quotes to smart quotes, three periods to ellipses, etc. No need to contribute to the problem. You can disable this behavior in WordPress on a post-by-post basis using my WP Unformatted plugin, or on a global basis using this plugin.
- WordPress needs to get smarter about character encoding. This is hard, and I haven’t done enough research in the area to know the right way to fix it, but as a publishing platform it is a failing of WordPress that you are so easily able to publish invalid content. You shouldn’t be able to break the RSS feeds (make them invalid) just by hitting the publish button. 🙂
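One way to picture the second point is a check that runs before content goes out in a feed. The following is a rough Python sketch, not anything WordPress actually does; the function name is_well_formed_xml and the wrapping item element are my own assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(fragment):
    """Return True if the fragment parses as XML inside a wrapper element.

    Hypothetical helper: a strict feed reader would reject anything
    that fails this kind of parse.
    """
    try:
        ET.fromstring("<item>" + fragment + "</item>")
        return True
    except ET.ParseError:
        return False


# A named HTML entity is undefined in plain XML, so the parse fails.
print(is_well_formed_xml("Blog &raquo; Post"))
# The numeric form of the same character is fine.
print(is_well_formed_xml("Blog &#187; Post"))
```

A publishing tool could refuse to publish, or at least warn, when a check like this fails, instead of silently producing a broken feed.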
Other suggestions/ideas? Perhaps there are existing solutions to this entire mess that I’m unaware of?
Actually the problem has nothing to do with WP, it’s most likely related to the fact that you use iso-8859-1 instead of the standard UTF-8. Browsers have issues trying to figure out iso-8859-1 vs. windows encoding. UTF-8 copying and pasting is pretty easy: I copy/paste smart quotes and foreign languages and even weirder unicode stuff all the time with no problems. Anne van K. has some good writing on this.
The JS bookmarklet issue is separate, I think we can code around that though.
The encoding set in my options is UTF-8…
Personally, I would rather see formatting thrown out. I hate curly/smart quotes, ellipses, etc.
This especially sucks when posting code…or, rather, when other people post code (i.e. in comments). Totally borks everything.
But, Alex, I do have your plugin (for code posting), and Scott Reilly’s untexturize plugin for normal posts – which are both great, but the fact that I *need* them is a little disheartening.
Don’t get me wrong: I still love wordpress…there’s just some things I could do without, ya know? 🙂
There are some larger problems with all the default-reformatting-install-a-plugin-to-shut-it-off behavior. My main problem is that while the WP-entry screen LOOKS like a text editor for entering HTML/XML code — it functions more like a word-processor, doing behind-the-scenes re-coding that is difficult to turn off. This also destroys an important difference between machine-readable code and human-readable code.
I like the curly-quotes, I like the ellipses… but I hate the paragraph tags breaking my multi-line image tags, etc. I wish the user had a more selective control over these features — or that they weren’t on by default in the first place.
There is a sense that the software knows better than the publisher what posts should look like.
This is the only fret I have with WordPress. I’ve had this happen on numerous occasions, especially when copying and pasting into different applications which generate RSS feeds, then viewing it on Feedreader. I was always wondering what this was.
Alex: the encoding in your options is UTF-8 but the output on the HTML template is iso-8859-1. (Line 5 of this page source.) This disjunct is the cause of a whole load of problems for people who don’t know it. Changing the charset meta tag should be easier than converting the contents of your DB to utf-8. Give it a try with the formatting on and see if it works.
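The mismatch described here can be reproduced in a few lines. This is a small Python sketch under the assumption that the backend emits UTF-8 bytes while the reader trusts an iso-8859-1 label; the sample text is made up.

```python
text = "\u201cquoted\u201d"        # curly-quoted sample, 8 characters
utf8_bytes = text.encode("utf-8")  # what a UTF-8 backend would send

# A reader that trusts an iso-8859-1 label maps every byte to its own
# character, so each 3-byte curly quote becomes 3 unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")
print(len(text), len(misread))

# Decoding with the charset that was actually used restores the text.
print(utf8_bytes.decode("utf-8") == text)
```

Nothing is wrong with the stored bytes in this case; only the label is wrong, which is why fixing the declared charset can be enough.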
It isn’t the HTML that really bothers me, browsers are very forgiving. It’s the broken RSS feeds (some people use strict parsing feed readers) that are the main problem for me.
Yes, if you’re set to UTF-8 on the backend and, as OFJ pointed out, your template says iso, then you’ll have even more problems, some just from typing, nothing to do with copy and paste.
I appreciate all the concern, but I’ve never had a problem with the HTML display on my site (despite this problem). Now the RSS feeds on the other hand, those break nearly weekly.
We should definitely endeavor to try to cover all the bases, but it’s not a problem that’s limited to WP.
Look at Sam Ruby, for example — I think most people would consider him an expert and a stickler for compliance. But his feed often contains HTML entities that aren’t valid for XML (e.g. ). As a result, I can almost never view his feed in Thunderbird, because it has an unforgiving parser.
Personally, I have my own home-grown version of your Unformatted plugin, which turns off the wptexturize and wpautop filters for the_content. Then I enter all of my posts in valid xhtml by hand.
Which still doesn’t necessarily take care of all possible RSS problems. But I haven’t seen my feed go invalid in quite a while.
Patient: “Doctor, when I paste text from other sites into my blog, it hurts!”
Doctor: “Then don’t do that!”
😉
That’s why everyone should use the UTF-8 standard, as Matt pointed out. Plus, WordPress 1.5.1 comes with a nifty ent2ncr() function for stuff like this.
Dougal: I agree it is a larger problem than just a WP problem; that was something I tried to make clear in the original post.
Mathias: My feed is UTF-8, I have the ent2ncr and my feed breaks all the time due to the content in the Around the web posts.
I had a similar issue when reworking my scripting to generate a valid RSS feed. I am a fancypants and I like to use these entities in my posts, often in my title. I am not familiar with the inner workings of WordPress (I chose to frankenstein a far lesser-known cms), but it was easy enough to fix my issue with preg_replace() and a defined entity array. I am assuming that the aforementioned ent2ncr() does much the same thing. This might seem like a resource intensive method, but it only occurs at the time of publication in my CMS. I am sure the well-refined methodologies in WordPress outstrip my abilities, but it allows me the freedom to use UTF-8 at will and still output valid XML to the feed.
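The regex-plus-entity-table idea might look something like the sketch below. It is written in Python rather than PHP, it is not the commenter’s code, and it is not the actual implementation of ent2ncr(); the function name entities_to_ncr is hypothetical. It uses the HTML 4 entity table from Python’s standard library in place of a hand-defined array.

```python
import re
from html.entities import name2codepoint

# The five entities XML predefines; these are already valid in a feed.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_ncr(text):
    """Rewrite named HTML entities as numeric character references."""
    def repl(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(repl, text)


print(entities_to_ncr("A &raquo; B &amp; C &hellip;"))
```

Because the numeric form needs no entity declarations, the output stays valid in XML regardless of which named entities appeared in the post.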
Cory Doctorow made a similar complaint, right down to the RSS feed problems — in early 2003!
John Gruber of Daring Fireball (and the author of a MovableType plug-in to add typographically pleasing characters) wrote what I consider to be the definitive response to this complaint, titled “Short and Curlies” (http://daringfirebal[...]_and_curlies)
Personally, I love the automatic formatting, and have actually hacked in additional transformations to the Textile 2 WP plug-in.
I think the problem comes down to the meta tags in the default themes for WordPress. For example, when I use the meta tag <meta http-equiv="Content-Type" content="<?php bloginfo('html_type'); ?>; charset=<?php bloginfo('charset'); ?>"> in my WordPress theme header.php file, I get an error when running through the W3C validator that reads:
The character encoding specified in the HTTP header (utf-8) is different from the value in the <meta> element (iso-8859-1). I will use the value from the HTTP header (utf-8) for this validation.
I also see this when running http://binarybonsai.com/kubrick/ through it as well. I based my theme off the Kubrick theme packaged with WordPress, so this is unsurprising. The question is, is it a plug-in breaking it or is there something overriding the meta elsewhere in the code? I don’t explicitly set the meta anywhere but my header.php file so it’s picking it up from somewhere else.
WordPress wptexturize removal hack
I wrote a plugin to remove smart quotes.
I am having similar issues with copy and paste off of my blog. I don’t care about the rss feeds at this time…but how do I stop the PHP from outputting the smart quotes in the latest 2.2? Thanks.
Upon further digging I found this old plugin, wpuntexturize, which currently works with 2.2: http://www.coffee2co[...]untexturize/