Amputate the W3C Validator

In a recent conversation about the usefulness of validation and the W3C’s Validation Service on Mike Davidson’s freshly launched Mike Industries (love that new IFR smell), Mike wished the validator could ignore certain errors introduced by legacy content management or ad serving software, a primary concern being unencoded ampersands.

So I went and dug up Nat Irons’s Amputator MT plugin. Based on his clever regular expression I cooked up a little PHP script that takes any URL and outputs the page with all ampersands properly encoded.

It doesn’t fix unquoted attributes or uppercase tags; I have no desire to cheat the validator, since that would only be cheating ourselves. All this does is filter out the noise of cascading errors caused by something that has absolutely no bearing on the visual display or usability of a page.

Before amputation, ESPN returns 335 errors. After amputation, only 56! Fifty-six is a far more manageable number of errors to sort through when looking for the root of a display problem.

To be clear, the goal of this script isn’t to undermine the value of validation or the efforts of those working on the validator, but to make our jobs easier when dealing with immovable obstacles like unskilled content managers and legacy code. This sort of preprocessor actually increases the value of the validator while at the same time reducing the bandwidth generated by its use.

Since I’m in a generous mood (and don’t need the bandwidth increase associated with hosting the only copy!) here’s the source code:

<!-- 
Please visit ShaunInman.com (http://www.shauninman.com/mentary/past/amputate_the_w3c_validator.php) for more information about this validator preprocessor.
Ampersand-encoding and clever name based entirely on Nat Irons's Amputator MT plugin: http://bumppo.net/projects/amputator/
-->
<?php if (!empty($HTTP_GET_VARS["uri"])) {
	// Encode any ampersand that isn't already the start of a named or
	// numeric character entity
	echo preg_replace("/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/i", "&amp;", file_get_contents($HTTP_GET_VARS["uri"]));
} else { ?>
Please visit <a href="http://www.shauninman.com/mentary/past/amputate_the_w3c_validator.php">ShaunInman.com</a> for more information about this validator preprocessor. 
<?php } ?>
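If you want to try it, just point the validator at the script instead of at the page itself. Assuming you saved the script as amputate.php at a hypothetical example.com, the address to validate would look something like this (the target URL is encoded so its own query string survives the trip):

http://example.com/amputate.php?uri=http%3A%2F%2Fwww.espn.com%2F

Drop that whole address into the validator’s address field and it checks the amputated output rather than the raw page.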

This script could be expanded to correct a host of other non-display/accessibility errors. A really useful revision would allow the user to determine the severity of errors to report: a level for true nesting errors, maybe another that catches cascading errors caused by single-tag elements missing their closing slash (see the rough sketch below). The possibilities are worth exploring. Something for another post perhaps?
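For the curious, here’s a rough, untested sketch of how that severity idea might look. Everything in it is hypothetical: the $filters array, the level numbers, and especially the single-tag pattern, which is far too naive for real-world markup.

<?php
// Hypothetical severity levels; each level silences one more class of noise
$filters = array(
	// Level 1: unencoded ampersands (all the current script fixes)
	1 => array('/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/i' => '&amp;'),
	// Level 2: a few single-tag elements missing their closing slash
	// (input, meta, link and friends omitted for brevity)
	2 => array('/<(br|hr|img)(\s[^>]*)?(?<!\/)>/i' => '<$1$2 />'),
);

$uri   = !empty($HTTP_GET_VARS['uri'])   ? $HTTP_GET_VARS['uri']         : '';
$level = !empty($HTTP_GET_VARS['level']) ? (int) $HTTP_GET_VARS['level'] : 1;

if ($uri) {
	$html = file_get_contents($uri);
	// Apply every filter at or below the requested severity level
	foreach ($filters as $l => $patterns) {
		if ($l > $level) { continue; }
		foreach ($patterns as $pattern => $replacement) {
			$html = preg_replace($pattern, $replacement, $html);
		}
	}
	echo $html;
}
?>

A request like amputate.php?uri=...&level=2 would then apply both filters, while the default level of 1 behaves like the original script.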

Author
Shaun Inman
Posted
June 16th, 2004 at 11:52 pm
Categories
PHP
Web
Comments
012 (Now closed)

012 Comments

001

Please, fix software instead of validators. Unencoded ampersands can even cause problems in older browsers like Netscape 4 (not that I care). More important is that it would crash every real XHTML page. ‘&’ is the start of an entity and therefore it will cause problems, same for ‘>’ and ‘<’.

Author
Anne
Posted
Jun 17th, 2004 4:59 am
002

Anne, you are right. The software that generates the invalid ampersands should be fixed, rather than tricking the validator. It is sort of the same thing as using JavaScript to include Flash in a site. The site validates, but only because the validator doesn’t see the embed tag written by the JavaScript.

But what about when the invalid code is produced by something beyond your control? In the case of ESPN (and countless other sites), it is an ad server.

Author
Jeremy Flint
Posted
Jun 17th, 2004 5:22 am
003

Anne, I know you’re intelligent. You can’t be missing the point so you must be ignoring it. CMS and WYSIWYG software already exists that is capable of properly encoding these entities. The problem lies in the fact that clients aren’t using it. They may not have the budget to upgrade, the time to learn how to utilize the improved functionality, or the attention span to understand why they should do either.

At that point, no matter how ideal your intentions, there’s still a job to be done. Say you’re updating a site, adding a widget to the homepage, and your addition isn’t displaying correctly, so you go to the validator. It spits out 280 extra errors resulting from unencoded entities that have no bearing on the display and that you don’t have the access or the budget to correct.

All the standards rhetoric in the world isn’t going to help you find and correct the cause of your display error any faster. Eliminating those false alarms will.

Author
Shaun Inman
Posted
Jun 17th, 2004 5:34 am
004

Hear, hear! Anne, Shaun is right. No one is suggesting that we permanently try to change the rules of the W3C validator here. Only that we make it more useful for what we can fix.

Shaun’s solution lets us find our low-hanging fruit without chopping the entire tree down.

Author
Mike D.
Posted
Jun 17th, 2004 8:50 am
005

So what do you think it would take to work with some of the major ad serving agencies to get them to encode the ampersands in their output?

Also, Mike, is there any way you could use some server-side scripting on your side to encode the ampersands that come from the ad server, making it output valid code?

Although your 335 errors is a lot, it’s not bad at all compared to the newspaper industry. Most newspapers don’t have doctypes or character sets defined, and when you set them and try to validate their pages, most get 400+ errors.

Author
Brian Paulson
Posted
Jun 17th, 2004 9:40 am
006

I would think that some JavaScript or PHP could be used to find all ‘&’ that are not encoded and replace them with encoded versions. That could put a load on the user and/or server, depending on how it is handled. With the type of traffic that ESPN gets, that could really add up quickly. 300 per page x 1 million pages a day (at least) = a lot.

But like the point Mike made on his site the other day, it doesn’t matter whether they are encoded or not. As long as the pages are being served as text/html and not application/xhtml+xml, does it really matter that they are not encoded (from the standpoint of working, not validating)?

Author
Jeremy Flint
Posted
Jun 17th, 2004 12:01 pm
007

The PHP/server-side solution is feasible but the JavaScript version would do no good as far as the validator is concerned.

Actually, on a site like ESPN.com where the PHP approach would be most useful (a massive site with multiple content managers and third-party code that gets an unfathomable number of hits a day) the server processing required would not be economical.

Author
Shaun Inman
Posted
Jun 17th, 2004 1:30 pm
008

Brian, there are a few newspaper sites that validate, including ours (www.nwanews.com), which validates XHTML 1.1 (the front page at least; some of the interior pages are broken by, you guessed it, unencoded entities coming from a CMS that I don’t have control over). The new version (due August 1) will be MUCH more attractive AND validate XHTML 1.0 Transitional (or Strict, I haven’t decided yet).

But, more to the point here, I appreciate the script as I often have unencoded ampersands in my data that clog up the works when I’m working on a design. Like someone said above, it’s easier to get the low-hanging fruit first then go after the ampersands and whatnot.

Author
Steven
Posted
Jun 17th, 2004 1:33 pm
009

Shaun,

That’s fantastic. It managed to get CNet News.com down from 1227 to 409 errors. I mean, that’s still one seriously broken site, but it certainly helps: Non-amputated vs. Amputated.

Oh, here’s my (still static and rather out of date) ValiDAQ. I really must finish my cron/scripted solution…

Author
Tim
Posted
Jun 25th, 2004 5:29 am
010

It’s sad that a site with the development brains and clout of ESPN doesn’t use both to start working on the ad agencies who are serving up the crud. You know, something like a 10% premium for non-valid code being served up by the 3rd party servers.

I really don’t get this whole “to heck with validation” movement you guys have going on here. Why are you setting yourselves up for a fall?

I live in Ontario, Canada. My province recently released additions/clarifications to the ODA (Ontarians With Disabilities Act), which makes web accessibility a mandated outcome for commercial sites as well. Rightly or wrongly, they use the WCAG checklist as their reference guide, and that checklist is unequivocal in its requirement for valid code. No amount of dancing around can overcome this fact.

Now ensuring 100% compliance with this requirement is pie-in-the-sky; I’m no dummy, I too do this for a living. But deliberately flouting the issue, thumbing your noses at it like your buddy Mike Davidson is so fond of doing, is only inviting the long lens of the law onto your site. And shrugging your shoulders and saying “oh well, not my problem”… c’mon!

If it is a personal site, have a party. But can ESPN really afford that kind of negative publicity? Do the suits in the front office know what the monkeys in the back are up to? Why has this become such a cause célèbre for you guys?

Good luck kids

Author
John Foliot
Posted
Oct 26th, 2004 3:36 pm
011

Hey pops, reread the comments, we’re on your side.

Author
Shaun Inman
Posted
Oct 26th, 2004 9:34 pm
012

John:

  1. The fact that you suggest throwing a 10% premium in the face of someone handing you millions of dollars shows me you have not worked in anything close to the commercial environment we are talking about here. I don’t even want to discuss this point any further because of how silly it is.

  2. There is no movement against validation itself. My personal movement, if I have one, is just against overzealous validation evangelism. Might I enjoy religion in the traditional sense yet not care for televangelists? I think so.

  3. With regards to the ODA you mentioned: The thing you have to understand is that as much as you want to believe we’re deliberately going in the wrong direction, that simply isn’t true. In fact, the opposite is true. With every redesign, we get closer and closer to validation and closer and closer to optimal accessibility. The team in New York just relaunched ABCNews.com a couple of weeks ago and error levels are pretty low right off the bat. The thing you have to let go of is the notion that everything is fixable instantly. For you, a development job might be coding from scratch a 300 page site built from 10 templates. For us, it might be retrofitting a 300,000 page site built from 150 templates coming from different sources and written at different times. When you’re dealing with this sort of quantity and flexibility, you make improvements gradually. In the end, sites should be judged on their continuing progress, and it is my opinion that we are leaders among our competition in this regard.

  4. Thanks for the monkey reference. My cause celebre is breaking barriers and making great websites. If you want to thumb your nose at that and create an imaginary war between sinners and saints, that’s your choice.

Author
Mike D.
Posted
Oct 26th, 2004 10:45 pm