Friday, January 07, 2005

Risks of lenient parsing

Wednesday I had a frustrating experience trying to help track down a problem in a post to a blog I subscribe to. It ended happily enough when we were able to locate a syntax error in the HTML of the post, but not before we had explored several blind alleys.

Before discussing the implications, I will first present some (lightly edited) excerpts from our correspondence, to set the stage.

-----

Me: R&D Funding post. A link near the end seems to be broken.

Peter: I'm looking at the post and trying to find the broken link without success. They all seem to be working for me. Maybe you caught it as I was rebuilding the page? If not, can you give me more context for the link?

Me: It's the link associated with the text "de-emphasizing long-term, fundamental research", and it still takes me to "http://www.cra.org/govaffairs/blog/archives/increasing", which still yields an "HTTP 404 - File not found" message.

Peter: Hmmm. I'm at a loss. I've had a few people on varied platforms test it without problem. Could it be that you're looking at a cached version?
The link that shows in my browser (and my test subjects) is: http://www.cra.org/govaffairs/images/trends_in_dod_snt.jpg
A mystery....

Me: I can follow the link in your email just fine, but not from the post.
Must be a poisoned cache (perhaps caught while you were rebuilding). CTRL-F5 is SUPPOSED to force a refresh, but I've had trouble with this on our corporate network before. I'll try it from home and see if I get the same results.

Me: It's even more weird than I thought, and not simple cache poisoning.
I did a View Source on the page, while it was exhibiting the problem. In the resulting HTML, the link is exactly what you say it is. However, when I mouse over the link, I still get what I told you.
To investigate further, I saved the HTML to my local disk and repeated the experiment. Same results, except that now it looks for and fails to find a local file ("increasing" in the same directory as the main page). I cleared the browser cache and refreshed. Same results. Out of superstition, I quoted the URL in the link. Same result.
This is truly bizarre, and, as nearly as I can tell, affects only this one link.

Me: Just looked at a larger fragment of the HTML, and the problem is obvious (note the "href=increasing"). Is this corrupted relative to what you have published?

As others have <a href= http://www.aaas.org/spp/rd/upd1104.htm> noted</a>, the bulk of that 44% increase has gone to the Defense Department, which is <a href=increasing it's support for more short-term, development-oriented research and <a href= http://www.cra.org/govaffairs/images/trends_in_dod_snt.jpg> de-emphasizing long-term, fundamental research</a>. <a href=http://www.cra.org/govaffairs/defense.php>Here's more</a> on CRA's concerns about DOD research. </p>

Peter: Aha. You've found the problem!
I had intended to add a link to that "increasing it's support for more short-term, development-oriented research" phrase but changed my mind. Unfortunately, as you discovered, I left a fragment of the "<a href=" tag in the text. Apparently Safari and the browsers my "testers" used didn't stumble on the fragment -- they just ignored everything between the tag fragment and the correctly formatted "<a href>" tag that starts with "de-emphasizing." But that must not be the case with all browsers.
I didn't catch it looking at the rendered HTML because the sentence still made sense without the "increasing it's support..." phrase.
Anyway, seems to work now with the tag fragment deleted. Does it work for you?

-----

So what is the lesson in this? There was clearly a syntax error, so what we got is what we deserved, right? I think not.

Given the frequency of errors in HTML, it would be unreasonable for renderers to refuse to display pages with errors. (I found the HTML bug in the first version of this post only with great difficulty.) However, we stumbled around blindly because none of the browsers we were using gave any hint that there was a syntax error on the page. Each just silently "corrected" the error. Unfortunately, but predictably, they didn't all "correct" it in the same way, so Peter and his testers got one result and I got another.
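
To make the divergence concrete, here is a toy sketch in Python. The two recovery strategies are hypothetical and written only for illustration; they are not the algorithms of Safari or any other real browser, and the fragment is a shortened copy of the one quoted above. The point is only that two equally plausible "corrections" of the same bad markup resolve the link to different targets.

```python
import re

# Shortened copy of the malformed fragment from the post.
FRAGMENT = (
    "which is <a href=increasing it's support for research and "
    "<a href= http://www.cra.org/govaffairs/images/trends_in_dod_snt.jpg> "
    "de-emphasizing long-term, fundamental research</a>."
)

# Matches an unquoted href value, allowing whitespace after the '='.
HREF = re.compile(r"href=\s*([^\s>]+)")


def first_href_wins(html):
    """Toy recovery A: the tag runs to the first '>', and the first href counts."""
    start = html.index("<a")
    end = html.index(">", start)
    return HREF.search(html[start:end]).group(1)


def restart_on_new_tag(html):
    """Toy recovery B: a '<' inside an open tag abandons it and starts over."""
    end = html.index(">")
    start = html.rindex("<a", 0, end)  # last '<a' before the closing '>'
    return HREF.search(html[start:end]).group(1)


if __name__ == "__main__":
    print("strategy A:", first_href_wins(FRAGMENT))
    print("strategy B:", restart_on_new_tag(FRAGMENT))
```

Strategy A resolves the link to the relative URL "increasing", which is the behavior I saw; strategy B resolves it to the intended image URL, which matches what Peter and his testers saw. Neither is "right", since the input is simply invalid.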

I contend that all of the browsers were wrong not to indicate clearly the existence of a syntax error. A friendly browser would even have made some attempt to indicate the approximate location on the page of the error.
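
Reporting a location need not be elaborate. As a minimal sketch, assuming nothing more than a character scan, the following Python function flags any "<" that appears before the previous tag has been closed with ">", and reports the line and column of the tag left open. The function name and the sample text are made up for this illustration. It is a heuristic, not a validator: it knows nothing about comments, scripts, or a "<" inside a quoted attribute value, any of which could trigger a false alarm.

```python
def find_unclosed_tags(html):
    """Return (line, column, snippet) for each tag that is still open
    when the next '<' arrives. Line and column are 1-based."""
    problems = []
    open_at = None  # index of the '<' of the tag currently open, if any
    for i, ch in enumerate(html):
        if ch == "<":
            if open_at is not None:
                # The earlier tag never saw its '>': report where it began.
                line = html.count("\n", 0, open_at) + 1
                col = open_at - (html.rfind("\n", 0, open_at) + 1) + 1
                problems.append((line, col, html[open_at:open_at + 30]))
            open_at = i
        elif ch == ">":
            open_at = None
    return problems


# Hypothetical three-line sample with the same kind of error as the post.
SAMPLE = (
    "<p>As others have <a href=x>noted</a>,\n"
    "which is <a href=increasing it's support and\n"
    "<a href=y>de-emphasizing</a>.</p>\n"
)

if __name__ == "__main__":
    for line, col, snippet in find_unclosed_tags(SAMPLE):
        print(f"line {line}, column {col}: tag never closed: {snippet!r}")
```

Even a warning this crude, surfaced somewhere unobtrusive, would have pointed us at the stray tag fragment on the first try.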

Although it was published more than thirty years ago, I think my advice on "What the Compiler Should Tell the User" (in Compiler Construction, an Advanced Course, F. L. Bauer and J. Eickel (eds.), Springer-Verlag, pp. 525–548, 1974) is still pertinent to those who build compilers and other formal language interpreters. Those who do not study the past are very likely not to learn its lessons, and therefore to repeat old mistakes.


5 Comments:

Comment by Anonymous:

Why didn't you try validating the page in question?

9:35 AM  
Comment by Anonymous:

I was thinking the very same thing while working through a problem with my RSS reader (which will refuse to display certain feeds) and RSS feeds generated by Wordpress 1.2.1 (which would output invalid HTTP headers).

Here's a similar story: http://weblogs.asp.net/oldnewthing/archive/2004/12/21/328759.aspx

5:03 PM  
Comment by Anonymous:

Hi Jim,

I followed the link you provided to comp.risks to find this entry. As someone who proofreads compulsively, I found it pretty amusing that you and Peter could spot the HTML syntax error, but you both missed the adjacent English syntax error: The "it's" following "increasing" should have been "its"!

Incidentally, the "preview" page provided by blogger.com before publishing a comment introduced some problems of its own. It didn't deal correctly with the phrase

I left a fragment of the "<a href=" tag in the text.

-- Everett Howe

9:06 PM  
Comment by Anonymous:

I've recently found the dillo web-browser - http://www.dillo.org/

As well as being ultra-fast, light-weight, and buzz-word compliant, it features a neat little bug counter. If a web page contains errors, a bug count appears in the status bar: clicking gives you a summary.

(It finds two bugs on your blogspot page)

Of course - I assume you also know about http://validator.w3.org/ - although I appreciate that you were making the point that such a service should be automatically provided by browsers.

8:21 AM  
Comment by andrew:

while i agree with you in principle, i think you need to remember that most people will never track down an html bug. browser users aren't (always) developers. for most users the webpage either works, or it doesn't. unless you're the developer there's not much more you can do than send off an email asking someone to fix it. with the amount of badly formed html out there, simple, silent corrections are probably the best choice.

i'd love to see a small indicator on the status bar showing a running count of parsing errors (much like IE does with scripting errors) but i'd still rather see an attempt at rendering the page.

if you find yourself doing much debugging, grab the web developer extension to firefox, it'll save you a ton of time.

1:45 PM  
