Richard JP Le Guen.ca

All about Software Development on the WWW

Parse and Validate Markup Server Side - Why Not, Microsoft? Why Not?

August 7th, 2010

I had a fun experience recently at work where I spent the better part of a day trying to figure out what was wrong with an aspx page I was deploying to a SharePoint pages library. My investigation into it has led me to reflect a little on XHTML, and forgiveness/fault-tolerance on the Web.

Simply put, I don't understand why Microsoft – with its XML fetish ever since .NET came along – didn't include support for valid XHTML in several of its products, including ASP.NET and Internet Explorer.

My Mistake

Can you spot the mistake I made? Here's a variation of the Page Layout where I made the error:

<%@ Page language="C#"
         Inherits="Microsoft.SharePoint.Publishing.PublishingLayoutPage, Microsoft.SharePoint.Publishing, Version=12.0.0.0,Culture=neutral, PublicKeyToken=71e9bce111e9429c" %>
<%@ Register Tagprefix="RLG" Namespace="Richard.JP.LeGuen" Assembly="Richard.LeGuen, Version=8.0.0.0, Culture=neutral, PublicKeyToken=XXX" %>
<asp:Content ContentPlaceholderID="PlaceHolderMain" runat="server">
	<RLG:MyCustomControl />	
</asp:Content>

I was debugging, attached to the IIS worker process, and no matter where I put a breakpoint in my MyCustomControl class, it wasn't loading. Somehow my custom control's code was not being executed; not even a constructor.

For anyone who's found the answer, I'm ashamed to say I needed help to see it and it took the better part of my day.

For anyone who hasn't found it, I'm missing a runat='server' attribute on the custom control. Usually you get a compile-time error when you write an aspx page and neglect a runat='server', but deploying to SharePoint meant I wasn't the one doing the compiling.
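A mistake like this can be caught before deployment with a small lint script. The following is a minimal sketch in Python, not a real ASPX parser: it assumes that every tag with a namespace prefix (such as asp: or RLG:) is meant to be a server control, and the name find_missing_runat is made up for this illustration.

```python
import re

# Opening tags with a namespace prefix, e.g. <asp:Label ...> or <RLG:MyCustomControl />.
# Regex-based, so this is a rough heuristic and not a real ASPX parser.
PREFIXED_TAG = re.compile(r"<([A-Za-z]\w*:[\w.]+)\b([^<>]*)>")
RUNAT_SERVER = re.compile(r"""\brunat\s*=\s*["']?server["']?""", re.IGNORECASE)


def find_missing_runat(markup):
    """Return the names of prefixed tags that lack a runat="server" attribute."""
    missing = []
    for match in PREFIXED_TAG.finditer(markup):
        name, attributes = match.group(1), match.group(2)
        if not RUNAT_SERVER.search(attributes):
            missing.append(name)
    return missing


page = (
    '<asp:Content ContentPlaceholderID="PlaceHolderMain" runat="server">\n'
    "\t<RLG:MyCustomControl />\n"
    "</asp:Content>"
)
print(find_missing_runat(page))
```

Run against the Page Layout above, the script reports RLG:MyCustomControl and nothing else, because the asp:Content tag does carry the attribute.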

Why The runat Attribute?

My first reaction was to wonder "Why? Why is this attribute required for web controls? Why isn't it 'sous-entendu'?" which led me to a handful of useful answers on Stack Overflow.

One of the answers is that Microsoft was holding out for the possibility of a runat="client", with in-browser (IE-only?) controls. This never came together.

Another answer cited a fellow named Mike Schinkel talking with another fellow named Talbott Crowell about why runat="server" is needed.

If [runat=client] was required for all client-side tags, the parser would need to parse all tags and strip out the [runat=client] part. Currently, if my guess is correct, the parser simply ignores all text (tags or no tags) unless it is a tag with the [runat=server] attribute or a "<%" prefix or ssi "<!-- #include
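The behaviour guessed at in that quote can be sketched in a few lines of Python. This is only an illustration of the guess, not a description of how the ASP.NET parser is actually implemented; the name split_page and the regular expression are assumptions made for the example.

```python
import re

# Chunks the guessed-at parser would care about; everything else is literal text.
SERVER_CHUNK = re.compile(
    r"<%.*?%>"                  # directive or code block
    r"|<!--\s*#include.*?-->"   # server-side include
    r"""|<[A-Za-z][^<>]*\brunat\s*=\s*["']?server["']?[^<>]*>""",  # tag with runat=server
    re.DOTALL | re.IGNORECASE,
)


def split_page(markup):
    """Split markup into ("server", text) and ("literal", text) chunks."""
    chunks = []
    position = 0
    for match in SERVER_CHUNK.finditer(markup):
        if match.start() > position:
            chunks.append(("literal", markup[position:match.start()]))
        chunks.append(("server", match.group(0)))
        position = match.end()
    if position < len(markup):
        chunks.append(("literal", markup[position:]))
    return chunks


sample = '<p>Hi <asp:Label runat="server" /></p><RLG:MyCustomControl />'
for kind, text in split_page(sample):
    print(kind, repr(text))
```

Under this model the custom control without runat="server" ends up inside a literal chunk, which would be consistent with its constructor never being called.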

This Implementation Of runat="server" Has Nurtured Bad Developer Habits

This is strange, this is confusing, and this is really too bad. The World Wide Web would have been a much better/more professional place if Microsoft had decided to parse all tags, because not parsing all tags allowed developers to write pages like this:

<%@ Page language="C#" %><!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
	"http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
	<title>Something wrong AspX</title>
 </head>
 <body>
  <form runat="server">
	<p>
		What's wrong with this page? <br>
		<asp:label id="message" runat="server" Text="There's something bad about this control" />
  </form>
 </body>
</html>

I don't like what I see here, because this document is a mix of XML syntax and HTML syntax, which is both a bad idea and the mark of an amateur. The web control has to have a closing /> while the document is HTML 4, so other empty tags don't have to close. Even some non-empty tags – like the <p> – don't need to close. You might think this is not important, but I disagree; this is encouraging and perpetuating bad habits on the part of web developers. What benefit comes of allowing lax syntax and format like this? Isn't that the kind of thing a .NET developer would criticize in some script-kiddie language like Perl?

This, as well as IE's continued lack of support for documents served as XHTML, is a strange irregularity for the all too often XML-obsessed Microsoft. It's too bad for the web at large though, because a little stricter behavior server-side could have mitigated the problem of Forgiveness by Default – at least in the context of markup – and that would have been the sort of development which could have made XHTML a success. It might have even helped "Fix the Web".

The XHTML Tragedy

Most web developers at this point think (often out loud) "Who cares? XHTML was lame; I hated the yellow screen of death" but there is no doubt in my mind that XHTML should have succeeded more than it did. Elliotte Rusty Harold provides some reasons why we should use it in his book Refactoring HTML which I agree with, but I like to spin the message along the lines of my favorite theme that the Web is not the same place it was in the 90s.

HTML was a great tool for the Web in the 90s; the Web was relatively new and was still being defined, but back then the publishing of information was more important than the consumption of information. People and businesses were a lot more focused on getting their content onto the Internet than on sifting through the content of the Internet.

Now times have changed. There is an excess of information, so much so that we're apparently at the end of the Information Age and in what some call the Attention Age; the focus isn't on publishing but on the consumption of information. Web-surfers subscribe to (consume) RSS feeds and follow (consume) Twitter Feeds and receive (consume) SMS reminders from their Google Calendars.

XML was specifically built to cater to the need to consume structured data; the W3C Recommendation on XML even states that one of its design goals is that "It shall be easy to write programs which process XML documents." Any developer with a basic understanding of string manipulation and recursion can write a basic XML parser in a fairly short amount of time. With the Attention Age focusing on consuming data so much more than publishing it, I will always feel that XHTML should have been the markup of the future.
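To give an idea of how little is needed, here is a rough sketch in Python of such a parser, using nothing but string searching and recursion. It is deliberately incomplete: attributes are skipped rather than parsed, and comments, CDATA sections, entities and declarations are not handled at all. The function name parse_element and the tuple-based node format are choices made for this example.

```python
def parse_element(text, pos=0):
    """Parse one element starting at text[pos].

    Returns (node, next_pos), where node is (tag_name, children) and each
    child is either a nested node or a string of character data.
    Raises ValueError if the markup is not well-formed.
    """
    if not text.startswith("<", pos):
        raise ValueError("expected '<' at position %d" % pos)
    close = text.index(">", pos)               # ValueError if the tag never ends
    tag_body = text[pos + 1:close]
    self_closing = tag_body.endswith("/")
    parts = tag_body.rstrip("/").split()
    if not parts:
        raise ValueError("empty tag at position %d" % pos)
    name = parts[0]                            # attributes, if any, are ignored
    pos = close + 1
    if self_closing:
        return (name, []), pos

    children = []
    while True:
        lt = text.index("<", pos)              # ValueError if the element never closes
        if lt > pos:
            children.append(text[pos:lt])      # character data before the next tag
        if text.startswith("</", lt):
            end = text.index(">", lt)
            closing_name = text[lt + 2:end].strip()
            if closing_name != name:
                raise ValueError("expected </%s>, found </%s>" % (name, closing_name))
            return (name, children), end + 1
        child, pos = parse_element(text, lt)   # recurse into the nested element
        children.append(child)


node, end = parse_element("<p>Hello <b>world</b><br/></p>")
print(node)
```

The strictness is the point: a mismatched or missing closing tag raises an error instead of forcing the parser to guess what the author meant.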

Even with HTML5 on the horizon, I find myself tempted to be devoted to XHTML. Soon Microsoft will support it in IE9 and once that hurdle is jumped, if Web developers started ensuring well-formedness before serving content I think we'd all benefit.

I know it won't happen – HTML5 is going to take over – but would it really be so hard to do? I mean, we already take precautions to sanitize our output against (among other things) XSS attacks… why not also sanitize output to ensure well-formedness?
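As a minimal sketch of what such a check could look like, the Python snippet below runs outgoing markup through the standard library's xml.etree.ElementTree parser and refuses anything that does not parse. The helper name is_well_formed is hypothetical, and a real page would need more care (for example, named entities such as &nbsp; are rejected unless they are declared).

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if markup parses as a single well-formed XML element."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<p>What's wrong with this page? <br/></p>"))
print(is_well_formed("<p>What's wrong with this page? <br></p>"))
```

The first fragment closes its empty br element and passes; the second uses the HTML-style unclosed br and fails, so it could be rejected or repaired before it ever reaches a browser.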

Actually, as it happens, my concerns are foolish as HTML5 will be compatible with XML formatting: see HTML 5, one vocabulary, two serializations.

As a little demo, visit my XHTML Sanitizer and submit some XHTML; it will be rendered in the page, but sanitized to remove any malicious markup (JavaScript, etc) and any ill-formed markup (unclosed tags, etc). Sure, the latter functionality can't be perfect in the end – sometimes content has to disappear to ensure well-formedness – but missing content sure beats this:

[Image: Parse-error]

It should be noted that writing a server-side formatter like this would be harder to do for HTML, as one cannot rely on simple well-formedness, and has to do some validation – which requires distinct rules and logic for different tags.
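To illustrate the kind of per-tag rules involved, here is a rough Python sketch built on the standard library's html.parser module. It is not how the XHTML Sanitizer works; the class and function names are invented for the example, and the two sets are partial lists of HTML 4 elements that are empty or whose end tag is optional.

```python
from html.parser import HTMLParser

# Partial lists, for illustration only.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}
OPTIONAL_END_TAG = {"p", "li", "dt", "dd", "option", "tr", "td", "th",
                    "thead", "tbody", "tfoot", "colgroup", "html", "head", "body"}


class UnclosedTagFinder(HTMLParser):
    """Tracks open elements, applying different rules to different tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:           # void elements never stay open
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        # Elements with an optional end tag are closed implicitly.
        while (self.open_tags and self.open_tags[-1] != tag
               and self.open_tags[-1] in OPTIONAL_END_TAG):
            self.open_tags.pop()
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append("unexpected </%s>" % tag)


def find_problems(markup):
    """Return a list of problems that cannot be excused by HTML's tag rules."""
    finder = UnclosedTagFinder()
    finder.feed(markup)
    finder.close()
    leftover = [tag for tag in finder.open_tags if tag not in OPTIONAL_END_TAG]
    return finder.problems + ["<%s> is never closed" % tag for tag in leftover]


print(find_problems("<form><p>What's wrong with this page? <br></form>"))
print(find_problems("<div><span>text</div>"))
```

The first fragment is acceptable HTML 4 even though neither the p nor the br is closed, while the second is reported as broken. With XML, none of these lookup tables would be needed.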


Content © 2008-2012 Richard Jean-Paul Le Guen