Tim Gittos

I'm an Australian currently living in Austin, TX in the USA.

I currently earn a living programming, though I wouldn't call myself a programmer. If I had to attach a label to myself, I'd use the term autodidact.

I love learning, and my favorite things to learn about are programming, computer graphics, AI & machine learning, robotics, painting and creativity.

Parsing ASP.NET pages with SGMLParser

Last updated on 09 Mar 2009

I’m going to take a short break from my Ruby CMS series to post about something I encountered at work.

During my development of the CMS at work, I’ve had to parse HTML content in order to compile page content into tags. This involves replacing certain elements of a page with other elements. At first I tried to do this with regular expressions, but that didn’t turn out too well when it came to the inconsistencies in legacy sites running on the current, older CMS. Next, I tried parsing the pages as XDocuments, which worked OK as long as the pages were well-formed. As soon as it hit a malformed page, however, it died in a fiery burst of exceptions.

Then I looked at SGMLParser, and it seemed like my savior. It parses SGML, of which HTML is an application, auto-corrects malformed content, copes with the inconsistencies of real-life, in-use HTML, and lets me parse the result into an XDocument so I can manipulate nodes. However, even SGMLParser had its own problems: it didn’t handle ASP.NET server tags very well.

We provide “plug in” functionality to CMS sites by way of ASP.NET server controls. You build your functionality, stick the server control files into your site, drop the binary into the bin folder and insert the server control tags into your page. Fire it up, and it will appear and function. However, when I ran pages with server control tags through SGMLParser, it wouldn’t recognise the namespaces and would strip them out, completely breaking the tags.

This was unsatisfactory, as we were planning on using a similar sort of setup to do the same thing in the new CMS. So, I hit Google, looking for ways to get around this. There’s not a lot of information out there about it, other than “SGML parses SGML, not ASP.NET, which isn’t valid SGML”, which was exceedingly unhelpful, because I already knew that.

I peeked into the source of SGMLReader to see if maybe I could remove the functionality that stripped out unknown namespaces. Sure, this would probably defeat the purpose of using SGMLParser, but it would help me out in this specific project. Unfortunately, the codebase is rather complicated and I really didn’t want to dig in and spend hours on something I wasn’t sure was even going to work.

I saw that the project has a custom HTML DTD, so I briefly considered writing a DTD for ASP.NET. However, once I started thinking about all the different permutations of server tags, especially since you can specify your own namespaces, tag names and attributes, I quickly decided that wasn’t suitable either.

In the end, and I believe this is honestly the only way to do it, I ended up pre-processing my content, wrapping ASP.NET server tags in <![CDATA[ ]]> sections. I noticed that SGMLParser was wrapping my older <% %> ASP style tags in this fashion, and leaving them untouched. After I parse the content into the XDocument, do my manipulations and pull it back out as the desired content, I run a post-process over it, removing the <![CDATA[ ]]> wrappers that were added in the pre-process. Afterwards, I’m left with my content as it went in, plus the modifications made by the compiling process.

Of course, I do the pre-processing and post-processing with a regular expression:
[code language="html"]
</?[\w]+:[^>]*/?>
[/code] 
which I think is kind of funny, in a way. I can never seem to escape regular expressions.
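
To make the pre-process and post-process steps a bit more concrete, here is a minimal sketch of the round trip. It is written in Python purely for illustration; the actual work code was .NET and isn't shown in this post. The function names `protect` and `restore` and the second pattern `WRAPPED_TAG` are made up for this example; only the first regular expression comes from above.

```python
import re

# The pattern from the post: any opening, closing, or self-closing tag
# whose name has a namespace prefix, e.g. <asp:Label ... /> or </my:Widget>.
SERVER_TAG = re.compile(r"</?[\w]+:[^>]*/?>")

# Hypothetical helper pattern: a CDATA section that contains exactly one
# prefixed tag, with the tag itself captured in group 1.
WRAPPED_TAG = re.compile(r"<!\[CDATA\[(</?[\w]+:[^>]*/?>)\]\]>")


def protect(html: str) -> str:
    """Pre-process: wrap each prefixed tag in a CDATA section."""
    return SERVER_TAG.sub(lambda m: "<![CDATA[" + m.group(0) + "]]>", html)


def restore(html: str) -> str:
    """Post-process: remove only the CDATA wrappers that hold a prefixed tag."""
    return WRAPPED_TAG.sub(lambda m: m.group(1), html)


if __name__ == "__main__":
    page = '<div><asp:Label runat="server" ID="x" /><my:Widget>hi</my:Widget></div>'
    wrapped = protect(page)
    print(wrapped)
    print(restore(wrapped) == page)
```

Between `protect` and `restore` is where the parsing and node manipulation would happen. One limitation worth knowing about: because the pattern stops at the first `>`, a server tag with a literal `>` inside an attribute value would be cut short, so this sketch assumes attribute values don't contain one.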