Monthly Archives: April 2010

Congratulations Marc!

Time for some cheesy family stuff; as it turns out I have a brother – Marc Le Guen – who is pursuing an MBA at the John Molson School of Business.

It’s treating him well, and there’s lots of big news from him; he’s been elected VP Academic for the Commerce Grad students something something (the title’s kind of long) and is going to be a key organizer for the next JMSB International Case Competition.

As such, he was selected as one of the JMSB poster boys for the new batch of brochures this year… see page 16 (numbered as 13).

Marc Le Guen

SOEN287: Parsing XML In Perl

I don’t know if any students in the f2010 SOEN287 class actually read the main page of my web site, but a number of students have shown me their work on writing an XML file and parser[1] for their final project, and some of it worries me. The example that concerned me the most was the following XML session file:

The structure of the following XML session file is an anti-pattern

<hand>Ace of Hearts, 2 of Clubs, 8 of Diamonds, 7 of Hearts, King of Clubs</hand>
<hand>Ace of Spades, Ace of Clubs, 8 of Clubs, 8 of Spades, 7 of Diamonds</hand>
<deck>Ace of Hearts, 2 of Hearts, (…)</deck>

This is not correct, and – in my opinion – demonstrates a lack of understanding of XML. The three major problems I can identify are:

  1. There is no root node; in XML you always need a root element. When you were writing XHTML the root node was <html>.
  2. There are comma separated values nested in your elements for no good reason; if you’re using XML formatting, use XML formatting. If you’re using CSV, use CSV; don’t mix them up like this.
  3. There is no indicator in the XML as to which hand belongs to which player.
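
As a rough sketch of what addressing those three points could look like, the Perl below writes a session with a single root element, one element per card, and a name attribute on each player. The element and attribute names (session, player, hand, card, deck, name, rank, suit) and the player names are made up for illustration and are not a required format. The sketch also assumes the values contain no characters that would need XML escaping.

```perl
use strict;
use warnings;

# Hypothetical in-memory session; all names here are illustrative assumptions.
my %session = (
    players => [
        { name => 'Alice', hand => [ [ 'Ace', 'Hearts' ], [ '2', 'Clubs' ] ] },
        { name => 'Bob',   hand => [ [ 'Ace', 'Spades' ], [ '8', 'Clubs' ] ] },
    ],
    deck => [ [ 'King', 'Diamonds' ], [ '7', 'Hearts' ] ],
);

# One <card> element per card, instead of a comma separated string.
sub card_to_xml {
    my ($card, $indent) = @_;
    my ($rank, $suit) = @$card;
    return qq{$indent<card rank="$rank" suit="$suit"/>\n};
}

sub session_to_xml {
    my ($s) = @_;
    my $xml = "<session>\n";    # a single root element wraps everything
    for my $player (@{ $s->{players} }) {
        # the name attribute records which player owns this hand
        $xml .= qq{  <player name="$player->{name}">\n    <hand>\n};
        $xml .= card_to_xml($_, '      ') for @{ $player->{hand} };
        $xml .= "    </hand>\n  </player>\n";
    }
    $xml .= "  <deck>\n";
    $xml .= card_to_xml($_, '    ') for @{ $s->{deck} };
    $xml .= "  </deck>\n</session>\n";
    return $xml;
}

print session_to_xml(\%session);
```

Each hand is now nested inside the player it belongs to, and a parser never has to split a string on commas to find a card.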

Last Thursday I gave an additional tutorial on parsing XML using Perl. Although the technique I demonstrated made use of RegEx, you cannot parse XML (or XHTML) using regular expressions alone. The keyword in ‘regular expressions’ is ‘regular’: it is there to indicate that regular expressions are a tool to be used with Regular Languages. XML (and XHTML) are not Regular Languages; they are Context Free Languages.

As such you cannot use something like this to extract an element from XML:

# This is so wrong!!
if ($xml =~ m{<something>.+</something>}) {
	# it won't work!
}

… this is because XML elements can be nested, and your regular expression cannot ensure that you haven’t matched the wrong close tag </something>.

  Valid XML which will break your regex
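
The failure is easy to reproduce. The snippet below is a small illustration, not the code from the tutorial; the input string is a made-up example in which one <something> element is nested inside another and followed by a sibling. Neither the greedy nor the non-greedy pattern returns the real content of the first element.

```perl
use strict;
use warnings;

# Hypothetical input: one <something> nested in another, then a sibling.
my $xml = '<something>a<something>b</something>c</something>'
        . '<something>d</something>';

# The real content of the first element is: a<something>b</something>c

# Greedy .+ runs on to the LAST close tag in the input.
my ($greedy) = $xml =~ m{<something>(.+)</something>};

# Non-greedy .+? stops at the FIRST close tag, which closes the inner element.
my ($lazy) = $xml =~ m{<something>(.+?)</something>};

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
# greedy: a<something>b</something>c</something><something>d
# lazy:   a<something>b
```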

This is a mathematically proven reality of life; you cannot parse XML with regular expressions alone.
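
What a regular expression cannot do is remember how many elements are currently open. One way to see the difference is to use a regex only to find the next tag and to keep the nesting depth in an ordinary variable. The following is a minimal sketch of that idea, not the code from the tutorial: the subroutine name inner_content is hypothetical, and it assumes tags carry no attributes and that the input has no comments, CDATA sections or self-closing tags.

```perl
use strict;
use warnings;

# Return the content of the first balanced <$tag> element, or nothing if
# there is none. The regex only locates tags; $depth does the counting.
sub inner_content {
    my ($xml, $tag) = @_;
    my ($depth, $start) = (0, undef);
    # Visit every open or close tag with this name, in document order.
    while ($xml =~ m{<(/?)\Q$tag\E>}g) {
        my $is_close = $1 ne '';
        if (!$is_close) {
            # Content begins right after the outermost open tag.
            $start = pos($xml) if $depth == 0;
            $depth++;
        }
        else {
            next unless $depth;    # stray close tag before any open tag
            $depth--;
            if ($depth == 0) {
                my $end = $-[0];   # offset where this close tag starts
                return substr($xml, $start, $end - $start);
            }
        }
    }
    return;    # no balanced element found
}

# Hypothetical input with a nested element and a sibling.
my $doc = '<something>a<something>b</something>c</something>'
        . '<something>d</something>';
print inner_content($doc, 'something'), "\n";
# a<something>b</something>c
```

The counter is exactly the kind of memory a Regular Language recognizer does not have; a recursive parser gets the same effect from its call stack.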

The following is the code I demoed in the tutorial on Thursday, for those who look at this page. I have included only a screenshot to ensure that you still have to read and understand it a little. I have also omitted the parseCard subroutine so that you will have to generalize what you learned in the first two subroutines to make it work.



Good luck.

[1] For the record, they are writing their parsers manually because this is an introductory course – writing the parser is an exercise in using recursive algorithms.