I’m doing some fairly hardcore screen scraping in Python, so I decided to use BeautifulSoup. After all:

“Beautiful Soup won’t choke if you give it bad markup.”

Oh yes it will:

<html>
 <body>
  <a href="/""></a>
 </body>
</html>
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 3, column 14

lxml parses this fine.
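For anyone who wants to check this themselves, here is a minimal sketch. It assumes Python 3 and a current third-party lxml install (the traceback above came from Python 2.6), and the variable names are mine, not from any library. It feeds the same broken snippet to lxml.html and lists the links it recovers.

```python
from lxml import html  # third-party: pip install lxml

# The malformed snippet from above: note the stray extra quote after href="/".
BAD_MARKUP = """<html>
 <body>
  <a href="/""></a>
 </body>
</html>"""

# lxml.html sits on top of libxml2's recovering HTML parser, so this returns
# a tree rather than raising on the broken start tag.
doc = html.fromstring(BAD_MARKUP)

# iter("a") walks every <a> element in document order.
links = [a.get("href") for a in doc.iter("a")]
print(links)
```

The stray quote is simply dropped during recovery; the link and its href survive.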

The other issue I’m seeing is the old document.write('<scr' + 'ipt>') trick. Even if it’s enclosed in a CDATA block, BeautifulSoup chokes on it.

lxml, again, parses it fine. And it has built-in CSS selector and XPath support.
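A rough sketch of both points, again assuming Python 3 and a current lxml. The page content, the URLs and the two helper function names are made up for illustration. One caveat I'm sure of: in recent lxml releases the CSS selector support relies on the separate cssselect package, so that part only works if it is installed.

```python
from lxml import html  # third-party: pip install lxml

# Hypothetical page: a script using the split-tag trick inside a CDATA
# wrapper, followed by two made-up links.
PAGE = """<html>
 <body>
  <script type="text/javascript">
  //<![CDATA[
  document.write('<scr' + 'ipt src="/ads.js"></scr' + 'ipt>');
  //]]>
  </script>
  <a class="nav" href="/">Home</a>
  <a class="external" href="http://example.com/">Elsewhere</a>
 </body>
</html>"""

doc = html.fromstring(PAGE)


def hrefs_by_xpath(tree):
    """Return every link target using an XPath attribute query."""
    return tree.xpath("//a/@href")


def hrefs_by_css(tree, selector="a.external"):
    """Return link targets of elements matching a CSS selector.

    Assumption: the separate `cssselect` package is installed; without it
    lxml raises ImportError here.
    """
    return [a.get("href") for a in tree.cssselect(selector)]
```

Calling hrefs_by_xpath(doc) gives the targets of both links, and hrefs_by_css(doc) narrows that to the one with class "external". The script block stays a single element and does not swallow the links that follow it.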
