I’m doing some fairly hardcore screenscraping using Python, so I decided to use BeautifulSoup. After all:
Oh yes it will:
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 3, column 14
lxml parses this fine.
The other issue I’m seeing is the old document.write('<scr' + 'ipt>') trick. Even if it’s enclosed in a CDATA block, BeautifulSoup chokes on it.
lxml, again, parses it fine. And it has built-in CSS selector and XPath support.
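For what it's worth, here is a minimal sketch of the lxml side. The markup is a made-up sample (a stray quote in a start tag plus the split-script trick inside a CDATA comment), not the page I was actually scraping, and the `post` class and `extract_links` helper are names I invented for illustration. It relies on lxml's HTML parser running in its default recovering mode.

```python
from lxml import html

# Hypothetical sample page, not the real one from this post:
# a stray quote in a start tag, plus the split <script> trick.
PAGE = """
<html>
<body>
<div class="post"><a href="/one" title="first"">One</a></div>
<script type="text/javascript">
//<![CDATA[
document.write('<scr' + 'ipt>');
//]]>
</script>
<div class="post"><a href="/two">Two</a></div>
</body>
</html>
"""


def extract_links(markup):
    """Return the href of every link inside a div with class "post"."""
    # lxml's HTML parser recovers from broken markup by default
    # instead of raising on the first malformed tag.
    tree = html.fromstring(markup)
    # XPath is built in; attribute results come back as strings.
    return tree.xpath('//div[@class="post"]/a/@href')


if __name__ == "__main__":
    print(extract_links(PAGE))
```

The same query could be written as a CSS selector with `tree.cssselect('div.post a')`, though newer lxml releases need the separate `cssselect` package installed for that, so I stuck to XPath here.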