I’m doing some fairly hardcore screenscraping using Python, so I decided to use BeautifulSoup. After all:
Oh yes it will:
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 3, column 14
lxml parses this fine.
The other issue I’m seeing is the old document.write('<scr' + 'ipt>') trick. Even if it’s enclosed in a CDATA block, BeautifulSoup chokes on it.
lxml, again, parses it fine. And it has built-in CSS selector and XPath support.
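For illustration, here is a minimal sketch of that in practice, assuming lxml is installed. The sample markup is made up for this example (it is not the page I was scraping): it contains a split-script document.write call inside a CDATA-wrapped script block, plus an unclosed div, and the links are pulled out with XPath.

```python
from lxml import html

# Hypothetical sample markup, invented for this sketch: a split-script
# document.write inside a CDATA-wrapped <script>, plus an unclosed <div>.
SAMPLE = """<html><body>
<div class="post"><a href="/one">One</a></div>
<script type="text/javascript">
//<![CDATA[
document.write('<scr' + 'ipt src="x.js"></scr' + 'ipt>');
//]]>
</script>
<div class="post"><a href="/two">Two</a>
</body></html>"""

# lxml.html uses libxml2's forgiving HTML parser, which recovers from
# broken markup instead of raising an exception.
doc = html.fromstring(SAMPLE)

# XPath support is built in: collect the href of every link in a "post" div.
links = doc.xpath('//div[@class="post"]/a/@href')
print(links)
```

The same query could be written as a CSS selector with doc.cssselect('div.post a'), though as far as I know recent lxml releases need the separate cssselect package installed for that.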