Thursday 18 March, 2010
I’m doing some fairly hardcore screen scraping using Python, so I decided to use BeautifulSoup. After all, its documentation promises:
"Beautiful Soup won't choke if you give it bad markup."
Oh yes it will:
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 3, column 14
lxml parses this fine.
The other issue I’m seeing is the old document.write('<scr' + 'ipt>') trick. Even if it’s enclosed in a CDATA block, BeautifulSoup chokes on it.
lxml, again, parses it fine. And it has built-in CSS selector and XPath support.
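For illustration, here is a minimal sketch of what that looks like with lxml. The markup is a made-up sample, not the page I was actually scraping, and the `extract_links` helper is a hypothetical name. It assumes lxml is installed and uses `lxml.html.fromstring` with its default recovering parser. Elements also have a `.cssselect()` method, but in recent lxml releases that relies on the separate cssselect package, so the sketch sticks to XPath.

```python
import lxml.html

# Hypothetical sample markup: unclosed <p> tags, plus the split
# script-tag trick inside a <script> element.
BAD_HTML = """
<html><body>
<div class="post"><p>First paragraph
<p>Second <a href="/next">next page</a>
<script type="text/javascript">
document.write('<scr' + 'ipt src="ad.js"></scr' + 'ipt>');
</script>
</div>
</body></html>
"""


def extract_links(markup):
    """Return the href of every <a> found inside <div class="post">."""
    # fromstring() uses lxml's lenient HTML parser, which tries to
    # recover from broken markup instead of raising on it.
    doc = lxml.html.fromstring(markup)
    return doc.xpath('//div[@class="post"]//a/@href')


links = extract_links(BAD_HTML)
print(links)
```

The XPath expression here plays the same role a CSS selector such as `div.post a` would, so which one to use is mostly a matter of taste.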