On BeautifulSoup
Posted on March 18th, 2010 by Russ. Filed under Coding.
I’m doing some fairly hardcore screenscraping using Python, so I decided to use BeautifulSoup. After all:
Oh yes it will:
<html>
<body>
<a href="/""></a>
</body>
</html>
  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 3, column 14
lxml parses this fine.
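If you want to retry that snippet on a current interpreter, here is a rough sketch. It assumes Python 3, where the HTMLParser module from the traceback lives on as html.parser and, as far as I can tell, no longer raises on a stray quote like this one. TagCollector and collect_start_tags are names made up for this example, and nothing here touches BeautifulSoup or lxml.

```python
from html.parser import HTMLParser

# The malformed markup from the post: note the stray extra quote after href="/".
BAD_MARKUP = '<html> <body> <a href="/""></a> </body> </html>'


class TagCollector(HTMLParser):
    """Hypothetical helper: records the name of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored here; we only care whether the tag was recognised.
        self.start_tags.append(tag)


def collect_start_tags(markup):
    """Feed markup to the stdlib parser and return the start tags in order."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()  # flush anything still buffered
    return parser.start_tags


if __name__ == "__main__":
    print(collect_start_tags(BAD_MARKUP))
```

If the anchor shows up in the printed list, the parser got past the bad attribute instead of bailing out. What it makes of the extra quote is a separate question, and this sketch does not check it.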
The other issue I’m seeing is the old document.write('<scr' + 'ipt>') trick. Even if it’s enclosed in a CDATA block, BeautifulSoup chokes on it.
lxml, again, parses it fine. And it has built-in CSS selector and XPath support.
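To poke at the document.write case without either library, here is a minimal sketch using Python 3's standard-library html.parser. The script block is a made-up stand-in for the kind of page described above, not the real one, and ScriptCollector and parse_script are hypothetical names. html.parser treats the body of a script element as raw text, so the split tag should come through as data rather than as a second script tag.

```python
from html.parser import HTMLParser

# Assumed example markup: a script that writes a split "<script>" tag,
# wrapped in a commented-out CDATA block as many pages do.
SCRIPT_MARKUP = """<script type="text/javascript">
//<![CDATA[
document.write('<scr' + 'ipt>');
//]]>
</script>"""


class ScriptCollector(HTMLParser):
    """Hypothetical helper: records start tags and all text data."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Data may arrive in several pieces, so collect and join later.
        self.chunks.append(data)


def parse_script(markup):
    """Return (start tags, concatenated text) for the given markup."""
    parser = ScriptCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, "".join(parser.chunks)


if __name__ == "__main__":
    tags, text = parse_script(SCRIPT_MARKUP)
    print(tags)
    print(text)
```

Only one start tag should be reported, with the JavaScript source intact in the text. That is the behaviour you want from a scraper that has to survive this trick.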
March 24th, 2010 at 19:57
I think lxml is faster too, don’t you? :)
March 24th, 2010 at 20:10
Have you seen this? http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
August 10th, 2010 at 12:27
This post was a while ago now, but I had similar problems and ended up switching to Ruby, and I’ve not looked back since! The Nokogiri gem is a very powerful (XPath, CSS3 selectors, etc.) and relaxed parser. It’s also used by the mechanize gem, which lets you navigate sites programmatically, much as you would by hand. All in all, a screenscraper’s dream :)
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html