On BeautifulSoup

Posted on March 18th, 2010 by Russ. Filed under Coding.

I’m doing some fairly hardcore screenscraping using Python, so I decided to use BeautifulSoup. After all:

Beautiful Soup won’t choke if you give it bad markup

Oh yes it will:

<html>
 <body>
  <a href="/""></a>
 </body>
</html>

  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 3, column 14

lxml parses this fine.
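
For reference, here is a minimal sketch of feeding the same markup to lxml. It assumes the lxml package is installed; the names BAD_MARKUP and extract_links are just illustrative, not part of any library.

```python
from lxml import html

# The malformed markup from above: note the stray quote after href="/".
BAD_MARKUP = """<html>
 <body>
  <a href="/""></a>
 </body>
</html>"""


def extract_links(markup):
    """Parse possibly broken HTML and return all <a> elements."""
    # lxml.html sits on top of libxml2's recovering HTML parser, which
    # tolerates the stray quote instead of raising an exception.
    doc = html.fromstring(markup)
    return doc.xpath("//a")


links = extract_links(BAD_MARKUP)
print(len(links), [a.get("href") for a in links])
```

The point is only that the call returns a usable tree instead of raising; how the parser recovers from the stray quote is up to libxml2.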

The other issue I’m seeing is the old document.write('<scr' + 'ipt>') trick. Even if it’s enclosed in a CDATA block, BeautifulSoup chokes on it.

lxml, again, parses it fine. And it has built-in CSS selector and XPath support.
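
A rough sketch of both points, again assuming lxml is installed. The page below is a made-up example of the document.write trick inside a CDATA-wrapped script, not the actual page I was scraping. I stick to XPath here, since CSS selectors in recent lxml releases rely on the separate cssselect package.

```python
from lxml import html

# Hypothetical page: a script that writes a split-up <script> tag,
# wrapped in a commented-out CDATA block, followed by a normal link.
PAGE = """<html>
 <body>
  <script type="text/javascript">
  //<![CDATA[
  document.write('<scr' + 'ipt>');
  //]]>
  </script>
  <a href="/about">About</a>
 </body>
</html>"""

doc = html.fromstring(PAGE)

# The script body is kept as plain text, so the link after it is
# still reachable with an ordinary XPath query.
hrefs = doc.xpath("//body//a/@href")
script_count = len(doc.xpath("//script"))
print(hrefs, script_count)
```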


3 Responses to “On BeautifulSoup”

  1. Walter Cruz Says:

    I think that lxml is faster too, isn’t it? :)

  2. JP Says:

    This post was a while ago now, but I had similar problems and ended up switching to Ruby. I’ve not looked back since! The Nokogiri gem is a very powerful (XPath, CSS3 selectors, etc.) and relaxed parser. It’s also used by the mechanize gem, which lets you navigate sites programmatically, much as you’d do it yourself. All in all, a screenscraper’s dream :)

    http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html

