When I search for solutions to my problems, I often search the internet for “compare and contrast” or analytical posts on the best tools for the job, which in turn help me make an informed decision.
Recently, my problem was scraping a website for data using Python. I searched online, and a lot of users recommended Scrapy over BeautifulSoup. Well, that was easy, I naively thought. Scrapy is probably the better option for most people (it supports XPath right out of the box). As Scrapy’s docs put it:
comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.
But Scrapy didn’t sit well with my CentOS platform (or Google App Engine). For one, there were a whole lot of problems trying to install Scrapy in my virtualenv (an isolated Python environment) because of its dependency on libxml2/libxslt and their Python bindings. Examples:
etree.so "undefined symbol: libiconv"
Version 2.6.26 found. You need at least libxml2 2.6.27 for this version of libxslt
ImportError: /pyenv/test/lib/python2.6/site-packages/libxml2mod.so: undefined symbol: xmlTextReaderSetup
No module named libxml2
Failed to find headers. "update includes_dir"
Note: this may look overly dramatic. And maybe it is a little dramatic, because a lot of these errors do have solutions, and most of them can be found with a Google search.
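For the record, the usual way past most of these errors on CentOS is to install the development headers before building the Python bindings. A sketch, assuming root access and the stock yum repositories (exact package names may vary between releases — this is not the sequence from the original post):

```shell
# Install the C libraries plus their headers; lxml builds against these.
yum install libxml2 libxml2-devel libxslt libxslt-devel

# Then, inside the activated virtualenv, build the Python binding from source.
pip install lxml
```

If the headers are missing when lxml compiles, you get exactly the kind of "Failed to find headers" and undefined-symbol errors listed above.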
I endlessly chased solutions while trying to integrate libxml2, the libxml2 Python bindings, libxslt, and lxml in a virtualenv (with Python 2.6; note that the CentOS/RHEL repositories only carry Python 2.4). I eventually grew tired of tracking down what was linking to which shared library and which piece was the missing culprit. So I figured I’d just give BeautifulSoup a try. I’d rather spend the extra time learning the library that BeautifulSoup is than learning the “framework” that Scrapy is.
In the end, BeautifulSoup was not that hard. It may lack XPath support in its default setup, but I could easily rewrite the XPath expressions I had using BeautifulSoup’s own syntax.
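To illustrate what that translation looks like, here is a minimal sketch (the HTML and the class names are made-up examples, not from the original scraper) of an XPath expression and its BeautifulSoup equivalent:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="price"><span>$9.99</span></div>
  <div class="name"><span>Widget</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# XPath version:        //div[@class="price"]/span/text()
# BeautifulSoup version:
price = soup.find("div", {"class": "price"}).find("span").get_text()
print(price)  # $9.99
```

The mapping is mechanical for simple paths: each XPath step becomes a chained `find()` call, and `text()` becomes `get_text()`.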
Lesson: don’t let your ego get in the way. Save time by going for fairly efficient solutions that can be implemented in fairly optimal time (as my Algorithms professor used to say).
3 thoughts on “Python Scraping: Scrapy and BeautifulSoup”
Googled the same error message and arrived at your blog. 🙂 I’ve been trying to accomplish the same thing you did: getting Scrapy to install on CentOS 5. After hours of endless chasing, I arrived at the same error. After reading your blog, I think I’ll give BeautifulSoup a try. If I get another Linux distribution set up with a newer Python available, instead of setting up multiple Python environments, I may try Scrapy again.
I don’t think you got the message the Scrapy team was trying to convey. Scrapy is an application framework that requires an HTML parser (among many other requirements). They ship it with one implementation, but you are free to choose any HTML parser (like BeautifulSoup or XPath, etc.).
There is no comparison between BeautifulSoup and Scrapy. It’s like apples and oranges.
I have used both Scrapy and Beautiful Soup. Scrapy is recommended when you have to scrape a series of pages by crawling a site; it’s like a spider. Beautiful Soup, on the other hand, is for a quick script: get started and get your page parsed 🙂