Python Scraping: Scrapy and BeautifulSoup

When I search for solutions to my problems, I often search the internet for “compare and contrast” or analytical posts on the best tools for the job, which in turn help me make an informed decision.

Recently, my problem was scraping a website for data using Python. I searched online, and a lot of users recommended Scrapy over BeautifulSoup. Well, that was easy, I naively said. Scrapy probably is the better option for most people (it supports XPath right out of the box). As Scrapy’s docs put it:

comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.

But Scrapy didn’t sit well with my CentOS platform (or Google App Engine). For one, there were a whole lot of problems trying to install Scrapy in my virtualenv (an isolated Python environment) because of its dependency on libxml2/libxslt and their bindings. Examples:


etree.so "undefined symbol: libiconv"
Version 2.6.26 found. You need at least libxml2 2.6.27 for this version of libxslt
ImportError: /pyenv/test/lib/python2.6/site-packages/libxml2mod.so: undefined symbol: xmlTextReaderSetup
No module named libxml2
Failed to find headers. "update includes_dir"

Note: This may look overly dramatic. And maybe it is a little dramatic, because a lot of these errors do have solutions, and most of them can be found with a Google search.

I endlessly chased solutions while trying to integrate libxml2, the libxml2 Python bindings, libxslt and lxml in a virtualenv (with Python 2.6; note that CentOS/RHEL only have Python 2.4 in their repositories). I eventually grew tired of working out what was linking to which shared library and which piece was the missing culprit. So I figured, let me just give BeautifulSoup a try. I thought I’d spend the extra time learning the library that BeautifulSoup is, as opposed to learning the “framework” that Scrapy is.

In the end, BeautifulSoup was not that hard. It may be missing XPath support in its default setup, but I could easily rewrite the XPath expressions I had as equivalent lookups in BeautifulSoup syntax.
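To give an idea of what such a rewrite can look like, here is a minimal sketch. It assumes the current bs4 package (not necessarily the BeautifulSoup version I used back then), and the HTML snippet, the XPath expression and the item_links name are made up for illustration. The BeautifulSoup lookup is only a rough equivalent of the XPath, since class_ matches any element that carries the class.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, invented for this example.
HTML = """
<div class="item"><a href="/first">First</a></div>
<div class="item"><a href="/second">Second</a></div>
<div class="other"><a href="/skip">Skip</a></div>
"""


def item_links(html):
    """Rough equivalent of the XPath //div[@class="item"]/a/@href."""
    # "html.parser" ships with Python, so no lxml/libxml2 install is needed.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="item"):
        # recursive=False keeps the search to direct children, like "/a".
        for a in div.find_all("a", href=True, recursive=False):
            links.append(a["href"])
    return links


print(item_links(HTML))
```

The nested loops mirror the two steps of the XPath: first select the div elements, then their direct a children.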

Lesson: Don’t let your ego get in the way. Save time by going for fairly-efficient solutions that can be implemented in fairly-optimal time (as my Algorithms professor used to say).

Filezilla FTP Server: “filename invalid” fix OR vsFTPD: “Could not create file”

Filezilla FTP Server can give a cryptic “filename invalid” message. Example:

(000004)9/1/2010 2:17:25 AM - backups (X.X.X.X)> STOR /var/lib/mysql/test/items.MYD
(000004)9/1/2010 2:17:26 AM - backups (X.X.X.X)> 550 Filename invalid

Or, vsFTPD can give a similarly cryptic “could not create file” message. Example:

Sep 11 07:22:23 unknown ftp.info vsftpd[8035]: [backups] FTP command: Client "X.X.X.X", "STOR ./var/lib/mysql/test/items.MYD"
Sep 11 07:22:23 unknown ftp.info vsftpd[8035]: [backups] FTP response: Client "X.X.X.X", "553 Could not create file."

It means that the filename specified on upload cannot be created on the server, typically because it includes a directory path that is not valid there. The most common culprit is using the command-line ftp client and calling put with a full path, which makes the client ask the server to store the file under that same path:
put /var/lib/mysql/test/items.MYD

To fix this, change the local directory before uploading
lcd /var/lib/mysql/test/
and then call
put items.MYD
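If the upload is scripted rather than typed by hand, the same idea applies: send only the bare filename to the server. Here is a minimal sketch using Python’s standard ftplib; the upload function name is my own, and the ftp argument is assumed to be an already connected ftplib.FTP object.

```python
import os


def upload(ftp, local_path):
    """Upload local_path so the server only ever sees the bare filename."""
    # Strip the local directory part, e.g. "/var/lib/mysql/test/items.MYD"
    # becomes "items.MYD".
    remote = os.path.basename(local_path)
    with open(local_path, "rb") as handle:
        # storbinary sends the given STOR command and streams the file.
        ftp.storbinary("STOR " + remote, handle)
    return remote
```

Called with /var/lib/mysql/test/items.MYD, this would send STOR items.MYD instead of the full path that the server rejected in the logs above.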

Photoshop: Saved Gif Turns Red When Adding Text

Sometimes you search for something on Google, and the most trivial answers do not show up. So I figured I’ll post this here as a reminder to myself.

Does your background turn to a red tint when you add text to a GIF image saved as a PSD?
Well, do this to fix:

Open the .GIF in Photoshop. Convert the image to RGB (Image > Mode > RGB). GIFs use indexed color and need to be converted to RGB. You’ll then be able to add layers and edit, then save as .PSD or whatever format you choose.

Hackintosh doesn’t connect via ethernet

Does your Mac Ethernet/Wireless keep giving you a fake 169.254.x.x IP address?

Recently I started running Mac OS X Leopard (10.5.8) on my spare computer. It worked fine for a few days, until I changed my router and my Mac stopped connecting to it. It would repeatedly end up with a self-assigned 169.254.x.x IP address instead of getting one through DHCP. If I set the IP address manually, it would show the address as assigned, yet it would still not “properly” be connected in the background.

So here is what I did to fix this problem. Open up Terminal using Spotlight and type the command:

sudo ifconfig en0 ether 00:11:22:33:44:55

It will then ask you for your password. That’s it.

Explanation:

sudo: runs the command with administrative privileges

ifconfig: the program that configures your network interfaces

en0: the interface’s name; it could also be en1, en2, etc. (depending on the number of wired/wireless network cards you have)

ether: tells ifconfig to set the interface’s hardware (MAC) address

00:11:22:33:44:55: the MAC address; pick whatever hex combination you like (i.e. only use characters 0-9 and a-f)
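If you would rather not make up the hex digits by hand, a small script can generate an address for you. This is only a sketch: the random_mac name is mine, and marking the address as “locally administered” (and not multicast) is my own addition, not something the fix above depends on.

```python
import random


def random_mac():
    """Return a random MAC address such as '02:1a:2b:3c:4d:5e'."""
    octets = [random.randrange(256) for _ in range(6)]
    # Set the "locally administered" bit (0x02) and clear the multicast
    # bit (0x01) so the address cannot clash with a vendor-assigned one.
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{octet:02x}" for octet in octets)


print(random_mac())
```

The printed value can be used in place of 00:11:22:33:44:55 in the ifconfig command.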

Please share your experiences.

Helpful Linux Tips

Command line

Download all files with a given extension linked from an HTML page:

wget -r -t1 -N -np -A.mp3 http://google.com/music/audio/

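-r recursive
-t1 number of tries (1)
-N timestamping (skip files that are not newer than the local copy)
-np don’t ascend to the parent directory
-A.mp3 accept only files ending in .mp3

Not needed here, but related:
-l1 limit the recursion depth to 1
-nd don’t create directories
-H span across hosts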

To remove quotas, edit /etc/fstab and remove the grpquota,usrquota options,
then remount the filesystem, replacing /home with the relevant mount point:

mount -o remount /home

Convert a Unix timestamp to a readable format in Bash (GNU date):

date -d @1280565192
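The same conversion from a script, as a small sketch using Python’s standard datetime module. The readable name is my own. Note that date prints in your local timezone, while this version deliberately uses UTC so the result is the same on every machine.

```python
from datetime import datetime, timezone


def readable(timestamp):
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


print(readable(1280565192))  # 2010-07-31T08:33:12+00:00
```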

Kill multiple processes selected with grep (here, perl processes running as nobody):

kill -9 `ps aux | grep perl | grep nobody | awk '{print $2}'`

Xargs: handle spaces and punctuation in filenames properly (expects NUL-separated input, e.g. from find -print0):

xargs -0

Using ack and sed, replace a string in place in every PHP file that contains a search string:

sed -i 's/replacestring/replacedwiththis/g' `~/bin/ack --php "searchstring" -l`

Rename doesn’t support renaming with a dash/hyphen, so we must use this for-loop/mv hack:

for i in ./*foo*;do mv -- "$i" "${i//test test2/test - test2}";done

Find Command

COPY files changed less than 24 hours ago to /some/other/directory (-print0 and -0 keep filenames with spaces intact)

find . -type f -ctime -1 -print0 | xargs -0 -I {} cp {} /some/other/directory

MOVE files changed less than 24 hours ago to /some/other/directory

find . -type f -ctime -1 -print0 | xargs -0 -I {} mv {} /some/other/directory
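For cases where the selection logic grows beyond a one-liner, here is a rough Python equivalent of the copy command. It is a sketch under a few assumptions: the copy_recent_files name is mine, it compares modification time (closer to find’s -mtime than -ctime), and the destination is assumed to live outside the source tree.

```python
import shutil
import time
from pathlib import Path


def copy_recent_files(src, dest, max_age_seconds=24 * 60 * 60):
    """Copy files under src modified within max_age_seconds into dest."""
    src, dest = Path(src), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_seconds
    copied = []
    # rglob("*") walks src recursively, like find without -maxdepth.
    for path in src.rglob("*"):
        if path.is_file() and path.stat().st_mtime > cutoff:
            # copy2 also preserves timestamps; files land flat in dest,
            # just as with the xargs/cp version.
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return sorted(copied)
```

Swapping shutil.copy2 for shutil.move would give the MOVE variant.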

Scan files for certain text

find dir/ -name "*.txt" -exec grep -Hn "md5_func" {} \;

Find all directories and sub-directories that are empty.

find ./ -type d -empty

MySQL

Using mysql from the command line, here’s how to save results to an outfile (in interactive mode; the file is written on the database server’s host):

SELECT * INTO OUTFILE '/tmp/sql.out' FROM tablename WHERE condition = '1';

Using mysql from the command line, here’s how to save the result set to an outfile (using a SQL file):

mysql database -u username -p < batch.sql > sql.out