Python Scraping: Scrapy and BeautifulSoup

When I search for solutions to my problems, I often search the internet for “compare and contrast” or analytical posts on the best tools for the job, which in turn help me make an informed decision.

Recently, my problem was scraping a website for data using Python. I searched online and a lot of users recommended Scrapy over BeautifulSoup. Well, that was easy, I naively said. Scrapy probably is the better option for most people (it supports XPath right out of the box). As Scrapy’s docs put it:

comparing BeautifulSoup (or lxml) to Scrapy is like comparing jinja2 to Django.

But Scrapy didn’t sit well with my CentOS platform (or Google App Engine). For one, there were a whole lot of problems trying to install Scrapy in my virtualenv (an isolated Python environment) because of its dependency on libxml2/libxslt and their bindings. Examples:
"undefined symbol: libiconv"
Version 2.6.26 found. You need at least libxml2 2.6.27 for this version of libxslt
ImportError: /pyenv/test/lib/python2.6/site-packages/ undefined symbol: xmlTextReaderSetup
No module named libxml2
Failed to find headers. "update includes_dir"

Note: This may look overly dramatic, and maybe it is a little, because a lot of these errors do have solutions. Most of them can be found with a Google search.

I endlessly chased solutions, trying to integrate libxml2, the libxml2 Python bindings, libxslt and lxml in a virtualenv (with Python 2.6; note that CentOS/RHEL only have Python 2.4 in their repositories). I eventually grew tired of working out what links to which shared library and which one is the missing culprit. So I figured, let me just give BeautifulSoup a try. I thought I’d rather spend the extra time learning the library that BeautifulSoup is than learning the “framework” that Scrapy is.

In the end, BeautifulSoup was not that hard. It may be missing XPath support in its default setup, but I could easily rewrite the XPath expressions I had using BeautifulSoup syntax.
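To give an idea of what rewriting an XPath looks like, here is a minimal sketch. It assumes the current bs4 package (not necessarily the BeautifulSoup version I used back then), and the HTML snippet, the XPath and the item_links name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, only here to have something to query.
HTML = """
<html><body>
  <div class="item"><a href="/one">One</a></div>
  <div class="item"><a href="/two">Two</a></div>
  <div class="other"><a href="/three">Three</a></div>
</body></html>
"""


def item_links(html):
    """Rough BeautifulSoup equivalent of the XPath //div[@class="item"]/a/@href."""
    # "html.parser" is the parser from the standard library, so no lxml is needed.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Note: class_="item" also matches divs with several classes, one of which
    # is "item", so it is slightly looser than the XPath test.
    for div in soup.find_all("div", class_="item"):
        # recursive=False restricts the search to direct children,
        # like the single "/" step in the XPath.
        for a in div.find_all("a", href=True, recursive=False):
            links.append(a["href"])
    return links


print(item_links(HTML))
```

It is more verbose than the XPath, but there is nothing to compile or link, which was the whole point for me.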

Lesson: Don’t let your ego get into it. Save time by going for fairly-efficient solutions that can be implemented in fairly-optimal time (as my Algorithms professor used to say).

Convert Django MySQL Database Tables to Unicode

When I created a Django application, I hadn’t noticed that my MySQL defaulted to the latin1 character set (probably from Virtualmin’s or CentOS’s default MySQL values). I didn’t want to delete my current project and start again, so here are the commands to convert a database to Unicode:

for the database


on each table do

ALTER TABLE djangotablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci
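
With more than a handful of tables, typing these by hand gets tedious. Here is a small sketch that only builds the statements as strings so you can review them before running them. The function name, the database name and the table list are hypothetical, and the database-level ALTER DATABASE line is my assumption of the usual MySQL form, since that command is not shown above.

```python
def utf8_conversion_sql(database, tables,
                        charset="utf8", collation="utf8_general_ci"):
    """Return the ALTER statements needed to convert a database and its tables."""
    statements = [
        # Assumed database-level statement: changes the default for new tables.
        "ALTER DATABASE `%s` CHARACTER SET %s COLLATE %s;"
        % (database, charset, collation)
    ]
    for table in tables:
        # Same per-table statement as above: converts existing columns and data.
        statements.append(
            "ALTER TABLE `%s` CONVERT TO CHARACTER SET %s COLLATE %s;"
            % (table, charset, collation)
        )
    return statements


# Hypothetical database and table names.
for sql in utf8_conversion_sql("mydjangodb", ["auth_user", "django_session"]):
    print(sql)
```

Back up the database first; the script does not touch MySQL itself, it only prints SQL.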

UnicodeDecodeError (unexpected code byte) on a template

I started receiving this error after I pasted some template code from a WordPress blog (it could also happen with text from any word processor, such as Microsoft Word). The solution was to look through my code, hunt down the following characters and replace them with their plain equivalents:
” ‘ ’
Replaced with (respectively):
" ' '
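
If there are many templates, the hunting can be scripted. This is a minimal sketch in Python 3; the straighten_quotes name is mine, and I also included the left double quote, which is not in the list above but usually comes along with the others.

```python
# Map each "smart" quote to its plain ASCII equivalent.
SMART_QUOTES = {
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
}


def straighten_quotes(text):
    """Replace curly quotes in text with plain ASCII quotes."""
    return text.translate(str.maketrans(SMART_QUOTES))


# Hypothetical template line pasted from a blog.
pasted = "{% if name == \u201cadmin\u201d %}it\u2019s you{% endif %}"
print(straighten_quotes(pasted))
```

Read the template file as UTF-8, run it through the function and write it back; everything else in the text is left untouched.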

Django: “Error importing authentication backend”

This is probably a very rare error to encounter in Django. But I think I should share it here, as it could save about an hour for anybody else who has this problem.

Exception Type: ImproperlyConfigured at /
Error importing authentication backend

Probable Cause
I was very desperate to change the name of an app inside my Django project. I renamed the folder and every mention of the application name I could find in the code and the database tables (please note: this is not recommended; there is probably a better way to do this). Once I hit the error, with no clear indication of where I was going wrong, I looked everywhere in the code and the database. After going into panic mode, I desperately tried changing and removing anything that might break. In the end, I ran out of places to look for the application name, but the error still persisted.

I then noticed, after looking at my cookies, that I still had cookies from my session, which meant that every time I connected to the server, I was passing along my “delicious” cookies. But just deleting your own cookies won’t do it. The user’s session object was cached in the database, in the table “django_session”. In particular, it stores the entry from “AUTHENTICATION_BACKENDS” that was last used. So, truncate the table with TRUNCATE TABLE django_session to finally get rid of this nasty problem.

DholCutz Bhangra Radio Android App v1

DholCutz Bhangra Radio Android App is out now! Get it from the Android Market.

The Story: I had planned to do a DholCutz Bhangra Radio app about 4-5 months ago, but I never got the time because my initial plan was to do one big app. The demand for the app turned out to be so high that I dropped that plan and programmed a quick version for now. The Android Market approval process (1 day) is the best, light years better than the iPhone App Store’s (3 weeks).

The App: I decided to price this app at $1 USD, which might change when I release a much, much better version that allows requesting songs from inside the app. Until then, this will do for Punjabi music fans. The app plays music from the radio and shows the currently playing song.

The Screenshots:

Thank you.

MySQL trigger error: Explicit or implicit commit is not allowed in stored function or trigger.

I haven’t seen this documented in the MySQL docs, so I’ll share this little hidden nuisance. When creating a trigger, MySQL throws the following error: Explicit or implicit commit is not allowed in stored function or trigger.

What does it mean?
The code inside the trigger is doing a commit. Looking at the code for the trigger alone is not enough; you must check every procedure/function it calls, because the code inside any of those calls could be (part of) the problem.

The usual problem is that we look through our code and find no “explicit” commits. That means an implicit commit is happening somewhere, and it is hard to pinpoint where unless you know this one hidden fact:

Depending on version and storage engine, TRUNCATE can cause the table to be dropped and recreated. This provides a much more efficient way of deleting all rows from a table, but it does perform an implicit COMMIT. You might want to use DELETE instead of TRUNCATE.

So use DELETE FROM tablename rather than TRUNCATE TABLE tablename.
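
A rough way to hunt for the culprit is to scan the body of the trigger, and of every routine it calls, for statements that MySQL treats as implicit commits. The sketch below is a hypothetical Python helper: find_implicit_commits and its pattern list are my own names, the list is deliberately not exhaustive, and the sample trigger is made up.

```python
import re

# A few statement types that cause an implicit commit in MySQL (not exhaustive).
IMPLICIT_COMMIT_PATTERNS = [
    r"\bTRUNCATE\b",
    r"\bALTER\s+TABLE\b",
    r"\bCREATE\s+(TABLE|INDEX)\b",
    r"\bDROP\s+(TABLE|INDEX)\b",
    r"\bRENAME\s+TABLE\b",
    r"\bLOCK\s+TABLES\b",
]


def find_implicit_commits(sql_body):
    """Return (line number, line) pairs that look like implicit-commit statements."""
    hits = []
    for lineno, line in enumerate(sql_body.splitlines(), start=1):
        for pattern in IMPLICIT_COMMIT_PATTERNS:
            if re.search(pattern, line, re.IGNORECASE):
                hits.append((lineno, line.strip()))
                break  # one report per line is enough
    return hits


# Hypothetical trigger body.
sample = """CREATE TRIGGER trg AFTER INSERT ON orders FOR EACH ROW
BEGIN
  TRUNCATE TABLE order_cache;
  DELETE FROM order_log;
END"""

for lineno, line in find_implicit_commits(sample):
    print(lineno, line)
```

It is a plain text search, so it will also flag matches inside comments or strings; treat the output as a list of places to look, not as proof.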

Avoid for (var x in array) when using jQuery/PrototypeJS

I was looking to make my code more readable by “cleverly” using for (var x in array) loops instead of for (var x=0; x < array.length; x++), even though the shorter for loops are meant for objects, not arrays.

Turns out that jQuery/PrototypeJS can add extra properties to arrays, and a for (var x in array) loop iterates over those too, not just the numeric indices. If for whatever reason you are not using a JavaScript framework/library, you can use the cross-browser shortcut: for (var x in array). However, be warned that JavaScript experts are particularly annoyed by it.

ColdFusion 9 to 9.0.1 Update Error: “Variable ENABLEIMPLICITUDFREGISTRATION is undefined.” or Data Source page blank

If you see the following error in ColdFusion Administrator -> Server Settings -> Settings:

Variable ENABLEIMPLICITUDFREGISTRATION is undefined.

Or, when trying to add a Data Source, such as Microsoft Access (with Unicode), you see a blank page.

This is most likely because you have upgraded from a previous CF9 installation to either CF 9.0.1 or the CHF1 (hotfix 1 for 9.0.0 and 9.0.1). You may think you haven’t done an upgrade, but pay particular attention during the install process: the CF9 installer goes through a migration step, so make sure to skip that.

It is actually easier to port over old settings by manually copying them from one CF Admin panel to another than to try to figure out how to fix this bug. I’d like to add: there is no known fix; it is an open issue in the CF Bug Tracker.

Filezilla FTP Server: “filename invalid” fix OR vsFTPD: “Could not create file”

Filezilla FTP Server can give a cryptic “filename invalid” message. Example:

(000004)9/1/2010 2:17:25 AM - backups (X.X.X.X)> STOR /var/lib/mysql/test/items.MYD
(000004)9/1/2010 2:17:26 AM - backups (X.X.X.X)> 550 Filename invalid

Or, vsFTPD can give a similarly cryptic “Could not create file” message. Example:

Sep 11 07:22:23 unknown vsftpd[8035]: [backups] FTP command: Client "X.X.X.X", "STOR ./var/lib/mysql/test/items.MYD"
Sep 11 07:22:23 unknown vsftpd[8035]: [backups] FTP response: Client "X.X.X.X", "553 Could not create file."

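It means that the filename specified (on upload) cannot be stored on the server under that name. The most common culprit is using the manual ftp command line and calling put with a full local path. With only one argument, the client reuses the local path as the remote filename, so the server is asked to create that same full path (as in the STOR lines above), which usually does not exist there: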
put /var/lib/mysql/test/items.MYD

To fix this, you must change the local directory before uploading
lcd /var/lib/mysql/test/
and then call
put items.MYD
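
If the upload is scripted rather than typed by hand, the same idea applies: send only the file name in the STOR command. Here is a minimal sketch using Python's standard ftplib. The stor_command and upload names are mine, and host, user and password are placeholders for your own server details.

```python
import os
from ftplib import FTP


def stor_command(local_path):
    """Build a STOR command that uses only the file name, not the local directory."""
    return "STOR " + os.path.basename(local_path)


def upload(host, user, password, local_path):
    """Upload local_path into the FTP account's current remote directory."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as fp:
            # The server only ever sees "STOR items.MYD", never the local path.
            ftp.storbinary(stor_command(local_path), fp)


print(stor_command("/var/lib/mysql/test/items.MYD"))
```

This mirrors the lcd plus put approach: the local path is used only to open the file, and the server gets a bare file name.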

Photoshop: Saved Gif Turns Red When Adding Text

Sometimes you search for something on Google, and the most trivial answers do not show up. So I figured I’d post this here as a reminder to myself.

Does your background turn to a red tint when you add text to a GIF image saved as a PSD?
Well, do this to fix it:

Open the .GIF in Photoshop. Convert the image to RGB (Image > Mode > RGB). GIFs use Indexed color and need to be converted to RGB. You’ll then be able to add layers and edit, then save as .PSD or whatever format you choose.