A Python Web Crawler Helper

I was tired of writing the same bit of code every time I wanted to crawl something from the web, so I wrote this helper to take care of the boring part of the work.

I call it a helper because it's definitely not a framework, and it's so simple that I don't even want to call it a library. It's also my first time scratching the surface of Python metaprogramming, which some people call the dirty corners of Python. Anyway, with the help of metaclasses, I was able to write the base class of the helper in under 40 lines.
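
To give an idea of the technique, here is a minimal sketch of the metaclass approach. This is not the actual gspider source; the names SpiderMeta, extract, and _fields below are illustrative:

import requests
from pyquery import PyQuery


class PQField:
    # Declarative field: extracts text from the page via a CSS selector.
    def __init__(self, selector):
        self.selector = selector

    def extract(self, doc):
        return doc(self.selector).text()


class SpiderMeta(type):
    # Collect every field declared on the class body into a `_fields` dict.
    def __new__(mcs, name, bases, attrs):
        fields = {key: value for key, value in attrs.items()
                  if isinstance(value, PQField)}
        for key in fields:
            del attrs[key]  # the attribute will hold the extracted value instead
        cls = super().__new__(mcs, name, bases, attrs)
        cls._fields = fields
        return cls


class BaseSpider(metaclass=SpiderMeta):
    # Fetch the page once, then run every declared field against it.
    def __init__(self, url):
        doc = PyQuery(requests.get(url).text)
        for name, field in self._fields.items():
            setattr(self, name, field.extract(doc))

The point of the metaclass is that a subclass can declare fields as plain class attributes, while each instance ends up holding the extracted values.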

Basically, with this helper, writing a 'spider' looks like this:

from gspider.base import BaseSpider
from gspider.fields import PQField


class PythonDocSpider(BaseSpider):
    title = PQField('h1')  # select the `h1` element from the page; other CSS selectors work too
    content = PQField('p')

To play around with it:

>>> spider = PythonDocSpider(
...     'https://docs.python.org/3/library/base64.html'
... )
>>> print(spider.title)
19.6. base64 — Base16, Base32, Base64, Base85 Data Encodings ¶

Of course, there are other fields to use, for example RegField and PQListField, as sketched below.
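
For instance, a spider using them might look something like this. Treat the constructor arguments as a sketch of the idea rather than exact signatures; check the repo for the details:

from gspider.base import BaseSpider
from gspider.fields import PQField, PQListField, RegField


class PythonDocSpider(BaseSpider):
    title = PQField('h1')
    # sketch: assuming PQListField collects the text of every matching element
    paragraphs = PQListField('p')
    # sketch: assuming RegField extracts the first match of a regular expression
    version = RegField(r'\d+\.\d+')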

You can install this helper by cloning the GitHub repo and running setup:

git clone https://github.com/ericls/gspider
cd gspider
python setup.py install

As you might have guessed, it depends on requests and pyquery. requests is easy to install because it has no dependencies and is written in pure Python. pyquery, however, requires lxml, which can be a bit of a pain to install via pip; consult its official documentation about requirements before pip installing it. You may also need to increase your RAM or swap if you are building it on a VPS.
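
On Debian or Ubuntu, for example, installing the usual lxml build dependencies first is generally enough (package names vary by distribution):

sudo apt-get install libxml2-dev libxslt1-dev python3-dev
pip install pyquery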

More information:

Project Name: Gspider. (Pretty random name, isn't it?)

GitHub Repo: https://github.com/ericls/gspider

NOTES:

  1. On second thought, I'd rather call it a web content extractor now, instead of a crawler or spider.