I was tired of writing the same bit of code every time I wanted to crawl something from the web, so I wrote this helper to take care of the boring part of the work.
I call it a helper because it's definitely not a framework, and it's so simple that I don't even want to call it a library. It's also my first time scratching the surface of Python metaprogramming, which some people call the dirty corner of Python. Anyway, with the help of metaclasses, the base class of the helper fits within 40 lines.
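To illustrate the general idea (a hypothetical sketch of the pattern, not gspider's actual code), a metaclass can collect the field attributes declared on a subclass at class-creation time, which is what keeps the base class so small:

```python
class Field:
    """Illustrative placeholder for a declarative field (names are made up)."""
    def __init__(self, selector):
        self.selector = selector

class SpiderMeta(type):
    def __new__(mcls, name, bases, namespace):
        # Gather every Field declared in the class body into one dict.
        fields = {k: v for k, v in namespace.items() if isinstance(v, Field)}
        cls = super().__new__(mcls, name, bases, namespace)
        cls._fields = fields
        return cls

class BaseSpider(metaclass=SpiderMeta):
    def __init__(self, url):
        self.url = url  # a real spider would fetch and parse the page here

class MySpider(BaseSpider):
    title = Field('h1')

print(sorted(MySpider._fields))  # ['title']
```

The base class never needs to know which fields a subclass will declare; the metaclass discovers them automatically.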
Basically, with this helper, writing a 'spider' looks like this:
```python
from gspider.base import BaseSpider
from gspider.fields import PQField

class PythonDocSpider(BaseSpider):
    title = PQField('h1')  # select the `h1` element from the page; other CSS selectors work too
    content = PQField('p')
```
To play around with it:
```python
>>> spider = PythonDocSpider('https://docs.python.org/3/library/base64.html')
>>> print(spider.title)
19.6. base64 — Base16, Base32, Base64, Base85 Data Encodings¶
```
Of course, there are other fields to use as well, for example `RegField` and `PQListField`.
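As a rough illustration of how a regex-based field could work (a minimal self-contained sketch; gspider's actual `RegField` signature may differ), a descriptor can run a regular expression against the page text and return the first captured group:

```python
import re

class RegField:
    """Hypothetical re-implementation of a regex field, for illustration only."""
    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Search the page text and return the first captured group, if any.
        m = self.pattern.search(obj.text)
        return m.group(1) if m else None

class Page:
    version = RegField(r'Python (\d+\.\d+)')

    def __init__(self, text):
        self.text = text  # stands in for the fetched page body

page = Page('Built with Python 3.6 and love')
print(page.version)  # '3.6'
```

A list field would follow the same shape, returning `findall` results (or a list of matched elements) instead of a single match.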
You can install this helper by cloning the GitHub repo and running setup:
```
git clone https://github.com/ericls/gspider
cd gspider
python setup.py install
```
As you might have guessed, it depends on `requests` and `pyquery`.

`requests` is easy to install because it has no dependencies and is written in pure Python.

`pyquery`, however, requires `lxml`, which can be a bit of a pain to install via pip; consult its official documentation about build requirements before running `pip install`. You may also need to increase your RAM or swap if you are building it on a VPS.
Project Name: Gspider. (Pretty random name, isn't it?)
GitHub Repo: https://github.com/ericls/gspider
- On second thought, I'd now like to call it a web content extractor instead of a crawler or spider.