Grabber is a concurrent declarative web scraper and downloader.
Features:
Run `grabber -h` to see command-line options.
See the `examples/` directory and consult the code to learn the format of the config files.
Note that for `tumblr.json` you'll need to replace all occurrences of `{{name}}` with a proper account (subdomain) name and all occurrences of `{{paging}}` with the text (as matched by XPath's `text()`) of whatever your target blog uses for its 'next page' link (or a semantic equivalent). The format is already template-friendly, so you can easily write a script that generates per-blog configs.
The examples provided are certainly not exhaustive.
Advice:
Remember you can build your config iteratively by using the `log` command, so that you make sure the current level works as it should before going further.
When downloading:
For the first run, set `bail` to `0` and use the options `-quiet -stdout`; you may also wish to pipe the output of the run to `tee log`. Then inspect the output or log file for any errors. If it looks OK, set `bail` to something reasonable, e.g. if you have 10 assets per page, set it to 20.
Content-Disposition
Copyright (c) 2014 Piotr S. Staszewski
Absolutely no warranty. See LICENSE.txt for details.