Scrapy-Redis-BloomFilter

Scrapy Redis Bloom Filter

Showing:

Popularity

Downloads/wk

0

GitHub Stars

147

Maintenance

Last Commit

2mos ago

Contributors

2

Package

Dependencies

1

License

MIT

Categories

Readme

Scrapy-Redis-BloomFilter

This is a package for supporting BloomFilter of Scrapy-Redis.

Installation

You can easily install this package with pip:

pip install scrapy-redis-bloomfilter

Dependency:

  • Scrapy-Redis >= 0.6.8

Usage

Add this settings to settings.py:

# Use this Scheduler, if your scrapy_redis version is <= 0.7.1
SCHEDULER = "scrapy_redis_bloomfilter.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis
DUPEFILTER_CLASS = "scrapy_redis_bloomfilter.dupefilter.RFPDupeFilter"

# Redis URL
REDIS_URL = 'redis://localhost:6379'

# Number of Hash Functions to use, defaults to 6
BLOOMFILTER_HASH_NUMBER = 6

# Redis Memory Bit of Bloom Filter Usage, 30 means 2^30 = 128MB, defaults to 30
BLOOMFILTER_BIT = 10

# Persist
SCHEDULER_PERSIST = True

Test

Here is a test of this project, usage:

git clone https://github.com/Python3WebSpider/ScrapyRedisBloomFilter.git
cd ScrapyRedisBloomFilter/test
scrapy crawl test

Note: please change REDIS_URL in settings.py.

Spider like this:

from scrapy import Request, Spider

class TestSpider(Spider):
    name = 'test'
    base_url = 'https://www.baidu.com/s?wd='

    def start_requests(self):
        for i in range(10):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

        # Here contains 10 duplicated Requests
        for i in range(100):
            url = self.base_url + str(i)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.debug('Response of ' + response.url)

Result like this:

{'bloomfilter/filtered': 10, # This is the number of Request filtered by BloomFilter
 'downloader/request_bytes': 34021,
 'downloader/request_count': 100,
 'downloader/request_method_count/GET': 100,
 'downloader/response_bytes': 72943,
 'downloader/response_count': 100,
 'downloader/response_status_count/200': 100,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 11, 9, 34, 30, 419597),
 'log_count/DEBUG': 202,
 'log_count/INFO': 7,
 'memusage/max': 54153216,
 'memusage/startup': 54153216,
 'response_received_count': 100,
 'scheduler/dequeued/redis': 100,
 'scheduler/enqueued/redis': 100,
 'start_time': datetime.datetime(2017, 8, 11, 9, 34, 26, 495018)}

Rate & Review

Great Documentation0
Easy to Use0
Performant0
Highly Customizable0
Bleeding Edge0
Responsive Maintainers0
Poor Documentation0
Hard to Use0
Slow0
Buggy0
Abandoned0
Unwelcoming Community0
100
No reviews found
Be the first to rate

Alternatives

No alternatives found

Tutorials

No tutorials found
Add a tutorial