Analyzing Pipeline Performance
1. Introduction
As is well known, Scrapy is one of the most popular frameworks for web scraping, regardless of programming language. Its core is built on top of Twisted, a framework for asynchronous programming in Python.
Until now, whenever I built pipelines to persist data in a database such as MongoDB, I used a connector based on a synchronous/blocking process, like pymongo. That said, it occurred to me to ask whether there is a significant difference when the pipeline uses an asynchronous driver to interact with the database instead of a synchronous one. To evaluate that difference we're going to use txmongo, which is written on top of Twisted.
For the purpose of this study, we'll use scrapy-bench, a benchmarking tool suggested by the Scrapy community.
2. Environment Setup
First of all, let's clone the scrapy-bench repository.
git clone https://github.com/scrapy/scrapy-bench
Our next step is to use a Docker container to locally host the website Books to Scrape for our scraping test. First, we're going to build an image.
docker build -t scrapy-bench-server -f docker/Dockerfile .
and then run the container to host our local copy of the website.
docker run --rm -ti --network=host scrapy-bench-server
The usage of scrapy-bench consists of several commands, listed below. One of these is broadworm, which uses 1000 copies of the site Books to Scrape, dynamically generated using Twisted.
Usage: scrapy-bench [OPTIONS] COMMAND1 [ARGS]... [COMMAND2 [ARGS]...]...
A benchmark suite for Scrapy.
Options:
--n-runs INTEGER Take multiple readings for the benchmark.
--only_result Display the results only.
--upload_result Upload the results to local codespeed
--book_url TEXT Use with bookworm command. The url to book.toscrape.com on
your local machine
--vmprof Profling benchmarkers with Vmprof and upload the result to
the web
-s, --set TEXT Settings to be passed to the Scrapy command. Use with the
bookworm/broadworm commands.
--help Show this message and exit.
Commands:
bookworm Spider to scrape locally hosted site
broadworm Broad crawl spider to scrape locally hosted sites
cssbench Micro-benchmark for extraction using css
csv Visit URLs from a CSV file
itemloader Item loader benchmarker
linkextractor Micro-benchmark for LinkExtractor()
urlparseprofile Urlparse benchmarker
xpathbench Micro-benchmark for extraction using xpath
Moving on with our setup, we're going to add network mappings to the /etc/hosts file. This step is necessary for our crawler to resolve all the domains needed to run the benchmark. There are 1000 records to add to that file, and typing them in manually is out of the question, so we wrote a small Python script to generate the required mappings. With that done, we just need to copy the generated content into /etc/hosts.
with open("./domain.txt", "w") as f:
    for i in range(1, 1001):
        f.write(f"127.0.0.1 domain{i}\n")
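If you prefer not to copy and paste by hand, the append can also be scripted. This is just a convenience sketch and assumes the process has permission to write to /etc/hosts (for example, run it with sudo).

# Append the generated mappings to /etc/hosts (requires write permission on the file)
with open("./domain.txt") as src, open("/etc/hosts", "a") as hosts:
    hosts.write(src.read())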
Our next goal is to create a virtual environment with the required dependencies for our project. The scrapy-bench project mainly evaluates the throughput of the crawling process, which can be defined as the number of pages or items crawled per unit of time, usually measured in seconds. Our scope, on the other hand, is to evaluate the performance of writing data to a MongoDB database with two drivers: pymongo and txmongo, where the former is synchronous and the latter asynchronous.
So we need to update the project dependencies, and for that we're going to edit the setup.py file.
from setuptools import setup

setup(
    name='Scrapy-Benchmark',
    version='1.0',
    packages=['scrapy_bench'],
    install_requires=[
        'Click',
        'scrapy',
        'statistics',
        'six',
        'vmprof',
        'colorama<=0.4.1',
        'pymongo',
        'txmongo',
    ],
    entry_points='''
        [console_scripts]
        scrapy-bench=bench:cli
    ''',
)
With that done, we can install the project requirements with the following commands.
python -m venv .venv
source .venv/bin/activate
pip install --editable .
Last but not least, we're going to create a MongoDB instance. For that, I've written a Docker Compose file.
mongo:
  image: mongo:3.6
  container_name: mongodb
  restart: always
  ports:
    - 27017:27017
  environment:
    - PUID=1000
    - PGID=1000
  volumes:
    - ~/development/Docker/Volumes/MongoDB:/data/db
To get our database up and running we run the following command.
docker-compose -f docker-compose.yml up mongo
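Before moving on, it can be worth confirming that the container is reachable. A quick sanity check with pymongo (an extra convenience, not part of the benchmark code) looks like this:

# Ping the local MongoDB instance started by docker-compose
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
print(client.admin.command("ping"))  # expects {'ok': 1.0}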
3. Customizing the pipeline
As mentioned, we're going to evaluate the performance of two connectors for the MongoDB database. For that, we'll use an extension written by Dimitrios Kouzis-Loukas which measures throughput and latencies during the crawling process. Better yet, this extension tracks the amount of time an item spends in the item pipeline, which in our case is the persistence of the data in MongoDB. If you are interested in the full implementation of this extension, you can find the code here.
To activate this extension in our code base, we need to include the lines below in the settings.py of our Scrapy project.
EXTENSIONS = {'broad.latencies.latencies.Latencies': 300}
LATENCIES_INTERVAL = 2
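For context, the sketch below shows roughly how an extension of this kind plugs into Scrapy: it is instantiated through from_crawler, connects to crawler signals, and logs stats on a timer driven by LATENCIES_INTERVAL. It is only an illustrative skeleton; the real Latencies extension (linked above) additionally measures latency and the time items spend in the pipeline.

# Illustrative skeleton only -- not the actual Latencies extension
import logging

from scrapy import signals
from twisted.internet import task

logger = logging.getLogger(__name__)


class LatenciesSketch:

    def __init__(self, interval):
        self.interval = interval
        self.items = 0

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.settings.getfloat('LATENCIES_INTERVAL', 2))
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def spider_opened(self, spider):
        self.task = task.LoopingCall(self.log_stats)
        self.task.start(self.interval)

    def spider_closed(self, spider):
        if self.task.running:
            self.task.stop()

    def item_scraped(self, item, response, spider):
        self.items += 1

    def log_stats(self):
        logger.info("Scraped %d items in the last %.1f s", self.items, self.interval)
        self.items = 0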
Moreover, we'll customize our spider BroadBenchSpider to persist the data emitted by the Latencies extension into a file for later analysis.
import logging

logname = 'sync-test.log'
# logname = 'async-test.log'

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
fh = logging.FileHandler(logname)
fh.setLevel(logging.INFO)
logger.addHandler(fh)
3.1 pymongo
We're going to use a pipeline for pymongo available in this repository. After creating a module to host the MongoDBPipeline class, which implements our pipeline to save items to the database, we need to add some extra configuration to settings.py, in a similar manner to what we did before.
MONGODB_URI = 'mongodb://localhost:27017'
MONGODB_DATABASE = 'scrapy'
MONGODB_COLLECTION = 'books'
ITEM_PIPELINES = {'broad.mongosync.mongosync.MongoDBPipeline': 300}
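For reference, a pymongo-based pipeline wired to these settings typically looks something like the sketch below. The actual MongoDBPipeline lives in the repository linked above, so details may differ; only the setting names are taken from the configuration above.

# Minimal sketch of a synchronous MongoDB pipeline (illustrative, not the linked implementation)
import pymongo


class MongoDBPipeline:

    def __init__(self, uri, database, collection):
        self.uri = uri
        self.database = database
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            uri=crawler.settings.get('MONGODB_URI'),
            database=crawler.settings.get('MONGODB_DATABASE'),
            collection=crawler.settings.get('MONGODB_COLLECTION'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.uri)
        self.db = self.client[self.database]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Blocking call: the crawl waits for MongoDB to acknowledge the write
        self.db[self.collection].insert_one(dict(item))
        return item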
3.1.1 pymongo benchmark
To get reliable metrics for our analysis, we're going to run our tests with 10 replicates. We start the test with the command:
time scrapy-bench --n-runs 10 broadworm
This crawls our target website ten times, with each run closing automatically once 800 items have been collected (CLOSESPIDER_ITEMCOUNT = 800). The tool reports throughput stats for the runs; the stats for the pymongo pipeline are shown below.
The results of the benchmark are (all speeds in items/sec) :
Test = 'Broad Crawl' Iterations = '10'
Mean : 18.736101409402963 Median : 21.69220137534533 Std Dev : 7.742630114239552
real 13m17.442s
user 4m6.533s
sys 0m3.214s
With that done, we can inspect a few lines of the sync-test.log file, which records the stats reported by the Latencies extension.
Scraped 0 items at 0.0 items/s, avg latency: 0.00 s and avg time in pipelines: 0.00 s
Scraped 50 items at 25.0 items/s, avg latency: 1.33 s and avg time in pipelines: 1.03 s
Scraped 13 items at 6.5 items/s, avg latency: 2.52 s and avg time in pipelines: 0.17 s
Scraped 16 items at 8.0 items/s, avg latency: 4.53 s and avg time in pipelines: 0.14 s
Scraped 53 items at 26.5 items/s, avg latency: 5.58 s and avg time in pipelines: 0.16 s
Scraped 70 items at 35.0 items/s, avg latency: 6.05 s and avg time in pipelines: 0.23 s
Scraped 83 items at 41.5 items/s, avg latency: 6.00 s and avg time in pipelines: 0.42 s
Scraped 81 items at 40.5 items/s, avg latency: 5.23 s and avg time in pipelines: 0.82 s
Scraped 78 items at 39.0 items/s, avg latency: 6.33 s and avg time in pipelines: 1.20 s
Scraped 86 items at 43.0 items/s, avg latency: 6.07 s and avg time in pipelines: 1.22 s
Scraped 96 items at 48.0 items/s, avg latency: 5.70 s and avg time in pipelines: 1.11 s
With this file in hand, we can parse the log and evaluate the time spent in the pipeline stage.
import re
import pandas as pd

dfsync = pd.read_csv("./data/sync-test.log", header=None, sep=';')
dfsync.rename(columns={0: 'log'}, inplace=True)
dfsync[['num_items', 'throughput', 'avg_latency', 'avg_time_in_pipeline']] = \
    dfsync['log'].apply(lambda x: re.findall(r"(\d+(?:\.\d+)?)", x)).tolist()
for col in ['throughput', 'avg_latency', 'avg_time_in_pipeline']:
    dfsync[col] = dfsync[col].astype(float)
dfsync['num_items'] = dfsync['num_items'].astype(int)
Summarizing this data, on average our spider spent 0.42 seconds in the pipeline stage when we used pymongo as our driver to persist data in the MongoDB database.
3.2 txmongo
For this driver we're going to use a MongoDB pipeline implementation available in this repository. First, we need to change our pipeline settings to use the MongoPipeline class, which relies on txmongo to communicate with the database.
ITEM_PIPELINES = {
    # 'broad.mongosync.mongosync.MongoDBPipeline': 300,
    'broad.asyncpipe.pipelines.mongo.MongoPipeline': 300,
}
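As with pymongo, a rough sketch of what a txmongo-backed pipeline can look like is shown below. It reuses the same setting names as before purely for illustration; the actual MongoPipeline in the linked repository may be organized differently.

# Minimal sketch of an asynchronous MongoDB pipeline (illustrative, not the linked implementation)
import txmongo
from twisted.internet import defer


class MongoPipeline:

    def __init__(self, uri, database, collection):
        self.uri = uri
        self.database = database
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            uri=crawler.settings.get('MONGODB_URI'),
            database=crawler.settings.get('MONGODB_DATABASE'),
            collection=crawler.settings.get('MONGODB_COLLECTION'),
        )

    @defer.inlineCallbacks
    def open_spider(self, spider):
        self.connection = yield txmongo.connection.ConnectionPool(self.uri)
        self.books = self.connection[self.database][self.collection]

    @defer.inlineCallbacks
    def close_spider(self, spider):
        yield self.connection.disconnect()

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        # Non-blocking: the insert returns a Deferred, so the crawl keeps going
        yield self.books.insert_one(dict(item))
        defer.returnValue(item)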
3.2.1 txmongo benchmark
Similarly, we’ll repeat the procedure for this test.
time scrapy-bench --n-runs 10 broadworm
After a few minutes the test finished, and the results were:
The results of the benchmark are (all speeds in items/sec) :
Test = 'Broad Crawl' Iterations = '10'
Mean : 20.92516241044525 Median : 21.582681058580057 Std Dev : 6.066783428367009
real 9m42.868s
user 4m23.577s
sys 0m3.396s
Analyzing the data logged by our extension, we can determine the average time our crawler spent in the item pipeline.
dfasync = pd.read_csv("./data/async-test.log", header=None, sep=';')
dfasync.rename(columns={0: 'log'}, inplace=True)
dfasync[['num_items', 'throughput', 'avg_latency', 'avg_time_in_pipeline']] = \
    dfasync['log'].apply(lambda x: re.findall(r"(\d+(?:\.\d+)?)", x)).tolist()
for col in ['throughput', 'avg_latency', 'avg_time_in_pipeline']:
    dfasync[col] = dfasync[col].astype(float)
dfasync['num_items'] = dfasync['num_items'].astype(int)
4. Results
The table below shows the consolidated results of our tests.
| driver  | items/sec    | avg time in pipeline (s) | benchmark overall time |
|---------|--------------|--------------------------|------------------------|
| pymongo | 18.74 ± 7.74 | 0.41 ± 0.39              | 13m17.442s             |
| txmongo | 20.93 ± 6.07 | 0.66 ± 0.61              | 9m42.868s              |
First of all, we're going to run a two-sample t-test to evaluate whether there is a statistically significant difference in the time spent in the pipeline stage of our spider when it uses pymongo versus txmongo to persist data in the MongoDB database. We'll perform the test at a 95% confidence level, and our hypotheses are:
H(0): There is no difference in the time spent in the Scrapy pipeline saving data to the database when using txmongo instead of pymongo.
H(1): There is a difference in the time spent in the Scrapy pipeline saving data to the database when using txmongo instead of pymongo.
For this task, we select all the records where the number of items collected by our spider was non-zero in each test, and for each sample we compute the average and standard deviation of the pipeline time.
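In pandas terms, that selection and aggregation looks roughly like the snippet below (a sketch using the column names from the parsing code above; the figures used in the test are the rounded results).

# Keep only the intervals where something was actually scraped
sync_times = dfsync.loc[dfsync['num_items'] > 0, 'avg_time_in_pipeline']
async_times = dfasync.loc[dfasync['num_items'] > 0, 'avg_time_in_pipeline']

print(sync_times.count(), sync_times.mean(), sync_times.std())
print(async_times.count(), async_times.mean(), async_times.std())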
N_pymongo = 242
N_txmongo = 224
avg_pymongo = 0.41
avg_txmongo = 0.66
std_pymongo = 0.39
std_txmongo = 0.61
df = N_pymongo + N_txmongo - 2  # degrees of freedom
Then we can calculate the t-statistic for our test.
import numpy as np
t_calc = (avg_pymongo - avg_txmongo) / np.sqrt((std_pymongo**2 / N_pymongo) + (std_txmongo**2 / N_txmongo))
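As an optional cross-check (not part of the original analysis), the same statistic can be computed with scipy, which also gives a p-value; equal_var=False matches the unpooled formula used above.

from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=avg_pymongo, std1=std_pymongo, nobs1=N_pymongo,
    mean2=avg_txmongo, std2=std_txmongo, nobs2=N_txmongo,
    equal_var=False,
)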
With the above calculation, we get a t-value of -5.224. We then compare this value against the critical t-value of the t-distribution; there are several sources for this, one of them being this site. For our degrees of freedom and a 95% confidence level, the critical value is 1.97. Since the absolute value of -5.224 exceeds 1.97, we reject the null hypothesis: there is statistical evidence that the average time spent saving data to the database differs when using txmongo instead of pymongo (on average, items spend more time in the pipeline with txmongo).
Besides that, even though items spend more time in the pipeline stage when we switch our driver from pymongo to txmongo, the overall time consumed by the complete test (10 runs) was 797.4 seconds with pymongo, roughly 37% slower than the 582.9 seconds taken when the test ran with txmongo. Perhaps this improvement in the overall scraping process when using txmongo can be explained by it being built on the same engine (Twisted) that Scrapy itself uses, so database writes no longer block the crawl.
I hope you have enjoyed reading this article as much as I enjoyed writing it. The code used in this article can be found in this repository. See you in the next post!