Scrapy#
This verified source utilizes Scrapy, an open-source and collaborative framework for web scraping.
Scrapy enables efficient extraction of required data from websites.
Setup guide#
Initialize the verified source#
To get started with your data pipeline, follow these steps:
- Enter the following command:

      dlt init scraping duckdb

  This command will initialize the pipeline example with Scrapy as the source and duckdb
  as the destination.

- If you'd like to use a different destination, simply replace duckdb with the name of your
  preferred destination.

- After running this command, a new directory will be created with the necessary files and
  configuration settings to get started.
Add credentials#
- The config.toml looks like:

      # put your configuration values here
      [sources.scraping]
      start_urls = ["URL to be scraped"] # please set me up!
      start_urls_file = "/path/to/urls.txt" # please set me up!

  When both start_urls and start_urls_file are provided, they will be merged and deduplicated
  to ensure Scrapy gets a unique set of start URLs.

- Inside the .dlt folder, you'll find a file called secrets.toml, which is where you can securely
  store your access tokens and other sensitive information. It's important to handle this
  file with care and keep it safe.

- Next, follow the destination documentation instructions to add credentials for your chosen
  destination, ensuring proper routing of your data to the final destination.
For more information, read Secrets and Configs.
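The merge-and-deduplicate behavior described for start_urls and start_urls_file can be pictured with a short sketch. This is an illustration only, not the source's actual implementation: the function name merge_start_urls, the one-URL-per-line file format, and the order-preserving result are all assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, List, Optional


def merge_start_urls(
    start_urls: Optional[Iterable[str]] = None,
    start_urls_file: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: combine both URL sources into one unique list."""
    candidates: List[str] = list(start_urls or [])

    if start_urls_file:
        # Assumption: the file holds one URL per line; blank lines are skipped.
        for line in Path(start_urls_file).read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:
                candidates.append(line)

    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))
```

With this sketch, a URL that appears both in the list and in the file is kept only once, so the spider would not be seeded with the same page twice.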
Run the pipeline#
In this section, we demonstrate how to use the MySpider class defined in "scraping_pipeline.py" to
scrape data from "https://quotes.toscrape.com/page/1/".
- Start by configuring the config.toml as follows:

      [sources.scraping]
      start_urls = ["https://quotes.toscrape.com/page/1/"] # please set me up!

  Additionally, set destination credentials in secrets.toml, as discussed.

- Before running the pipeline, ensure that you have installed all the necessary dependencies by
  running the command:

      pip install -r requirements.txt

- You're now ready to run the pipeline! To get started, run the following command:

      python scraping_pipeline.py
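Scrapy generally rejects request URLs that lack a scheme, so it can help to sanity-check the configured start URLs before running. The following is a minimal sketch of such a check using only the standard library. The helper name find_invalid_start_urls is hypothetical and is not part of dlt or Scrapy.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def find_invalid_start_urls(urls: Iterable[str]) -> List[str]:
    """Hypothetical pre-flight check: return entries that are not absolute http(s) URLs."""
    invalid = []
    for url in urls:
        parsed = urlparse(url.strip())
        # An absolute web URL needs both a scheme and a host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            invalid.append(url)
    return invalid


start_urls = ["https://quotes.toscrape.com/page/1/", "quotes.toscrape.com/page/2/"]
print(find_invalid_start_urls(start_urls))  # ['quotes.toscrape.com/page/2/']
```

An empty result suggests the list is at least well-formed. It says nothing about whether the pages exist or can be scraped.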
Customization#
Create your own pipeline#
If you wish to create your own data pipeline, follow these steps:
- The first step requires creating a spider class that scrapes data from the website. For
  example, the class MySpider below scrapes data from the URL
  "https://quotes.toscrape.com/page/1/".

      from typing import Any

      from scrapy import Spider
      from scrapy.http import Response


      class MySpider(Spider):
          def parse(self, response: Response, **kwargs: Any) -> Any:
              # Iterate through each "next" page link found
              for next_page in response.css("li.next a::attr(href)"):
                  if next_page:
                      yield response.follow(next_page.get(), self.parse)

              # Iterate through each quote block found on the page
              for quote in response.css("div.quote"):
                  # Extract the quote details
                  result = {
                      "quote": {
                          "text": quote.css("span.text::text").get(),
                          "author": quote.css("small.author::text").get(),
                          "tags": quote.css("div.tags a.tag::text").getall(),
                      },
                  }
                  yield result

  Define your own class tailored to the website you intend to scrape.

- Configure the pipeline by specifying the pipeline name, destination, and dataset as follows:

      import dlt

      pipeline = dlt.pipeline(
          pipeline_name="scrapy_pipeline",  # Use a custom name if desired
          destination="duckdb",  # Choose the appropriate destination (e.g., bigquery, redshift)
          dataset_name="scrapy_data",  # Use a custom name if desired
      )

  To read more about pipeline configuration, please refer to our documentation.

- To run the pipeline with customized Scrapy settings:

      # Assumes the source folder created by `dlt init` is importable as `scraping`
      from scraping import run_pipeline

      run_pipeline(
          pipeline,
          MySpider,
          # you can pass scrapy settings overrides here
          scrapy_settings={
              # How many sub pages to scrape
              # https://docs.scrapy.org/en/latest/topics/settings.html#depth-limit
              "DEPTH_LIMIT": 100,
              "SPIDER_MIDDLEWARES": {
                  "scrapy.spidermiddlewares.depth.DepthMiddleware": 200,
                  "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 300,
              },
              "HTTPERROR_ALLOW_ALL": False,
          },
          write_disposition="append",
      )

  In the above example, Scrapy settings are passed as a parameter. For more information about
  Scrapy settings, please refer to the Scrapy documentation.

- To limit the number of items processed, use the "on_before_start" function to set a limit on
  the resource the pipeline processes. For instance, setting the limit to two allows the
  resource to yield a maximum of two items.

      from dlt.sources import DltResource


      def on_before_start(res: DltResource) -> None:
          res.add_limit(2)


      run_pipeline(
          pipeline,
          MySpider,
          batch_size=10,
          scrapy_settings={
              "DEPTH_LIMIT": 100,
              "SPIDER_MIDDLEWARES": {
                  "scrapy.spidermiddlewares.depth.DepthMiddleware": 200,
                  "scrapy.spidermiddlewares.httperror.HttpErrorMiddleware": 300,
              }
          },
          on_before_start=on_before_start,
          write_disposition="append",
      )

- To create a pipeline using the Scrapy host, use create_pipeline_runner defined in helpers.py,
  as follows:

      # Assumes the source folder created by `dlt init` is importable as `scraping`
      from scraping.helpers import create_pipeline_runner

      scraping_host = create_pipeline_runner(pipeline, MySpider, batch_size=10)
      scraping_host.pipeline_runner.scraping_resource.add_limit(2)
      scraping_host.run(dataset_name="quotes", write_disposition="append")
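To build intuition for how batch_size and a limit of two might interact, here is a conceptual sketch using plain Python generators. It does not use dlt or Scrapy and is not how the source is implemented. The helpers batched and limited are hypothetical, and the sketch assumes the limit counts yielded batches rather than individual scraped items.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group scraped items into lists of at most `batch_size` entries."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def limited(batches: Iterable[List[dict]], max_yields: int) -> Iterator[List[dict]]:
    """Stop after `max_yields` batches (assumption: the limit counts yields)."""
    return islice(batches, max_yields)


# 35 fake scraped items stand in for what a spider might produce.
scraped = ({"quote": {"text": f"quote {i}"}} for i in range(35))

loaded = [
    item
    for batch in limited(batched(scraped, batch_size=10), max_yields=2)
    for item in batch
]
print(len(loaded))  # 20
```

Under that assumption, a small limit combined with a modest batch size keeps test runs short, which is handy while developing a new spider.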