Create a pipeline

This guide walks you through creating a pipeline that uses our REST API Client
to fetch data from the GitHub API and load it into DuckDB.

Please make sure you have installed dlt before following the
steps below.

Task overview

Imagine you want to analyze issues from a GitHub project locally.
To achieve this, you need to write code that accomplishes the following:

Constructs a correct request.
Authenticates your request.
Fetches and handles paginated issue data.
Stores the data for analysis.

This may sound complicated, but dlt provides a REST API Client that lets you focus on your data rather than on managing API interactions.
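For contrast, here is a minimal sketch of the first two steps (constructing and authenticating a request) done by hand with only the Python standard library. The helper name build_issues_request and the placeholder token are assumptions made for illustration; nothing is sent over the network.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_issues_request(owner: str, repo: str, token: str, state: str = "open") -> Request:
    """Build (but do not send) an authenticated GitHub issues request."""
    # Step 1: construct the request URL with its query parameters.
    query = urlencode({"state": state})
    url = f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"
    # Step 2: authenticate by attaching the token as a bearer credential.
    headers = {"Authorization": f"Bearer {token}"}
    return Request(url, headers=headers)


# "placeholder-token" is a dummy value, not a real credential.
request = build_issues_request("dlt-hub", "dlt", "placeholder-token")
print(request.full_url)
```

Pagination and storage would still have to be written on top of this, which is the part the REST API Client and the pipeline take over in the steps below.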

1. Initialize project

Create a new empty directory for your dlt project by running:

mkdir github_api_duckdb && cd github_api_duckdb

Start a dlt project with a pipeline template that loads data to DuckDB by running:

dlt init github_api duckdb

Install the dependencies necessary for DuckDB:

pip install -r requirements.txt

2. Obtain and add API credentials from GitHub

You will need to sign in to your GitHub account and create your access token via the Personal access tokens page.

Copy your new access token over to .dlt/secrets.toml:

[sources]
api_secret_key = '<api key value>'

This token will be used by github_api_source() to authenticate requests.

The secret name corresponds to the argument name in the source function.
Below, api_secret_key will get its value
from secrets.toml when github_api_source() is called.

@dlt.source
def github_api_source(api_secret_key: str = dlt.secrets.value):
    return github_api_resource(api_secret_key=api_secret_key)
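The name-based lookup can be pictured with a toy model. The sketch below is not dlt's actual resolver: SECRET stands in for dlt.secrets.value, the secrets dictionary stands in for the parsed secrets.toml, and inject_secrets and toy_source are hypothetical names introduced only for this illustration.

```python
import inspect

# Sentinel standing in for dlt.secrets.value in this toy model.
SECRET = object()

# Stand-in for the parsed contents of .dlt/secrets.toml.
secrets = {"sources": {"api_secret_key": "<api key value>"}}


def inject_secrets(func):
    """Fill arguments whose default is SECRET by looking up their name."""
    signature = inspect.signature(func)

    def wrapper(*args, **kwargs):
        bound = signature.bind_partial(*args, **kwargs)
        for name, param in signature.parameters.items():
            # Only fill arguments the caller left out and that are marked secret.
            if name not in bound.arguments and param.default is SECRET:
                bound.arguments[name] = secrets["sources"][name]
        return func(*bound.args, **bound.kwargs)

    return wrapper


@inject_secrets
def toy_source(api_secret_key: str = SECRET):
    return api_secret_key


resolved = toy_source()  # value comes from the secrets dictionary
explicit = toy_source(api_secret_key="passed-by-caller")  # caller's value wins
```

The point of the model is that the argument name is the lookup key, so renaming the argument in the source function means renaming the entry in secrets.toml as well.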

Run the github_api_pipeline.py pipeline script to test that authentication headers look fine:

python github_api_pipeline.py

Your API key should be printed out to stdout along with some test data.

3. Request project issues from the GitHub API

Modify github_api_resource in github_api_pipeline.py to request issue data from a GitHub project (the example below uses the dlt-hub/dlt repository):

import dlt
from dlt.sources.helpers.rest_client import paginate
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

@dlt.resource(write_disposition="replace")
def github_api_resource(api_secret_key: str = dlt.secrets.value):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues"

    # paginate() yields one page of issues per iteration
    for page in paginate(
        url,
        auth=BearerTokenAuth(api_secret_key), # type: ignore
        paginator=HeaderLinkPaginator(),
        params={"state": "open"},
    ):
        yield page
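GitHub signals further pages through the Link response header, which is what a header-link paginator follows. As a rough illustration of that idea, and not the library's implementation, the standard-library sketch below extracts the rel="next" URL from such a header. The function name next_page_url and the sample URLs are assumptions.

```python
import re
from typing import Optional

# Matches one entry of a Link header, e.g. <https://...>; rel="next"
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL tagged rel="next", or None on the last page."""
    if not link_header:
        return None
    for url, rel in LINK_PATTERN.findall(link_header):
        if rel == "next":
            return url
    return None


# Sample header in the shape GitHub uses; the URLs are illustrative.
sample = (
    '<https://api.github.com/repositories/1/issues?page=2>; rel="next", '
    '<https://api.github.com/repositories/1/issues?page=5>; rel="last"'
)
print(next_page_url(sample))
```

A hand-written loop would keep requesting the returned URL until the function yields None; paginate() hides that loop behind a plain for statement.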

4. Load the data

Uncomment the commented-out code in the main function in github_api_pipeline.py, so that running the
python github_api_pipeline.py command will now also run the pipeline:

if __name__ == '__main__':
    # configure the pipeline with your destination details
    pipeline = dlt.pipeline(
        pipeline_name='github_api_pipeline',
        destination='duckdb',
        dataset_name='github_api_data'
    )

    # run the resource on its own to collect the data it yields
    data = list(github_api_resource())

    # print the data yielded from resource
    print(data)

    # run the pipeline with your parameters
    load_info = pipeline.run(github_api_source())

    # pretty print the information on data that was loaded
    print(load_info)

Run the github_api_pipeline.py pipeline script to test that the API call works:

python github_api_pipeline.py

This should print the issue data fetched from the GitHub project, followed by the load_info object.
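Once issue records are available, a first pass at analysis can be done in plain Python. The sketch below counts labels across a flat list of issue dictionaries. The sample records are hand-written for illustration and only mimic a few fields of GitHub issue objects, and count_labels is a hypothetical helper.

```python
from collections import Counter

# Hand-written records mimicking a few fields of GitHub issue objects.
sample_issues = [
    {"number": 1, "title": "Crash on load", "labels": [{"name": "bug"}]},
    {"number": 2, "title": "Add docs", "labels": [{"name": "documentation"}]},
    {"number": 3, "title": "Fix typo", "labels": [{"name": "bug"}], "pull_request": {}},
]


def count_labels(issues):
    """Count label names across issues, ignoring pull requests."""
    counts = Counter()
    for issue in issues:
        # GitHub's issues endpoint also returns pull requests,
        # marked by a "pull_request" key; skip them here.
        if "pull_request" in issue:
            continue
        for label in issue.get("labels", []):
            counts[label["name"]] += 1
    return counts


label_counts = count_labels(sample_issues)
print(label_counts)
```

The same kind of question can also be asked of the loaded DuckDB tables once the pipeline has run.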

Let's explore the loaded data with the command dlt pipeline <pipeline_name> show.

dlt pipeline github_api_pipeline show

This will open the workspace dashboard app that gives you an overview of the data loaded.

5. Next steps

With a functioning pipeline, consider these next steps:

Explore our REST Client in more depth.
Deploy this pipeline with GitHub Actions, so that the data is automatically loaded on a schedule.
Transform the loaded data with dbt or in Python using Pandas, Arrow, or Polars.
Learn how to run, monitor, and alert when you put your pipeline in production.
Try loading data to a different destination like Google BigQuery, Amazon Redshift, or Postgres.