Create a pipeline
This guide walks you through creating a pipeline that uses our REST API Client
to fetch issues from the GitHub API and load them into DuckDB.
Make sure you have installed dlt before following the steps below.
Task overview
Imagine you want to analyze issues from a GitHub project locally.
To achieve this, you need to write code that accomplishes the following:
- Constructs a correct request.
- Authenticates your request.
- Fetches and handles paginated issue data.
- Stores the data for analysis.
This may sound complicated, but dlt provides a REST API Client that lets you focus on your data rather than on managing API interactions.
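To make the first two steps concrete, here is a minimal sketch of building an authenticated request by hand, using only Python's standard library. The `build_issues_request` helper and the placeholder token are illustrative assumptions and not part of dlt; the repository URL is the one used later in this guide. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_issues_request(token: str, state: str = "open") -> Request:
    """Build (but do not send) a GET request for a repository's issues."""
    base_url = "https://api.github.com/repos/dlt-hub/dlt/issues"
    # Step 1: construct the request URL, including query parameters.
    url = f"{base_url}?{urlencode({'state': state})}"
    # Step 2: authenticate by passing the token as a bearer token.
    headers = {"Authorization": f"Bearer {token}"}
    return Request(url, headers=headers, method="GET")


# Placeholder token, matching the placeholder used in this guide.
request = build_issues_request("<api key value>")
print(request.full_url)
```

Pagination and storage would still have to be handled on top of this, which is the part the REST API Client is meant to take over.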
1. Initialize project
Create a new empty directory for your dlt project by running:
mkdir github_api_duckdb && cd github_api_duckdb
Start a dlt project with a pipeline template that loads data to DuckDB by running:
dlt init github_api duckdb
Install the dependencies necessary for DuckDB:
pip install -r requirements.txt
2. Obtain and add API credentials from GitHub
You will need to sign in to your GitHub account and create your access token via the Personal access tokens page.
Copy your new access token over to .dlt/secrets.toml:
[sources]
api_secret_key = '<api key value>'
This token will be used by github_api_source() to authenticate requests.
The secret name corresponds to the argument name in the source function.
Below, api_secret_key will get its value
from secrets.toml when github_api_source() is called.
@dlt.source
def github_api_source(api_secret_key: str = dlt.secrets.value):
    return github_api_resource(api_secret_key=api_secret_key)
Run the github_api_pipeline.py pipeline script to test that authentication headers look fine:
python github_api_pipeline.py
Your API key should be printed out to stdout along with some test data.
3. Request project issues from the GitHub API
Modify github_api_resource in github_api_pipeline.py to request issues data from your GitHub project's API:
import dlt
from dlt.sources.helpers.rest_client import paginate
from dlt.sources.helpers.rest_client.auth import BearerTokenAuth
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

@dlt.resource(write_disposition="replace")
def github_api_resource(api_secret_key: str = dlt.secrets.value):
    url = "https://api.github.com/repos/dlt-hub/dlt/issues"
    # paginate() requests the URL and yields one page of issues at a time
    for page in paginate(
        url,
        auth=BearerTokenAuth(api_secret_key),  # type: ignore
        paginator=HeaderLinkPaginator(),
        params={"state": "open"},
    ):
        yield page
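GitHub signals further pages through the `Link` response header, which is what a header-link paginator follows. As a rough sketch of the idea (not dlt's implementation), the hypothetical `next_page_url` helper below extracts the URL marked `rel="next"` from such a header using only the standard library. The example header value is made up for illustration.

```python
import re
from typing import Optional

# Matches one entry of a Link header, e.g. <https://example.com/?page=2>; rel="next"
_LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the URL marked rel="next" in a Link header, or None if there is none."""
    if not link_header:
        return None
    for url, rel in _LINK_ENTRY.findall(link_header):
        if rel == "next":
            return url
    return None


# Illustrative header value, shaped like the ones GitHub returns.
example_header = (
    '<https://api.github.com/repositories/1/issues?state=open&page=2>; rel="next", '
    '<https://api.github.com/repositories/1/issues?state=open&page=5>; rel="last"'
)
print(next_page_url(example_header))
```

A client keeps requesting the returned URL until the helper yields `None`, which corresponds to the last page.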
4. Load the data
Uncomment the commented-out code in the main function in github_api_pipeline.py, so that running the
python github_api_pipeline.py command will now also run the pipeline:
if __name__ == '__main__':
    # configure the pipeline with your destination details
    pipeline = dlt.pipeline(
        pipeline_name='github_api_pipeline',
        destination='duckdb',
        dataset_name='github_api_data'
    )
    # print credentials by running the resource
    data = list(github_api_resource())
    # print the data yielded from resource
    print(data)
    # run the pipeline with your parameters
    load_info = pipeline.run(github_api_source())
    # pretty print the information on data that was loaded
    print(load_info)
Run the github_api_pipeline.py pipeline script to test that the API call works:
python github_api_pipeline.py
This should print out JSON data containing the issues in the GitHub project.
It also prints the load_info object.
Let's explore the loaded data with the command dlt pipeline <pipeline_name> show.
dlt pipeline github_api_pipeline show
This will open the workspace dashboard app that gives you an overview of the data loaded.
5. Next steps
With a functioning pipeline, consider these next steps:
- Explore our REST Client.
- Deploy this pipeline with GitHub Actions, so that the data is automatically loaded on a schedule.
- Transform the loaded data with dbt or in Python using Pandas, Arrow, or Polars.
- Learn how to run, monitor, and alert when you put your pipeline in production.
- Try loading data to a different destination like Google BigQuery, Amazon Redshift, or Postgres.