Transform data in Python with Arrow tables or DataFrames#

You can transform your data in Python using Pandas DataFrames, Arrow tables, or Polars DataFrames. To get started, please read the dataset docs.

Interactively transforming your data in Python#

Using the methods explained in the dataset docs, you can fetch data from your destination into a DataFrame or Arrow table in your local Python process and work with it interactively. This even works for filesystem destinations:

The example below reads GitHub reactions data from the issues table and
counts the reaction types.

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_reactions",
    dev_mode=True
)

# get a data frame of all reactions from the dataset
github_issues = pipeline.dataset().table("issues")
reactions = github_issues.select(
    "reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket"
).df()

# calculate and print out the sum of all reactions
counts = reactions.sum(0).sort_values(0, ascending=False)
print(counts)

# alternatively, you can fetch the data as an arrow table
reactions = github_issues.select(
    "reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket"
).arrow()
# ... do transformations on the arrow table

Persisting your transformed data#

Since dlt supports Arrow tables, Pandas or Polars DataFrames from resources directly, you can use the same pipeline to load the transformed data back into the destination.

A simple example#

A simple example that creates a new table from an existing user table but only with columns that do not contain private information. Note that we use the iter_arrow() method on the relation to iterate over the arrow table instead of fetching it all at once.

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

# get user relation with only a few columns selected, but omitting email and name
users = pipeline.dataset().table("users").select("age", "amount_spent", "country")

# load the data into a new table called users_clean in the same dataset
pipeline.run(users.iter_arrow(chunk_size=1000), table_name="users_clean")

A more complex example#

The example above could easily be done in SQL. Let's assume you'd like to actually do in Python some Arrow transformations. For this we will create a resource from which we can yield the modified Arrow tables. The same is possibly with DataFrames.

import pyarrow.compute as pc

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

# NOTE: this resource will work like a regular resource and support write_disposition, primary_key, etc.
# NOTE: For selecting only users above 18, we could also use the filter method on the relation with ibis expressions
@dlt.resource(table_name="users_clean")
def users_clean():
    users = pipeline.dataset().table("users")
    for arrow_table in users.iter_arrow(chunk_size=1000):

        # we want to filter out users under 18
        age_filter = pc.greater_equal(arrow_table["age"], 18)
        arrow_table = arrow_table.filter(age_filter)

        # we want to hash the email column
        arrow_table = arrow_table.append_column("email_hash", pc.sha256(arrow_table["email"]))

        # we want to remove the email column and name column
        arrow_table = arrow_table.drop(["email", "name"])

        # yield the transformed arrow table
        yield arrow_table

pipeline.run(users_clean())

A Polars example#

You can also use Polars for transformations. Polars DataFrames and LazyFrames are automatically converted to Arrow tables when yielded from a resource.

import polars as pl

pipeline = dlt.pipeline(
    pipeline_name="users_pipeline",
    destination="duckdb",
    dataset_name="users_raw",
    dev_mode=True
)

@dlt.resource(table_name="users_clean")
def users_clean():
    users = pipeline.dataset().table("users")
    for arrow_table in users.iter_arrow(chunk_size=1000):
        # convert to Polars for transformation
        df = pl.from_arrow(arrow_table)

        # filter out users under 18
        df = df.filter(pl.col("age") >= 18)

        # drop sensitive columns
        df = df.drop(["email", "name"])

        # yield the Polars DataFrame directly; dlt converts it to Arrow
        yield df

pipeline.run(users_clean())

Other transforming tools#

If you want to transform your data before loading, you can use Python. If you want to transform the
data after loading, you can use Pandas or one of the following:

dbt. (recommended)
dlt SQL client.