Transform data in Python with Arrow tables or DataFrames#
You can transform your data in Python using Pandas DataFrames, Arrow tables, or Polars DataFrames. To get started, please read the dataset docs.
Interactively transforming your data in Python#
Using the methods explained in the dataset docs, you can fetch data from your destination into a DataFrame or Arrow table in your local Python process and work with it interactively. This even works for filesystem destinations:
The example below reads GitHub reactions data from the issues table and
counts the reaction types.
pipeline = dlt.pipeline(
pipeline_name="github_pipeline",
destination="duckdb",
dataset_name="github_reactions",
dev_mode=True
)
# get a data frame of all reactions from the dataset
github_issues = pipeline.dataset().table("issues")
reactions = github_issues.select(
"reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket"
).df()
# calculate and print out the sum of all reactions
counts = reactions.sum(0).sort_values(0, ascending=False)
print(counts)
# alternatively, you can fetch the data as an arrow table
reactions = github_issues.select(
"reactions__+1", "reactions__-1", "reactions__laugh", "reactions__hooray", "reactions__rocket"
).arrow()
# ... do transformations on the arrow table
Persisting your transformed data#
Since dlt supports Arrow tables, Pandas or Polars DataFrames from resources directly, you can use the same pipeline to load the transformed data back into the destination.
A simple example#
A simple example that creates a new table from an existing user table but only with columns that do not contain private information. Note that we use the iter_arrow() method on the relation to iterate over the arrow table instead of fetching it all at once.
pipeline = dlt.pipeline(
pipeline_name="users_pipeline",
destination="duckdb",
dataset_name="users_raw",
dev_mode=True
)
# get user relation with only a few columns selected, but omitting email and name
users = pipeline.dataset().table("users").select("age", "amount_spent", "country")
# load the data into a new table called users_clean in the same dataset
pipeline.run(users.iter_arrow(chunk_size=1000), table_name="users_clean")
A more complex example#
The example above could easily be done in SQL. Let's assume you'd like to actually do in Python some Arrow transformations. For this we will create a resource from which we can yield the modified Arrow tables. The same is possibly with DataFrames.
import pyarrow.compute as pc
pipeline = dlt.pipeline(
pipeline_name="users_pipeline",
destination="duckdb",
dataset_name="users_raw",
dev_mode=True
)
# NOTE: this resource will work like a regular resource and support write_disposition, primary_key, etc.
# NOTE: For selecting only users above 18, we could also use the filter method on the relation with ibis expressions
@dlt.resource(table_name="users_clean")
def users_clean():
users = pipeline.dataset().table("users")
for arrow_table in users.iter_arrow(chunk_size=1000):
# we want to filter out users under 18
age_filter = pc.greater_equal(arrow_table["age"], 18)
arrow_table = arrow_table.filter(age_filter)
# we want to hash the email column
arrow_table = arrow_table.append_column("email_hash", pc.sha256(arrow_table["email"]))
# we want to remove the email column and name column
arrow_table = arrow_table.drop(["email", "name"])
# yield the transformed arrow table
yield arrow_table
pipeline.run(users_clean())
A Polars example#
You can also use Polars for transformations. Polars DataFrames and LazyFrames are automatically converted to Arrow tables when yielded from a resource.
import polars as pl
pipeline = dlt.pipeline(
pipeline_name="users_pipeline",
destination="duckdb",
dataset_name="users_raw",
dev_mode=True
)
@dlt.resource(table_name="users_clean")
def users_clean():
users = pipeline.dataset().table("users")
for arrow_table in users.iter_arrow(chunk_size=1000):
# convert to Polars for transformation
df = pl.from_arrow(arrow_table)
# filter out users under 18
df = df.filter(pl.col("age") >= 18)
# drop sensitive columns
df = df.drop(["email", "name"])
# yield the Polars DataFrame directly; dlt converts it to Arrow
yield df
pipeline.run(users_clean())
Other transforming tools#
If you want to transform your data before loading, you can use Python. If you want to transform the
data after loading, you can use Pandas or one of the following:
- dbt. (recommended)
dltSQL client.