Removing columns

Removing columns before loading data into a database is a reliable way to eliminate sensitive or unnecessary fields. In the example below, a source is created with a "country_code" column, which is then removed before the data is loaded.

Let's create a sample pipeline demonstrating the process of removing a column.

Create a source function that generates dummy data:

import dlt

# This function creates a dummy data source.
@dlt.source
def dummy_source():
    @dlt.resource(write_disposition="replace")
    def dummy_data():
        for i in range(3):
            yield {"id": i, "name": f"Jane Washington {i}", "country_code": 40 + i}

    return dummy_data()

This function creates three columns: id, name, and country_code.

Next, create a function to filter out columns from the data before loading it into a database as follows:

from typing import Dict, List, Optional

def remove_columns(doc: Dict, remove_columns: Optional[List[str]] = None) -> Dict:
    if remove_columns is None:
        remove_columns = []

    # Iterating over the list of columns to be removed
    for column_name in remove_columns:
        # Removing the column if it exists in the document
        if column_name in doc:
            del doc[column_name]

    return doc

doc: The document (dict) from which columns will be removed.

remove_columns: List of column names to be removed, defaults to None.
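Note that remove_columns deletes keys from the dictionary it receives, so the object passed in is modified. If that is undesirable, a non-mutating variant is one option. The sketch below is only an illustration: without_columns is a hypothetical name, not part of dlt, and it runs on a plain dictionary.

```python
from typing import Any, Dict, Iterable


def without_columns(doc: Dict[str, Any], columns: Iterable[str] = ()) -> Dict[str, Any]:
    """Return a copy of doc without the listed keys; doc itself is left untouched."""
    drop = set(columns)
    return {key: value for key, value in doc.items() if key not in drop}


# Made-up sample row shaped like the dummy data above.
row = {"id": 0, "name": "Jane Washington 0", "country_code": 40}
cleaned = without_columns(row, ["country_code"])

print(cleaned)  # {'id': 0, 'name': 'Jane Washington 0'}
print("country_code" in row)  # True: the original row is unchanged
```

Returning a copy costs one extra dictionary per row, which may matter for very large loads; the in-place version avoids that allocation.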

Next, declare the columns to be removed from the table, and then modify the source as follows:

# Example columns to remove:
remove_columns_list = ["country_code"]

# Create an instance of the source so you can edit it.
source_instance = dummy_source()

# Modify this source instance's resource
source_instance.dummy_data.add_map(
    lambda doc: remove_columns(doc, remove_columns_list)
)
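To see what this mapping step does to each row, the same logic can be tried on plain dictionaries without dlt. This is a sketch under assumptions: Python's built-in map stands in for the resource's per-item mapping, functools.partial takes the place of the lambda, and the rows list is made-up sample data mirroring the dummy source.

```python
from functools import partial
from typing import Dict, List, Optional


def remove_columns(doc: Dict, remove_columns: Optional[List[str]] = None) -> Dict:
    # Same behavior as the function above: drop each listed key if present.
    for column_name in remove_columns or []:
        doc.pop(column_name, None)
    return doc


remove_columns_list = ["country_code"]

# partial binds the column list, giving a one-argument callable like the lambda.
strip_columns = partial(remove_columns, remove_columns=remove_columns_list)

# Stand-in rows shaped like the output of the dummy source.
rows = [
    {"id": i, "name": f"Jane Washington {i}", "country_code": 40 + i}
    for i in range(3)
]

# Apply the callable to every row, one at a time.
cleaned_rows = list(map(strip_columns, rows))
for row in cleaned_rows:
    print(row)
```

Each row passes through the callable independently, so the function only ever needs to handle a single document.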

You can optionally inspect the result:

for row in source_instance:
    print(row)
#{'id': 0, 'name': 'Jane Washington 0'}
#{'id': 1, 'name': 'Jane Washington 1'}
#{'id': 2, 'name': 'Jane Washington 2'}

Finally, create a pipeline and run it with the modified source:

# Integrating with a dlt pipeline
pipeline = dlt.pipeline(
    pipeline_name='example',
    destination='bigquery',
    dataset_name='filtered_data'
)
# Run the pipeline with the transformed source
load_info = pipeline.run(source_instance)
print(load_info)