Datasets

Concatenated Dataset

Pandas Concatenated Dataset

Pandas concatenated dataset is a sugar for the PartitionedDataset that concatenates all dataframe partitions into a single dataframe.

For example, let’s say you have a folder structure like this:

clients/
├── brazil.csv
├── canada.csv
└── united_states.csv

And you wan’t to load all the files as a single dataset. In this case, you could do something like this:

clients:
  type: kedro_partitioned.dataset.PandasConcatenatedDataset
  path: clients
  dataset:
    type: pandas.CSVDataset

Then, the clients dataset will be all the concatenated dataframes from the clients/*.csv files.

Path Safe Partitioned Dataset

Note

it is recommended to use PathSafePartitionedDataset instead of PartitionedDataset, for every step parallelism scenario. This is important because handling path safely is mandatory for the multinode partitioned dataset zip feature to work properly.