Exporting Workbooks from Kedro with Openpyxl

Exporting Workbooks from Kedro with Openpyxl

This post is going to walk through how to go about writing (and formatting) excel files using Kedro and openpyxl in Python. For the uninitiated, Kedro is an open-source framework for creating data-science code. This is not meant to be a comprehensive tutorial in kedro (for that see the first two references at the end of this post). However for the sake of any part of this to make sense, there are a few concepts that need to be explained.

Kedro Concepts

Kedro builds pipelines, using nodes that accepts inputs and return outputs. The inputs/outputs to these nodes which are meant to be pure python functions can be either parameters defined in yaml files or an entry in kedro's data-catalog. For example, in the following pipeline (produced using kedro-viz), we have a raw table undergoing some processing by a node named "Create Base Table" which stores its output as "Base Table":

The mechanics of nodes, etc are covered in-depth in Kedro's documentation, however, there was one aspect in getting an Excel file as an output (as opposed to a parquet, csv etc) that required a slightly more nuanced approach.

When defining a data-catalog entry for kedro, (such as the Base Table above which is the output of the node), we define it as follows in yaml:

base_table:
    type: pandas.ParquetDataSet
    filepath: dbfs:/mnt/data/02_intermediate/base_table.pq
    layer: intermediate

The above assumes that our node returns some python object which we can call pandas to_parquet function on, which then writes to disk. The "type" is a list of different datasets supported within the kedro environment. We can confirm this theory by looking at kedro's implementation of the pandas.ParquetDataSet here.

The Mechanics of the Data Catalog

The data-catalog provides an interface between python code and data, by abstracting away the mechanics of loading/saving into dataset types, stored at the specified filepath.

Each of kedro's datasets must implement the _load, _save and _describe methods, which contain instructions on how a given dataset type implements load, save and repr respectively. Kedro has a dataset to cater to writing excel files called the ExcelDataSet If we look at its _save method:

   def _save(self, data: Union[pd.DataFrame, Dict[str, pd.DataFrame]]) -> None:
        output = BytesIO()
        save_path = get_filepath_str(self._get_save_path(), self._protocol)

        # pylint: disable=abstract-class-instantiated
        with pd.ExcelWriter(output, **self._writer_args) as writer:
            if isinstance(data, dict):
                for sheet_name, sheet_data in data.items():
                    sheet_data.to_excel(
                        writer, sheet_name=sheet_name, **self._save_args
                    )
            else:
                data.to_excel(writer, **self._save_args)

        with self._fs.open(save_path, mode="wb") as fs_file:
            fs_file.write(output.getvalue())

        self._invalidate_cache()

we can see that this dataset uses pandas ExcelWriter to output sheets into an excel file, which assumes that the data being saves is compatible with pandas native excel implementation.

Nuances with openpyxl

openpyxl (a popular python library for formatting excel workbooks), handles writing slightly differently. Although pandas can use openpyxl as a backend for writing, things become slightly more complicated when trying to write more tailored files. openpyxl uses the concept of a workbook (analogous to an Excel workbook), with sheets. For example:

from openpyxl import Workbook

wb = Workbook() # create a new workbook

ws = wb.active # get the active sheet

# do some formatting and data insertion
ws.title = "Title One"
ws['B4'] = "2026-01-01"
ws['B5'] = 100.00

The above code creates a Workbook object and does data insertion. Naturally, we would want to export this workbook to an .xlsx file. However, if we use the mechanics defined in ExcelDataset above:

In [8]: with ExcelWriter("file.xlsx") as writer:
   ...:     wb.to_excel(writer)
   ...:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-7575d1f1c47c> in <cell line: 1>()
      1 with ExcelWriter("file.xlsx") as writer:
----> 2     wb.to_excel(writer)
      3

AttributeError: 'Workbook' object has no attribute 'to_excel'

This is, upon further introspection, actually expected. The Workbook object of openpyxl does not natively support being written using pandas.ExcelWriter , and instead has its own interface using .save. (we'd gloss over its mechanics at the moment, the more important part is now creating a kedro dataset type that is compatible with writing this object).

Custom Datasets in kedro

As mentioned before, a custom dataset type in kedro must implement three methods:

  • _load : how to read the file

  • _save: mechanics on determining the file-path and saving

  • _describe: essentially what __repr__ does

So to create our custom dataset which handles openpyxl workbooks, its _save method must not use pandas.ExcelWriter, but must instead call .save directly on the input data object:

from openpyxl import WorkBook, load_workbook

from kedro.io import AbstractDataSet
from kedro.io.core import get_filepath_str, get_protocol_and_path

class OpenPyxlDataSet(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self, ) -> openpyxl.WorkBook:
        return load_workbook(self._filepath)

    def _save(self, data: openpyxl.WorkBook) -> None:
        data.save(self._filepath)

    def _describe(self, ): -> None:
        return dict(
            filepath=self._filepath
        )

Note the above lacks much of the necessary error-handling code that would make this dataset production ready.

Now, we can define a data-catalog entry (assuming our new dataset is in src/project/extras/datasets.py) as follows:

excel_workbook_dataset:
    type: project.extras.datasets.OpenPyxlDataSet
    filepath: data/08_reporting/excel_workbook.xlsx
    layer: reporting

Finally, we can point any one of our nodes with potential openpyxl code to use the above data-catalog entry as its output.