Protocol Buffers, Neural Networks and Python Generators

TLDR: I was working on my thesis and wanted to implement a particular paper that I could iterate upon. Long story short: the paper presented a fully-convolutional Siamese neural network for change detection. And me, being me, I was not satisfied with simply cloning the authors' model from GitHub and using it as-is. I had to implement it myself, in TensorFlow (instead of PyTorch), so that I could really experience the intricacies of their model. (So I did, and you can find it here, but that's beside the point of this post.)

12 hours and two days later, I was ready to train my model. A recent 2022 paper had released a dataset of 20000 image pairs, along with painstakingly labelled masks, for the purpose of training the very type of network I had written. So there I was, armed with data, a training loop written from scratch and a freshly brewed cup of coffee, ready to type the all-so-crucial command

python src/train.py

But then, after about 15 seconds or so, the stack trace in my terminal gave me the sense that all was not right... Garbled, nearly unintelligible collections of words, all hinting that I was running out of memory (somehow 64 gigabytes of system RAM and an 8GB GPU weren't enough?!), and then the magic error message brought my model training to a screeching halt, indicating that my "protos" did not allow for such large graph nodes (or something along those lines).

A quick side-quest: TensorFlow 2.x's default mode of operation is eager execution: when I hit run, each function executes as-is, with no low-level awareness of the operations that came before or after. However, by wrapping a function in the tf.function decorator, you can opt into graph execution for a potential performance boost, where a really smart piece of code optimally chooses how to execute your hand-written code as an execution graph. To get a better understanding of this, see the documentation.
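
To make the distinction concrete, here is a minimal sketch (the function names are mine, purely illustrative): the same computation, once eager and once wrapped in tf.function:

import tensorflow as tf

# Eager: each operation executes immediately, one after the other.
def eager_fn(x):
    return tf.reduce_sum(x * x)

# Graph: TensorFlow traces the Python code once and compiles it
# into an optimized execution graph that later calls reuse.
@tf.function
def graph_fn(x):
    return tf.reduce_sum(x * x)

x = tf.random.uniform((1024,))
print(eager_fn(x))   # ordinary eager call
print(graph_fn(x))   # first call traces the graph, later calls reuse it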

"proto"?

Now that you have an idea of what graph execution is, and a general idea of the error I was facing, there remains one vital gap in information: what the hell is a "proto"?! According to this Stack Overflow post, Protobuf has a hard limit of 2GB, since the arithmetic used is typically 32-bit signed. As this Medium post explained, TF graphs are simply protobufs. Each operation in TensorFlow is a symbolic handle for a graph-based operation, and the graph is stored as a Protocol Buffer. A Protocol Buffer ("proto" for short) is Google's language-neutral, extensible mechanism for serializing structured data. Code generated from a proto definition is used to easily read and write the structured data (in this case a TensorFlow graph) regardless of data stream and programming language.
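
You can see for yourself that a graph really is a proto: trace a tf.function and inspect its GraphDef, which is a Protocol Buffer message. A small sketch (f here is just a toy function); ByteSize() is the standard protobuf call for the serialized size, and it is this serialized message that must stay under the 2GB limit:

import tensorflow as tf

@tf.function
def f(x):
    return x + 1

# Trace the function to get a concrete graph, then export its GraphDef,
# which is a Protocol Buffer message.
concrete = f.get_concrete_function(tf.TensorSpec(shape=(None,), dtype=tf.float32))
graph_def = concrete.graph.as_graph_def()

print(type(graph_def))       # a protobuf GraphDef message
print(graph_def.ByteSize())  # serialized size in bytes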

To the best of my understanding, my gigantic dataset was causing individual operations in the execution graph to exceed the proto hard limit of 2GB, since I was using the tf.data API's from_tensor_slices function to keep my entire dataset in memory and perform operations from there. Now, the dataset is about 8GB, wayyyyy smaller than my 64GB of RAM; however, performing multiple layers of convolutions (not to mention, in parallel) quickly caused the entire training pipeline to shut down.
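
For context, here is roughly what the broken pipeline looked like (the arrays and shapes below are hypothetical stand-ins for my real loading code, kept small enough to actually run):

import numpy as np
import tensorflow as tf

# Stand-ins for the full dataset loaded into RAM (the real one was
# ~20000 image pairs; 100 is a placeholder so this snippet runs).
t1_images = np.zeros((100, 256, 256, 3), dtype=np.float32)
t2_images = np.zeros((100, 256, 256, 3), dtype=np.float32)
labels    = np.zeros((100, 256, 256, 1), dtype=np.uint8)

# from_tensor_slices embeds the arrays in the graph as constant nodes,
# so the entire dataset ends up inside the serialized proto --
# scale this up and you hit the 2GB limit.
train_ds = tf.data.Dataset.from_tensor_slices(((t1_images, t2_images), labels))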

So I needed to somehow use this large dataset without keeping all the images in memory, and for this, we now move to Python generators.

yield

A generator function allows you to declare a function that behaves like an iterator. For example, in order to read the lines of a text file, I could do the following, which loads the entire file first, then returns it as a list. The downside is that the entire file must be kept in memory:

def csv_reader(file_name):
    lines = []
    # read every line up front: the whole file ends up in memory at once
    with open(file_name, "r") as f:
        for row in f:
            lines.append(row)

    return lines

If, instead, I do the following:

def csv_reader(file_name):
    # yield one row at a time; only the current line is held in memory
    with open(file_name, "r") as f:
        for row in f:
            yield row

I can then treat the object returned by csv_reader as an iterator: each row is loaded only when the next value is requested, and the previous row (possibly already processed) can be discarded.

So something along the lines of

reader = csv_reader(file_name)
next(reader)  # loads just the first row; each further next() reads one more

Generators and tf.data

TensorFlow's tf.data API is extremely powerful, and the ability to define a Dataset from a generator is all the more so. This is how I solved my issue from above. First, I defined a generator for each of the train and validation sets:

(the preprocessing functions simply load each image from its file path, convert it to floats and normalize it)

import os

import tensorflow as tf

def train_gen(split='train', data_path='data/'):
    path = data_path + split
    # walk the time1/time2/label folders in lockstep, paired by sorted filename
    for t1, t2, l in zip(sorted(os.listdir(path + '/time1')),
                         sorted(os.listdir(path + '/time2')),
                         sorted(os.listdir(path + '/label'))):
        # build the full paths and preprocess (helpers described above)
        t1 = process_path_rgb(path + '/time1/' + t1)
        t2 = process_path_rgb(path + '/time2/' + t2)
        l = process_path_grey(path + '/label/' + l)

        yield (t1, t2), l

def val_gen(split='val', data_path='data/'):
    path = data_path + split
    for t1, t2, l in zip(sorted(os.listdir(path + '/time1')),
                         sorted(os.listdir(path + '/time2')),
                         sorted(os.listdir(path + '/label'))):
        # build the full paths and preprocess (helpers described above)
        t1 = process_path_rgb(path + '/time1/' + t1)
        t2 = process_path_rgb(path + '/time2/' + t2)
        l = process_path_grey(path + '/label/' + l)

        yield (t1, t2), l

Note that since my model is a Siamese neural network, it has two heads and therefore requires two inputs (t1 and t2 above refer to time-1 and time-2, or before-and-after, while l is the label mask indicating the areas that actually underwent change). Finally, I passed these generators to the tf.data API as follows:

train_ds = tf.data.Dataset.from_generator(
    train_gen, output_types=((tf.float32, tf.float32), tf.uint8))
val_ds = tf.data.Dataset.from_generator(
    val_gen, output_types=((tf.float32, tf.float32), tf.uint8))
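
As a side note, TensorFlow 2.4+ also accepts an output_signature argument on from_generator, which pins down shapes as well as dtypes. A sketch, assuming 256x256 RGB inputs and a single-channel uint8 mask (the shapes are placeholders for whatever your preprocessing actually emits):

signature = (
    (tf.TensorSpec(shape=(256, 256, 3), dtype=tf.float32),   # t1
     tf.TensorSpec(shape=(256, 256, 3), dtype=tf.float32)),  # t2
    tf.TensorSpec(shape=(256, 256, 1), dtype=tf.uint8),      # label mask
)

train_ds = tf.data.Dataset.from_generator(train_gen, output_signature=signature)
val_ds = tf.data.Dataset.from_generator(val_gen, output_signature=signature)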

The following section handles performance and batching, which again limits how much data is actually held in memory at any given point in time. The from_generator call achieves exactly what I wanted: data is loaded on an as-needed basis, and (thus far) it has avoided my headache with Protocol Buffers.

buffer_size = 1000
batch_size = 200

train_batches = (
    train_ds
    .cache()                 # remember elements after the first pass
    .shuffle(buffer_size)    # randomize order within a 1000-element buffer
    .batch(batch_size)
    .repeat()                # loop the dataset indefinitely for training
    .prefetch(buffer_size=tf.data.AUTOTUNE))  # overlap loading with training

val_batches = (
    val_ds
    .cache()
    .shuffle(buffer_size)
    .batch(batch_size)
    .repeat()
    .prefetch(buffer_size=tf.data.AUTOTUNE))
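
One caveat I should flag: called with no arguments, .cache() keeps every element in memory after the first epoch, which partially re-introduces the problem the generators solved. tf.data also supports caching to a file instead; a sketch, with a hypothetical cache path:

train_batches = (
    train_ds
    .cache('cache/train.tfcache')  # hypothetical path: spill the cache to disk, not RAM
    .shuffle(buffer_size)
    .batch(batch_size)
    .repeat()
    .prefetch(buffer_size=tf.data.AUTOTUNE))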

This is a very, very problem-specific post; however, it does cover some key aspects of dealing with large sets of image data, TensorFlow and Python generators. I hope that you learnt something!

For any changes, suggestions or overall comments, feel free to reach out to me on LinkedIn or on Twitter @aadiDev