TLDR: So I was working on my thesis and wanted to implement a particular paper that I could iterate upon. Long story short: this paper presented a fully convolutional Siamese neural network for change detection. And me, being me, I was not satisfied with simply cloning their model from GitHub and using it as-is. I had to implement it myself, using TensorFlow (instead of PyTorch), so that I could really experience the intricacies of their model. (So I did, and you can find it here, but that's beside the point of this post.)
12 hours and two days later, I was ready to train my model. A recent 2022 paper released a dataset of 20,000 image pairs, along with painstakingly labelled masks, for the purpose of training the very type of network I had written. So there I was, ready with data, my training loop written from scratch and a freshly brewed cup of coffee, ready to type the all-so-crucial command
python src/train.py
But then, after about 15 seconds or so, the stacktrace in my terminal gave me the sense that all was not right: garbled, nearly unintelligible collections of words, all hinting that I was running out of memory (somehow 64 GB of system RAM and an 8 GB GPU weren't enough?!). Then the magic error message brought my model training to a screeching halt, indicating that my "protos" did not allow for such large graph nodes (or something along those lines).
A quick side-quest: TensorFlow 2.x's default mode of operation is eager mode. When I hit run, each function runs as-is, and at a low level nothing cares about the commands that came before or after it. However, by using a special decorator (tf.function), there is a possibility for performance enhancement through graph execution, where a really smart piece of code optimally chooses how to execute my hand-written code as an execution graph. To get a better understanding of this, see the documentation.
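As a minimal sketch of what that decorator looks like in practice (the function and values here are purely illustrative, not from my actual training code):

```python
import tensorflow as tf

@tf.function  # traces this Python function into a TensorFlow graph on first call
def scaled_sum(x, y):
    # eager mode would run these ops one by one; under tf.function
    # they become nodes in a single optimized graph
    return tf.reduce_sum(x * 2.0 + y)

a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])
result = scaled_sum(a, b)  # first call traces; later calls reuse the graph
print(result.numpy())
```

Subsequent calls with the same input signature skip tracing entirely and just execute the cached graph, which is where the speedup comes from.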
"proto"?
Now that you have an idea of what graph execution is, and a general idea of the error I was facing, there remains one vital gap in information: what the hell is a "proto"?! According to this Stack Overflow post, Protobuf has a hard limit of 2 GB, since the arithmetic used is typically 32-bit signed. As this Medium post explained, TF graphs are simply protobufs: each operation in TensorFlow is a symbolic handle for a graph-based operation, and these are stored as Protocol Buffers. A Protocol Buffer (proto for short) is Google's language-neutral, extensible mechanism for serializing structured data. Specially generated code is used to easily read and write the structured data (in this case a TensorFlow graph) regardless of data stream and programming language.
To the best of my understanding, my gigantic dataset was causing individual operations in the execution graph to exceed the proto hard limit of 2 GB, since I was using the tf.data API's from_tensor_slices function to keep my entire dataset in memory and perform operations from there. Now, the dataset is about 8 GB large, wayyyyy smaller than my 64 GB of RAM; however, performing multiple layers of convolutions (not to mention, in parallel) quickly caused the entire training pipeline to shut down.
So I needed to somehow use this large dataset, but without keeping all the images in memory, and for this we now move to Python generators.
yield
A generator function allows you to declare a function that behaves like an iterator. For example, in order to read lines of a text file, I could do the following, which loads the entire file first, then returns it as a list. The downside of this is that the entire file must be kept in memory:
def csv_reader(file_name):
    lines = []
    for row in open(file_name, "r"):
        lines.append(row)
    return lines
If instead I do the following:
def csv_reader(file_name):
    for row in open(file_name, "r"):
        yield row
I could then call the csv_reader function as if it were an iterator, where the next row is loaded only when requested and the previous output (possibly already processed) is discarded. So, something along the lines of:
reader = csv_reader(file_name)
next(reader)
(Note that the generator object is created once and reused; calling next(csv_reader(file_name)) repeatedly would create a fresh generator each time and always return the first row.)
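As a self-contained sketch of the lazy version in action (the file name and contents here are made up for illustration; the `with` block is a small addition so the file handle is closed properly):

```python
import os
import tempfile

def csv_reader(file_name):
    # Yield one row at a time instead of building up a list
    with open(file_name, "r") as f:
        for row in f:
            yield row

# A throwaway file standing in for a real dataset
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False)
tmp.write("a,b\n1,2\n3,4\n")
tmp.close()

reader = csv_reader(tmp.name)
first = next(reader)                # only this line has been read so far
remaining = sum(1 for _ in reader)  # the other rows, streamed one at a time
print(first.strip(), remaining)

os.unlink(tmp.name)
```

At no point is the whole file held in memory; each `next` call reads just enough to produce the following row.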
Generators and tf.data
TensorFlow's tf.data API is extremely powerful, and the ability to define a Dataset from a generator makes it all the more so. So this is how I solved my issue from above. First, I defined a generator for each of the train and validation sets:
(the preprocessing functions simply load the images from their file paths, convert them to floats and normalize them)
import os
import tensorflow as tf

def train_gen(split='train', data_path='data/'):
    path = data_path + split
    # pair up the time-1, time-2 and label files by sorted filename
    for t1, t2, l in zip(sorted(os.listdir(path + '/time1')),
                         sorted(os.listdir(path + '/time2')),
                         sorted(os.listdir(path + '/label'))):
        # build full paths and load/preprocess each image
        t1 = process_path_rgb(f'{path}/time1/' + t1)
        t2 = process_path_rgb(f'{path}/time2/' + t2)
        l = process_path_grey(f'{path}/label/' + l)
        yield (t1, t2), l
def val_gen(split='val', data_path='data/'):
    path = data_path + split
    for t1, t2, l in zip(sorted(os.listdir(path + '/time1')),
                         sorted(os.listdir(path + '/time2')),
                         sorted(os.listdir(path + '/label'))):
        # build full paths and load/preprocess each image
        t1 = process_path_rgb(f'{path}/time1/' + t1)
        t2 = process_path_rgb(f'{path}/time2/' + t2)
        l = process_path_grey(f'{path}/label/' + l)
        yield (t1, t2), l
Note that since my model is a Siamese neural network, it has two heads and therefore requires two inputs (t1 and t2 above refer to time-1 and time-2, or before-and-after, while l is the label mask indicating the areas that actually underwent change). Finally, I passed these generators to the tf.data API as follows:
train_ds = tf.data.Dataset.from_generator(
    train_gen, output_types=((tf.float32, tf.float32), tf.uint8))

val_ds = tf.data.Dataset.from_generator(
    val_gen, output_types=((tf.float32, tf.float32), tf.uint8))
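One refinement worth noting: newer TensorFlow releases prefer output_signature over output_types in from_generator, since it also pins down the tensor shapes. Here is a self-contained sketch under assumed, made-up image dimensions, with a dummy generator standing in for my real train_gen:

```python
import numpy as np
import tensorflow as tf

def dummy_gen():
    # Stand-in for train_gen above: yields ((t1, t2), label) tuples.
    # 8x8 images are purely illustrative, not my dataset's real resolution.
    for _ in range(4):
        t1 = np.zeros((8, 8, 3), dtype=np.float32)
        t2 = np.zeros((8, 8, 3), dtype=np.float32)
        l = np.zeros((8, 8, 1), dtype=np.uint8)
        yield (t1, t2), l

ds = tf.data.Dataset.from_generator(
    dummy_gen,
    output_signature=(
        (tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32),
         tf.TensorSpec(shape=(None, None, 3), dtype=tf.float32)),
        tf.TensorSpec(shape=(None, None, 1), dtype=tf.uint8)))

(t1, t2), l = next(iter(ds))
print(t1.shape, l.dtype)
```

The nested structure of the signature mirrors the nested structure of what the generator yields, which is how the two Siamese inputs stay paired.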
The following section handles batching and performance, and again limits how much data is actually held in memory at any given point in time. The from_generator call achieves exactly what I wanted: data is loaded on an as-needed basis, which (thus far) has avoided my headache with Protocol Buffers. (One caveat worth flagging: cache() with no argument caches the dataset in memory after the first pass, so for a truly huge dataset you may want to pass it a file path to cache on disk instead.)
buffer_size = 1000
batch_size = 200
train_batches = (
    train_ds
    .cache()
    .shuffle(buffer_size)
    .batch(batch_size)
    .repeat()
    .prefetch(buffer_size=tf.data.AUTOTUNE))
val_batches = (
    val_ds
    .cache()
    .shuffle(buffer_size)
    .batch(batch_size)
    .repeat()
    .prefetch(buffer_size=tf.data.AUTOTUNE))
This is a very, very problem-specific post, but it does cover some key aspects of dealing with large sets of image data, TensorFlow and Python generators. I hope that you learnt something!
For any changes, suggestions or overall comments, feel free to reach out to me on LinkedIn or on Twitter @aadiDev