TensorFlow on x86, ARM and RISC-V
see Part 2
I like collections. I like TensorFlow. I like messing around with computers.
So why not build a collection of TensorFlow-enabled systems? This includes a desktop PC, ARM Single Board Computers (SBC) and a RISC-V microcontroller.
And while we are at it, why not get a sense of what performance we can expect?
More precisely, I have at my disposal:
- A not too recent x86 gaming PC. It features a 6th generation Intel Core (i7-6700HQ).
- Two ARM SBCs: a Raspberry Pi 4 (64-bit quad-core Cortex-A72) and an Nvidia Jetson Nano (64-bit quad-core Cortex-A57).
- Several RISC-V boards, based on the Kendryte K210 processor (64-bit, dual-core).
It is obviously unfair to do a simple comparison:
- A gaming PC is in the 1000-2000 € price range. RAM is typically 8 to 32 GB. It always runs some kind of operating system (Windows, Linux, macOS).
- An ARM SBC is in the 50-100 € price range. RAM is typically 256 MB to a few GB. It typically runs some kind of Linux.
- A RISC-V development board can cost as little as 30 €. RAM is in the megabytes (mine has 8 MB). It is not powerful enough to run a full OS, but can run an embedded MicroPython interpreter.
In terms of GPU / neural network hardware acceleration:
The x86 PC has a GTM960 Nvidia GPU with 1024 CUDA cores and the Nvidia Jetson Nano includes a GPU with 128 CUDA cores. A GPU can accelerate both training and inference.
The Raspberry Pi 4 uses Google’s Edge TPU USB neural network accelerator and the K210 chip includes its own ‘KPU’ neural network accelerator. Those accelerators are designed for inference only.
With a hardware accelerator, one needs to check which types of neural network operations are implemented. The K210’s KPU is restricted to operations used in image processing (CNN, aka Convolutional Neural Network). Google’s Edge TPU goes beyond CNNs, and also supports Recurrent Neural Networks (LSTM). But do not assume that your exotic neural network architecture will be fully supported by a given accelerator.
Let’s use image classification as our benchmark application. Image classification takes a picture as input and returns its type (e.g. it tells whether it is a dog or a cat).
But Noooo! I will not be using the beaten-to-death “dog vs cat” example. Let’s be a bit more practical (who cares about dogs vs cats?). The application recognizes 3 faces (me, my partner, a toddler), and the absence of any of those 3 faces.
In neural network terminology, this means 4 ‘classes’.
Standing on the shoulders of giants: Transfer learning
To build our image classification model, there is the “let’s start from scratch” way. It consists of learning how to design a CNN (Convolutional Neural Network, a typical network architecture for image classification), and then spending time training the model from scratch. Here time also means electricity, as training a model from scratch can be very compute-intensive.
The other option is the “let’s reuse other people’s work” way. This is called transfer learning and it is cool. The idea is to take a model that has already been trained on a similar task, and repurpose it for our own application.
After all, the deep learning community has already spent a lot of time (and electricity) training image classification models, using tons of input images; many of those trained models are available online.
Transfer learning works because the lower layers of the MobileNet model (the pre-trained model reused here) have learned to recognize ‘basic’ image shapes, such as edges, lines, ovals, etc. Obviously the last layers are specialized in MobileNet’s original classes (the green lizard vs chameleon business), but the lower layers perform some processing that should be generic enough to apply to our own 4-class problem.
As for any deep learning application, we start by gathering training data, i.e. a set of images for the 4 classes at hand. And, as for any deep learning application, the more training images, the better.
But, wait, I cannot ask a toddler to stay quiet while I am taking hundreds of pictures of her.
When relatively few training images are available, a typical approach is to use data augmentation to extend the training set, by creating new, artificial (aka synthetic) images.
Those images are created by applying various random transformations (rotate, flip, crop, translate, contrast, hue, saturation, etc.) to the original images. The network to be trained will benefit from being exposed to a wider variety of input images, and will still be learning relevant features from those augmented images.
TensorFlow provides easy access to data augmentation with the Keras API. First one defines a data generator object, with the various image transformations required. Then one applies those transformations to all image files in a given directory. Augmented images will be created and stored on disk, for later use.
In my case, I have 44 original images for each class, and I am creating 132 random synthetic images, for a total of 176 images per class. This should be enough for training the model.
The data augmentation Python code is available on GitHub.
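To make the idea concrete without depending on Keras, here is a minimal sketch of data augmentation in plain Python. It is not the code from the GitHub repository: tiny nested lists stand in for real image files, only random flips and one-pixel shifts are applied, and the names `augment` and `build_training_set` are made up for this illustration. It does reproduce the arithmetic above, with 3 synthetic copies per original image.

```python
import random

def augment(image, rng):
    """Return a randomly flipped and shifted copy of a 2D image (list of rows)."""
    rows = [row[:] for row in image]           # copy, so the original stays intact
    if rng.random() < 0.5:                     # random horizontal flip
        rows = [row[::-1] for row in rows]
    shift = rng.randint(-1, 1)                 # random horizontal shift of at most 1 pixel
    if shift > 0:
        rows = [[0] * shift + row[:-shift] for row in rows]
    elif shift < 0:
        rows = [row[-shift:] + [0] * (-shift) for row in rows]
    return rows

def build_training_set(originals, copies_per_image, seed=0):
    """Keep the originals and add `copies_per_image` synthetic images for each."""
    rng = random.Random(seed)
    synthetic = [augment(img, rng) for img in originals for _ in range(copies_per_image)]
    return originals + synthetic

# 44 placeholder "images" of 2x3 pixels stand in for the 44 photos of one class.
originals = [[[i, i + 1, i + 2], [i + 3, i + 4, i + 5]] for i in range(44)]
dataset = build_training_set(originals, copies_per_image=3)
print(len(dataset))  # 44 originals + 132 synthetic = 176
```

A real pipeline would use richer transformations (rotation, contrast, hue, etc.) and write the results to disk, as described above.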
Anna joins forces with mountaineer Kristoff and his reindeer sidekick …
No, not that Frozen …
Since we are not interested in classifying lizards, or any of the objects the vanilla MobileNet model was trained on, we cannot reuse MobileNet as is (we could, but it would fail miserably at classifying our 4 classes).
MobileNet consists of two ‘blocks’:
- a convolutional base, whose purpose in life is to learn image features. The first layers learn to recognize basic shapes (edges, lines, ovals), whereas the last layers learn to recognize ‘higher level’ elements, such as eyes, ears, etc.
- a classifier, whose purpose is to generate a prediction, e.g. is the image a lizard or a chameleon.
The convolutional base is rather generic (at least its first layers), whereas the classifier is very specific to the training set.
The first step in transfer learning is to swap the original (trained) classifier with our own (still untrained) classifier.
Fortunately, TensorFlow allows this with a few lines of code:
- Load from the internet a trained MobileNet model, without its classifier (as indicated by include_top=False):
import tensorflow as tf
IMG_SHAPE = (96, 96, 3)  # 96x96 pixels, 3 color channels
base_mobilenet = tf.keras.applications.mobilenet_v2.MobileNetV2(input_shape=IMG_SHAPE, include_top=False, weights='imagenet')
- Create a new model which combines MobileNet and our own classifier
base_mobilenet.trainable = False
inputs = tf.keras.Input(shape=(96, 96, 3))
# combine Mobilenet ...
x = base_mobilenet(inputs, training=False)
# with our own classifier ...
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
# to create a new model, to be trained
new_model = tf.keras.Model(inputs, outputs)
It is important that we freeze the MobileNet weights (with base_mobilenet.trainable = False), so that, when we train our model, we only update our own classifier, and do not update MobileNet’s weights.
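What freezing means during training can be sketched with a toy gradient-descent step in plain Python. This is only an illustration of the principle, not how Keras implements it: the two parameters, their gradients and the function `sgd_step` are all made up for the example.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one gradient-descent step, leaving frozen parameters untouched."""
    return {
        name: value - lr * grads[name] if trainable[name] else value
        for name, value in params.items()
    }

# One weight standing in for the MobileNet base, one for our classifier.
params = {"base_weight": 0.5, "classifier_weight": -0.2}
grads = {"base_weight": 1.0, "classifier_weight": 1.0}
trainable = {"base_weight": False, "classifier_weight": True}

updated = sgd_step(params, grads, trainable)
print(updated["base_weight"])  # 0.5, unchanged because it is frozen
```

Both parameters receive a gradient, but only the classifier weight moves.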
The resulting new model looks like:
Layer (type)                  Output Shape           Param #
input_2 (InputLayer)          [(None, 96, 96, 3)]    0
mobilenetv2_1.00_96           (None, 3, 3, 1280)     2257984
global_average_pooling2d_1    (None, 1280)           0
dense_1 (Dense)               (None, 4)              5124
=================================================================
Total params: 2,263,108
Trainable params: 5,124
Non-trainable params: 2,257,984
- the model’s inputs are 96x96 pixel images, with 3 color channels
- the trained MobileNet is included, with 2.2 million parameters (aka weights)
- the model’s output is 4 classes
Note that the 2.2M parameters are ‘non-trainable’ (frozen). Only our own classifier (with 5124 parameters) can be trained.
Let’s see in a future article how to train our model with our own training set.
— — — — — Do not cross this line if you are not interested in details — — — — —
When importing MobileNet without classifier, the last layer has a dimension of 3x3x1280. This is defined in the MobileNet architecture and corresponds to 1280 matrices of dimension 3x3.
The ‘global_average_pooling2d’ layer reduces this to a single vector of 1280 scalars, by taking the average of each 3x3 matrix. This plays the role of a flattening step (a plain Flatten layer would instead keep all 3x3x1280 = 11520 values). This layer of 1280 scalars, combined with the output layer of 4 scalars, is our custom classifier.
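The averaging itself is simple enough to sketch in plain Python. The function below is a hypothetical stand-in for the Keras layer, written for a [row][column][channel] layout, and the toy input uses 2 channels instead of 1280 to keep it readable.

```python
def global_average_pooling_2d(feature_maps):
    """Average each channel over height and width.

    feature_maps is indexed as [row][col][channel] (channels-last layout).
    """
    height = len(feature_maps)
    width = len(feature_maps[0])
    channels = len(feature_maps[0][0])
    return [
        sum(feature_maps[r][c][k] for r in range(height) for c in range(width))
        / (height * width)
        for k in range(channels)
    ]

# Toy 3x3 input with 2 channels: channel 0 is all ones,
# channel 1 holds the values 0..8.
toy = [[[1.0, float(3 * r + c)] for c in range(3)] for r in range(3)]
pooled = global_average_pooling_2d(toy)
print(pooled)  # [1.0, 4.0]
```

With the real model, the same operation turns the 3x3x1280 output into 1280 values, one per channel.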
The classifier is ‘fully connected’, aka ‘dense’; this means each of the 1280 neurons is connected to all 4 neurons in the last layer. Each connection is represented by a weight, aka a parameter. So the total number of weights in the classifier is (1280 + 1) * 4 = 5124, which is the number of trainable parameters indicated above (the extra 1 is the bias of each output neuron; the details are out of scope of this discussion).
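The parameter counts in the summary can be checked with a few lines of Python. The helper name `dense_param_count` is made up for this sketch; the "+ 1" accounts for the one bias value that a dense layer adds per output neuron.

```python
def dense_param_count(n_inputs, n_outputs):
    """Weights plus one bias per output neuron in a fully connected layer."""
    return (n_inputs + 1) * n_outputs

trainable = dense_param_count(1280, 4)
frozen = 2_257_984  # MobileNet base, taken from the model summary

print(trainable)           # 5124
print(frozen + trainable)  # 2263108
```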
By default, each weight is represented as a 32-bit floating point number. So a 2.2M parameter model will consume at least 8.8 MB of RAM. Not a problem for desktops, but RAM is limited on low-end microcontrollers, and some do not have this amount of RAM. In future articles we shall discuss how to shrink the model size, using quantization and TensorFlow Lite.
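A rough back-of-the-envelope check, as a sketch under two assumptions: the board's 8 MB is read as 8 MiB, and 8-bit quantization stores one byte per weight. It only counts the weights, not the activations or the runtime itself, so the real footprint is higher.

```python
BYTES_PER_FLOAT32 = 4
BYTES_PER_INT8 = 1  # assumption: 8-bit quantization stores one byte per weight

def weights_size_bytes(n_params, bytes_per_param):
    """Memory needed for the weights alone (no activations, no runtime overhead)."""
    return n_params * bytes_per_param

total_params = 2_263_108      # "Total params" from the model summary
board_ram = 8 * 1024 * 1024   # assumption: the board's 8 MB read as 8 MiB

float32_size = weights_size_bytes(total_params, BYTES_PER_FLOAT32)
int8_size = weights_size_bytes(total_params, BYTES_PER_INT8)

print(float32_size, float32_size <= board_ram)  # 9052432 False
print(int8_size, int8_size <= board_ram)        # 2263108 True
```

Under these assumptions the 32-bit weights alone do not fit in the board's RAM, which is why shrinking the model matters on the smallest devices.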
See also this article for a discussion of the last layer, and how to use it to get a prediction (Softmax).
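As a quick illustration of that last step, here is a minimal softmax in plain Python. The raw scores are made up, and the order of the class names is a hypothetical choice for this sketch; the real order depends on how the training set is organized.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

CLASS_NAMES = ["me", "partner", "toddler", "nobody"]  # hypothetical label order

scores = [2.0, 0.5, 0.1, -1.0]  # made-up raw outputs for one image
probs = softmax(scores)
predicted = CLASS_NAMES[probs.index(max(probs))]
print(predicted)  # me
```

The prediction is simply the class with the highest probability among the 4 outputs.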