TensorFlow on x86, ARM and RISC-V
See Part 1
So let’s train our model. My training set consists of 704 images spread across 4 classes.
Since I am using Transfer Learning (and not training from scratch), I can happily run training locally on my PC, even though I don’t have the fastest GPU in the world.
Before starting training, let’s see, for fun, how the model performs ‘as is’, meaning with our untrained classifier:
initial accuracy(test): 0.27
No miracle. The accuracy is 0.27, meaning the model ‘gets it right’ 27% of the time. With 4 classes, random guessing would score about 25%, so this is no better than chance. The model has not learned anything yet.
22/22 [==============================] - 2s 76ms/step - loss: 0.5557 - accuracy: 0.8054 - val_loss: 0.3926 - val_accuracy: 0.8958
Oops… that was fast. After only 8 ‘training loops’ (epochs), we already get about 90% validation accuracy, and each epoch takes only 2 seconds on my old GPU (960M). I am using TensorFlow 2.8.
This is the coolness of Transfer Learning.
Remember, here we only train the classifier’s 5124 parameters.
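The 5124 figure can be checked by hand. Here is a minimal sketch, assuming the classifier is a single dense layer sitting on top of a 1280-value feature vector (that is the size of MobileNetV2's pooled output; the exact MobileNet variant is my assumption here):

```python
def dense_param_count(num_inputs: int, num_outputs: int) -> int:
    """Number of weights plus biases in one fully connected layer."""
    return num_inputs * num_outputs + num_outputs


FEATURES = 1280  # assumed size of the pooled feature vector
CLASSES = 4      # number of classes in the training set

print(dense_param_count(FEATURES, CLASSES))  # 5124
```

That is 1280 x 4 weights plus 4 biases, tiny compared with the millions of frozen parameters in the convolution block.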
We could stop training here, but we have one more trick in our bag.
As explained in Part 1, the first layers in the convolution block learn basic image features (edges, lines…) whereas the last layers learn features more specific to the problem at hand (4 different faces).
The idea of fine tuning is to continue training, but with some of the last convolution layers unfrozen. This “forces” the model to learn image features that are more specific to our 4 classes.
This must be done wisely: the more layers we unfreeze, the more parameters there are to train, and the more we risk discarding what MobileNet already learned.
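Simply unfreeze some of the model’s last layers with:
# Unfreeze the base model, then re-freeze every layer below fine_tune_at.
# (A base model left with trainable = False exposes no trainable weights,
# even if some of its inner layers are flagged trainable.)
base_model.trainable = True
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False
Then resume training. After a few more epochs, we get 100% validation accuracy.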
22/22 [==============================] - 3s 114ms/step - loss: 0.0124 - accuracy: 0.9986 - val_loss: 0.0070 - val_accuracy: 1.0000
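To make the freeze/unfreeze mechanics concrete, here is a minimal sketch rather than the actual training script. The 96x96x3 input size, the cut-off index FINE_TUNE_AT = 100, the learning rate and the use of MobileNetV2 are all assumptions of mine. I also build the base with weights=None so the sketch runs without downloading anything; a real run would load the pretrained ImageNet weights.

```python
import tensorflow as tf

IMG_SHAPE = (96, 96, 3)  # assumed input size
NUM_CLASSES = 4
FINE_TUNE_AT = 100       # hypothetical index of the first unfrozen layer


def build_model():
    """Frozen MobileNetV2 base with a small trainable classifier on top."""
    base_model = tf.keras.applications.MobileNetV2(
        input_shape=IMG_SHAPE, include_top=False, weights=None)
    base_model.trainable = False  # transfer learning: freeze the whole base
    inputs = tf.keras.Input(shape=IMG_SHAPE)
    # training=False keeps the batch-norm layers in inference mode
    x = base_model(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return base_model, tf.keras.Model(inputs, outputs)


def unfreeze_top(base_model, fine_tune_at):
    """Make only the layers from index fine_tune_at onwards trainable."""
    base_model.trainable = True
    for layer in base_model.layers[:fine_tune_at]:
        layer.trainable = False


def count_trainable(model):
    """Total number of scalar parameters the optimizer will update."""
    total = 0
    for weight in model.trainable_weights:
        size = 1
        for dim in weight.shape:
            size *= int(dim)
        total += size
    return total


def compile_for_fine_tuning(model):
    """Recompile after changing trainable flags, with a small learning rate."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
```

Calling count_trainable(model) right after build_model() and again after unfreeze_top(base_model, FINE_TUNE_AT) shows how many extra parameters fine tuning puts back into play. The model has to be recompiled after the flags change, otherwise training keeps using the old set of trainable weights.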
Now that the model is trained, we can run some inference (prediction). Let’s start with the x86 PC.
When presented with a single image, the model takes 75 ms to generate a prediction (i.e. ~13 predictions/second).
h5 size is the size of the trained model, i.e. 16 MB.
When presented with a batch of 64 images, and generating 64 classifications at once, the inference time per image is, on average, much smaller than when processing a single image. This has to do with set-up time in TensorFlow: the fixed per-call overhead is spread over the whole batch.
Note that for many applications, batching multiple inputs is not possible, and the application needs to run inference on each input as it arrives (e.g. run a prediction for every frame of a real-time video stream).
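The arithmetic behind these numbers is simple enough to sketch. The first function just converts a latency into a throughput. The second is a toy cost model of my own (fixed overhead per call plus compute per image), and the 70 ms / 5 ms split is purely hypothetical, not something I measured:

```python
def predictions_per_second(latency_ms: float, batch_size: int = 1) -> float:
    """Throughput when one call takes latency_ms and returns batch_size results."""
    return batch_size * 1000.0 / latency_ms


def per_image_latency_ms(overhead_ms: float, compute_ms: float, batch_size: int) -> float:
    """Toy cost model: a fixed per-call overhead amortised over the batch."""
    return overhead_ms / batch_size + compute_ms


print(round(predictions_per_second(75), 1))  # 13.3

# Hypothetical split of a 75 ms call: 70 ms of overhead + 5 ms of compute per image
for batch in (1, 8, 64):
    print(batch, round(per_image_latency_ms(70, 5, batch), 2))
```

With a batch of 1 the overhead dominates, which is exactly the situation of a real-time video stream processed frame by frame.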
Turn on the Lite
In the test above, we ran our model on a laptop. However, laptops (or servers) are not the only computing platforms out there. By far. Let’s think of mobile devices, or even very low cost, embedded, edge devices.
Mobile/embedded devices have memory constraints, and are typically running on battery, and therefore need to optimize their power consumption.
For those devices, a specific flavor of TensorFlow exists: TensorFlow Lite.
TFlite is designed for inference (not training) and focuses on:
- latency & privacy (on device inference, so no need for any cloud/internet connectivity)
- reduced model size (reduced memory requirements)
- and optimized power consumption (can use modest CPU)
One key technique to reduce model size is post training quantization.
As explained in Part 1, a trained model is just a bunch of parameters stored as 32-bit floating point numbers. Quantization converts those numbers to 16-bit floating point, or even 8-bit integers. This reduces the model’s size and processing requirements at the expense of some accuracy degradation.
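To see where the accuracy degradation comes from, here is a minimal sketch of 8-bit quantization in plain Python. It uses a generic affine scheme (one scale and one zero point for the whole list of values), which illustrates the idea but is not necessarily what TFlite does internally. The weights are made up.

```python
def quantization_params(values, num_bits=8):
    """Scale and zero point mapping [min, max] onto [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    # The represented range must contain 0 so that 0.0 maps to an exact integer
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 if every value is 0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, num_bits=8):
    """Float -> integer code, clamped to the representable range."""
    q = round(x / scale + zero_point)
    return max(0, min(2 ** num_bits - 1, q))


def dequantize(q, scale, zero_point):
    """Integer code -> approximate float."""
    return (q - zero_point) * scale


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # made-up fp32 parameters
scale, zero_point = quantization_params(weights)
for w in weights:
    q = quantize(w, scale, zero_point)
    print(w, q, round(dequantize(q, scale, zero_point), 4))
```

Each parameter now fits in one byte instead of four, and the value recovered after the round trip is off by at most half a quantization step (scale / 2). That small error, repeated over every parameter, is the accuracy we trade for size.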
The key quantization techniques are listed below. Each option is optimized for a specific hardware type.
In particular, Full integer quantization (aka Integer only) is designed for low-end devices, such as micro-controllers without floating point support.
Google’s deep learning accelerator (Edge TPU) also requires such an integer-only model.
The table above refers to POST TRAINING quantization, i.e. a trained TensorFlow model is converted offline to its quantized version.
It is also possible to use QUANTIZATION AWARE training. However, with TF 2.8, this is not supported with transfer learning.
Note that it is possible to use TFlite ‘as is’ (i.e. without quantization). The model will still be optimized for mobile/edge devices, but will use 32-bit floating point parameters, forgoing the reduction in memory requirements.
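For reference, the three conversions used in this article (no quantization, 16-bit float, full integer) can be sketched with the TFLiteConverter API as follows. This is a sketch and not a copy of my script: the function names are made up, and representative_images is assumed to be a small array of preprocessed float32 training images, which the converter needs to calibrate the integer ranges.

```python
import numpy as np
import tensorflow as tf


def convert_fp32(model):
    """Plain TFlite conversion, parameters stay 32-bit floats."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    return converter.convert()


def convert_fp16(model):
    """Post training quantization of the weights to 16-bit floats."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]
    return converter.convert()


def convert_int8(model, representative_images):
    """Full integer quantization, with uint8 input and output."""

    def representative_dataset():
        # One sample at a time, with a leading batch dimension
        for image in representative_images:
            yield [image[np.newaxis, ...].astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Refuse to fall back to float operations
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
    return converter.convert()
```

Each function returns the converted model as bytes, which can then be written to a .tflite file.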
The Python application available on GitHub trains a model using transfer learning, and then converts it into TFlite, with and without quantization.
lili_transfert_learning.py -ptransfert -b
-ptransfert: train model with transfer learning, create TFlite models with and without quantization
-b: execute benchmark after training
lili is the name of my partner
The full_model.h5 file is the trained model before conversion to TFlite. Converted TFlite models have a .tflite extension (one .tflite file per quantization).
- fp32 is the TFlite model without quantization (i.e. using 32-bit floating point). The model size is significantly reduced (8 MB vs 16 MB), without impact on accuracy. Inference time is also improved.
- GPU uses 16-bit floating point quantization. As expected, the model size is half what it is with fp32.
- TPU uses 8-bit integer quantization. Again, the model size is decreased. However, inference time is through the roof. To be honest, I am not sure what is going on here, and I guess it has to do with the fact that x86 is not optimized to run int8 operations (after all, who in their right mind would use 8 bits on a 64-bit CPU?). Let’s try the 8-bit model on the Edge TPU accelerator.
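The size ratios follow directly from the number of bytes used per parameter. A back-of-the-envelope sketch, where the parameter count is a round hypothetical number and not the real count of my model:

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_size_mb(num_params: int, dtype: str) -> float:
    """Storage needed for the parameters alone, in megabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


NUM_PARAMS = 2_000_000  # hypothetical parameter count
for dtype in BYTES_PER_PARAM:
    print(dtype, weights_size_mb(NUM_PARAMS, dtype), "MB")
```

Real .tflite files also store the graph structure and the quantization parameters, so the measured ratios are not exactly 1/2 and 1/4.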
To execute on the Edge TPU, the model needs to be converted from the TFlite format to the Edge TPU format, using the edgetpu_compiler (which is available only on Linux).
(from linux)$ edgetpu_compiler -s -m13 TPU.tflite
From the edgetpu_compiler’s output, we can see that all operations are mapped to the TPU, i.e. executed in hardware. This is not a surprise since the Edge TPU supports convolutional neural networks (CNNs), and MobileNet is a CNN. If the model used operations not implemented by the Edge TPU, execution of those would fall back to the CPU, defeating the purpose of hardware acceleration.
With the Edge TPU connected to a USB port of the PC, the inference time is much more sensible. It is faster than the non-quantized model (i.e. 2 ms vs 3.3 ms), with only 1/3 of the model size, and without any impact on accuracy.
Metadata can be included in the TFlite file. Amongst other things, this allows an application using the model to dynamically retrieve the model’s input and output formats.
For instance, the model below expects images of 96 x 96 pixels, with 3 color channels, where each value is encoded as uint8 (8-bit unsigned integer), while the output is an array of four uint8 (one per class).
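An application receiving those four uint8 values still has to turn them into a class and a confidence. A minimal sketch in plain Python: the label names are placeholders, and the scale and zero point are assumed values (in a real application they would be read from the model, for instance from the "quantization" entry of the interpreter's output details).

```python
def decode_uint8_output(raw_scores, scale, zero_point, labels):
    """Map quantized scores back to floats and return the best class."""
    probs = [(q - zero_point) * scale for q in raw_scores]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


LABELS = ["class_0", "class_1", "class_2", "class_3"]  # placeholder names
SCALE, ZERO_POINT = 1 / 256, 0  # assumed quantization parameters

label, prob = decode_uint8_output([6, 230, 14, 6], SCALE, ZERO_POINT, LABELS)
print(label, round(prob, 3))  # class_1 0.898
```

If only the winning class matters, the dequantization step can be skipped, since the largest uint8 value always corresponds to the largest probability.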
In the next articles, I shall port my benchmark application from Windows to ARM/Linux and RISC-V to see how it performs.
— — — — — Do not cross this line if you are not interested in details — — — — —
TensorFlow Lite for micro-controllers is a TensorFlow flavor designed for embedded devices which do not have the resources to run an operating system such as Windows or Linux, and have very limited memory (RAM in the MB range, not in the GB range).
With TFlite for micro-controllers, deep learning models can be deployed anywhere, especially on the edge, where the interface with the physical world happens. This is called ‘TinyML’.
For details, see one of my experiments with TFlite for micro-controller.
However, just because one has a deep learning hammer does not mean the whole world should look like a nail.