Multicore hardware accelerators

Since current solutions based on standard GPU cores do not provide satisfactory results in terms of power and memory consumption, especially in the field of consumer electronics, a growing number of companies are starting to develop their own custom power-optimized machine learning hardware accelerators.

To support running neural network models on such low-power 8-bit machine learning accelerators, SYRMIA's team of machine learning experts implemented a tool for full quantization of floating-point models to 8-bit integer arithmetic, covering both model parameters and layer activations. The tool supports state-of-the-art quantization techniques such as asymmetric quantization of layer activations and per-axis quantization of convolution weights. We also implemented various model optimization techniques, such as removing constant operators and merging bias add, ReLU activation, and batch normalization into 2D convolution operators.
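As an illustration of the two quantization schemes named above, here is a minimal NumPy sketch. It is not the tool's actual code: the function names, the unsigned 8-bit range for activations, and the choice of axis 0 as the output-channel axis of the weights are assumptions made for this example.

```python
import numpy as np


def quantize_asymmetric(x):
    """Per-tensor asymmetric quantization of activations to uint8 (assumed range 0..255)."""
    qmin, qmax = 0, 255
    # Extend the range to include zero so that 0.0 is exactly representable.
    x_min = min(float(np.min(x)), 0.0)
    x_max = max(float(np.max(x)), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:  # all-zero tensor: any positive scale works
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point


def quantize_per_axis_symmetric(w, axis=0):
    """Symmetric int8 quantization with one scale per slice along `axis`."""
    qmax = 127
    # Reduce over every axis except the per-channel one.
    reduce_axes = tuple(i for i in range(w.ndim) if i != axis)
    max_abs = np.max(np.abs(w), axis=reduce_axes, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize(q, scale, zero_point=0):
    """Map integer values back to approximate real values."""
    return (q.astype(np.float32) - zero_point) * scale
```

Per-axis scales let each output channel use the full int8 range even when channel magnitudes differ widely, which is why this scheme is commonly applied to convolution weights.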

The team has created a custom library of ML operators in floating-point and 8-bit integer arithmetic, written in Python on top of the NumPy library and supporting all TFLite operators. The model quantization tool and the custom operator library have been verified against the TensorFlow reference on a number of models, including MNIST CNN, VGG16, ResNet-50, FCN, YOLO, SSD, and SqueezeNet.
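To give a feel for what a floating-point and 8-bit integer operator pair in such a library might look like, the sketch below implements an int8 matrix multiplication in NumPy. All names are hypothetical, and the floating-point rescaling multiplier is a simplification chosen for readability; integer-only hardware would typically use a fixed-point multiplier and shift instead.

```python
import numpy as np


def choose_qparams(x):
    """Pick an int8 scale and zero point covering the range of x (zero included)."""
    lo = min(float(np.min(x)), 0.0)
    hi = max(float(np.max(x)), 0.0)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = int(np.clip(round(-128 - lo / scale), -128, 127))
    return scale, zero_point


def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)


def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale


def matmul_int8(a_q, a_scale, a_zp, b_q, b_scale, b_zp, out_scale, out_zp):
    """int8 x int8 matrix multiplication with int32 accumulation."""
    # Remove zero points and accumulate in int32 to avoid overflow.
    acc = (a_q.astype(np.int32) - a_zp) @ (b_q.astype(np.int32) - b_zp)
    # Rescale the accumulator into the output quantization domain.
    multiplier = a_scale * b_scale / out_scale
    out = np.round(acc * multiplier) + out_zp
    return np.clip(out, -128, 127).astype(np.int8)
```

The integer result can then be dequantized and compared with the floating-point product `a @ b`; the two should agree up to the accumulated quantization error of the inputs and the output.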

Apart from this, the team has implemented a number of hand-optimized ML operators using a custom instruction set based on 8-bit integer arithmetic, including:

  • Matrix multiplication
  • 2D Convolution
  • 2D Max pooling and Average pooling
  • ReLU activation
  • Elementwise addition, subtraction and multiplication
  • Tensor concatenation

To ensure the correctness of all implemented operators, they have been thoroughly tested using a Python test harness.
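A test harness of this kind can be sketched as follows. This is an illustrative example only, not the harness used in the project: the operator implementations, the NHWC layout, and the fixed scale and zero point are assumptions. It relies on the fact that ReLU and max pooling are monotonic, so when input and output share quantization parameters the int8 result should match the floating-point reference exactly once both are expressed in the same domain.

```python
import numpy as np

SCALE, ZERO_POINT = 0.05, -3  # assumed quantization parameters for the tests


def quantize(x):
    return np.clip(np.round(x / SCALE) + ZERO_POINT, -128, 127).astype(np.int8)


def dequantize(q):
    return (q.astype(np.float32) - ZERO_POINT) * np.float32(SCALE)


def relu_float(x):
    return np.maximum(x, 0.0)


def relu_int8(q):
    # In the quantized domain, real zero is represented by the zero point.
    return np.maximum(q, np.int8(ZERO_POINT))


def max_pool_2x2(x):
    """2x2 max pooling, stride 2, NHWC layout; works for float and int8 inputs."""
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))


# (name, int8 operator under test, floating-point reference, input shape)
CASES = [
    ("relu", relu_int8, relu_float, (2, 8, 8, 3)),
    ("max_pool_2x2", max_pool_2x2, max_pool_2x2, (2, 8, 8, 3)),
]


def run_all(seed=0):
    """Run every case on random input and return the number of cases passed."""
    rng = np.random.default_rng(seed)
    for name, int8_op, float_op, shape in CASES:
        x = rng.uniform(-4.0, 4.0, size=shape).astype(np.float32)
        q = quantize(x)
        expected = float_op(dequantize(q))  # reference on the quantized input
        actual = dequantize(int8_op(q))     # operator under test
        np.testing.assert_array_equal(actual, expected, err_msg=name)
    return len(CASES)
```

Operators that rescale their inputs, such as elementwise addition or convolution, would need a tolerance-based comparison instead of exact equality.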

Together with our customer, we have developed a cycle-exact simulator of their custom 8-bit machine learning accelerator chip. We have successfully run VGG16, ResNet-50, SqueezeNet, GoogLeNet and FCN models in the simulator using the implemented low-level operators.

Machine learning frameworks

SYRMIA's Machine Learning team also has experience in using and modifying leading machine learning frameworks such as TensorFlow, PyTorch and ONNX to meet various customer needs.

We have extended the TensorFlow library by implementing various ML operators and their corresponding gradients in floating-point, bfloat16 and 16-bit fixed-point arithmetic. The implemented operators were integrated into the TensorFlow library by adding a new virtual device for each arithmetic type.
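The reduced-precision number formats mentioned above can be emulated in NumPy, which is a convenient way to reason about their rounding behaviour. The sketch below is an illustration under stated assumptions, not the TensorFlow extension itself: the function names are hypothetical, NaN inputs are not handled specially, and the fixed-point format with 8 fractional bits is an arbitrary choice, since the actual format is not specified here.

```python
import numpy as np


def to_bfloat16(x):
    """Round float32 values to bfloat16 precision (result stored as float32)."""
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    # Round to nearest, ties to even, on the 16 bits that will be dropped.
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    rounded = (bits + rounding_bias) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)


def to_fixed16(x, frac_bits=8):
    """Convert real values to 16-bit fixed point with `frac_bits` fractional bits."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * (1 << frac_bits))
    return np.clip(scaled, -32768, 32767).astype(np.int16)


def from_fixed16(q, frac_bits=8):
    return q.astype(np.float32) / (1 << frac_bits)


def fixed16_mul(a, b, frac_bits=8):
    """Multiply two fixed-point arrays with a 32-bit intermediate and saturation."""
    prod = a.astype(np.int32) * b.astype(np.int32)
    # Add half of the dropped range before shifting to round instead of truncate.
    prod = (prod + (1 << (frac_bits - 1))) >> frac_bits
    return np.clip(prod, -32768, 32767).astype(np.int16)
```

bfloat16 keeps the 8-bit exponent of float32 and only shortens the mantissa, so it preserves dynamic range at the cost of precision, whereas fixed point offers uniform precision over a limited range.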

All operators were optimized for Intel-compatible CPUs and parallelized using the SSE2 instruction set. Matrix multiplication and 2D convolution operators and their gradients were also optimized for NVIDIA GPUs using CUDA. The team has adapted several neural networks to work with the new operators (ResNet, LSTM, seq2seq, NMT, Word2vec). The networks were tested in both inference and training mode. We have also performed experiments with the TFLite library and measured the accuracy of symmetric and asymmetric quantization for various neural networks.
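The difference between symmetric and asymmetric quantization can be illustrated at the level of a single tensor. The sketch below is only a toy comparison with hypothetical function names; it measures reconstruction error on one array, not the end-to-end model accuracy evaluated in the experiments described above.

```python
import numpy as np


def fake_quant_symmetric(x):
    """Quantize to int8 with zero point fixed at 0, then dequantize."""
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


def fake_quant_asymmetric(x):
    """Quantize to uint8 with a zero point fitted to the data range, then dequantize."""
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = max(hi - lo, 1e-12) / 255.0
    zero_point = np.round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255)
    return (q - zero_point) * scale


def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
```

For one-sided data such as post-ReLU activations, symmetric quantization spends half of its codes on negative values that never occur, so the asymmetric variant would be expected to show a lower error on such tensors.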