Our Technology

Talkbox comprises two components.

First is a cloud pipeline that trains deep neural networks and optimizes them for on-device execution. It uses techniques such as quantization and neural network pruning (NN-pruning) to reduce both the memory footprint and the execution time of each network, and it verifies that these optimizations have a negligible impact on accuracy.

Second is a runtime that executes these optimized networks, integrating tightly with the device's GPU to maximize performance. We measure performance in terms of power consumption (battery draw), CPU utilization, and memory use. By using these resources efficiently, the runtime keeps the deep neural network available on-device at all times, even when the device is offline.

Quantization.

Quantization means storing a network's weights at lower numerical precision wherever this does not hurt results. We use TensorFlow's quantization tooling to store the network's internal weights as int16 or int32 values instead of double- or single-precision floats. Our pipeline then compares the quantized network against the original to verify that its accuracy, or more generally its objective function, is only negligibly affected.
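To make the idea concrete, here is a minimal sketch of symmetric per-tensor quantization of float weights to int16, written in plain Python rather than with TensorFlow. The scaling scheme and the function names `quantize_int16` and `dequantize` are illustrative assumptions, not TensorFlow's API or Talkbox's exact method. The sketch also performs the kind of before/after check described above, by measuring the worst-case error after mapping the integers back to floats.

```python
from array import array

def quantize_int16(weights):
    """Hypothetical helper: symmetric per-tensor quantization to int16.

    Returns the int16 values plus the scale needed to map them back to floats.
    """
    max_abs = max(abs(w) for w in weights)
    # 32767 is the largest int16 value; guard against an all-zero tensor.
    scale = max_abs / 32767 if max_abs > 0 else 1.0
    q = array("h", (round(w / scale) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Map int16 values back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.82, -1.37, 0.004, 2.5, -0.61]
q, scale = quantize_int16(weights)
restored = dequantize(q, scale)

# Worst-case rounding error introduced by quantization (about scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))

doubles = array("d", weights)
print(doubles.itemsize * len(doubles), q.itemsize * len(q))  # 40 10
print(max_err)
```

Storing the weights as int16 instead of 64-bit doubles makes them four times smaller, while each weight moves by at most about half a quantization step. A real pipeline would go further and compare the model's accuracy on validation data, not just the raw weight error.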


Hardware optimizations.

Our runtime uses whatever GPU acceleration the device offers to run the neural network efficiently. For example, Qualcomm's Adreno GPUs, which are part of Snapdragon processors, expose this acceleration through Qualcomm's neural network library, which our runtime calls at execution time.

NN-pruning.

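NN-pruning removes parts of a neural network that contribute little to its output, from individual weights up to larger chunks such as whole neurons or channels. It is a well-known technique that can significantly reduce a network's size and improve its efficiency, because trained networks often contain many such low-impact parameters.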
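A minimal sketch of one common pruning criterion, magnitude pruning, follows: the weights with the smallest absolute values are set to zero. The document does not say which criterion Talkbox uses, so the magnitude rule, the `sparsity` parameter, and the helper name `magnitude_prune` are all assumptions for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Hypothetical helper: zero out the `sparsity` fraction of weights
    with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.9, -0.02, 0.4, 0.001, -1.3, 0.05, -0.3, 0.7]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.4, 0.0, -1.3, 0.0, 0.0, 0.7]

# A sparse format only needs to store the surviving (index, value) pairs.
nonzero = {i: w for i, w in enumerate(pruned) if w != 0.0}
print(len(nonzero))  # 4
```

The size saving comes from storing only the surviving weights. In practice, pruning is usually followed by a short fine-tuning pass so the remaining weights can compensate for the removed ones.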
