posted by Amitabh Yadav

In-datacenter performance analysis of a TPU

Norman P. Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3079856.3080246

Synopsis

The paper analyzes the Tensor Processing Unit (TPU), Google's domain-specific accelerator for neural network inference. A few background questions first:

What is so special about the TPU compared to its contemporary GPUs and CPUs deployed in datacenters for production Neural Network (NN) applications?

What is meant by the 99th percentile response time requirement for neural networks?

It means that 99% of requests to a neural network complete within some target response time *t* (in ms or ns), while the slowest 1% take at least *t*. Datacenter inference is judged by this tail latency, not by the average response time.

Why percentiles?

A percentile gives a better sense of real performance than an average because it shows a slice of the response-time curve. For this reason, percentiles are well suited to automatic baselining and behavioral learning. They also help focus optimization work. For example, if something in an application is generally too slow, concentrate on bringing down the 90th percentile; that ensures the overall response time of the application goes down. As a second example, if the 50th percentile (the median) moves from 500 ms to 600 ms, we know that half of our transactions now take at least 20% longer, which clearly needs attention.

How to compute the 99th percentile response time?

Log all the requests and their response times. Sort the times in ascending order. The 95th percentile is then the value at the 95% position of that sorted list: the smallest recorded time that is at least as large as 95% of all recorded times (the nearest-rank method). The 99th percentile is read off at the 99% position in the same way.
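A minimal sketch of the nearest-rank method in Python; the function name and the sample request log are hypothetical, not from the paper:

```python
import math

def percentile_response_time(response_times_ms, p):
    """Nearest-rank percentile: the smallest recorded time that is
    greater than or equal to p% of all recorded times."""
    times = sorted(response_times_ms)
    rank = math.ceil(p / 100 * len(times))   # 1-based rank into the sorted list
    return times[rank - 1]

# Hypothetical request log (milliseconds).
log = [12, 15, 11, 250, 14, 13, 16, 12, 980, 15]
print(percentile_response_time(log, 50))   # -> 14 (median)
print(percentile_response_time(log, 99))   # -> 980 (tail latency)
```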

Difference between throughput and latency?

Throughput is the number of requests (or data packets) successfully completed per second; latency is the time an individual request takes to complete. The two can pull in opposite directions: batching requests raises throughput but makes each request wait longer, which is exactly the tension a 99th-percentile deadline constrains, as the sketch below illustrates.
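An illustrative sketch of the batching trade-off; the overhead and per-item costs are made-up constants, not measurements from the paper:

```python
# Illustrative only: how batching trades latency for throughput.
FIXED_OVERHEAD_S = 0.005   # hypothetical per-batch setup cost
PER_ITEM_S = 0.001         # hypothetical per-request compute cost

for batch_size in (1, 8, 64):
    batch_time = FIXED_OVERHEAD_S + PER_ITEM_S * batch_size
    throughput = batch_size / batch_time   # requests per second
    latency = batch_time                   # each request waits for the whole batch
    print(f"batch={batch_size:3d}  throughput={throughput:7.1f} req/s"
          f"  latency={latency * 1000:5.1f} ms")
```

Larger batches amortize the fixed overhead (throughput rises) but every request in the batch now waits for the whole batch to finish (latency rises too).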

Remember that DNNs are applicable to a wide range of problems, so we can reuse a DNN-specific ASIC for solutions in speech, vision, language, translation, search ranking, and many more.

Popular neural networks in production today are:

  1. Multi-Layer Perceptrons (MLP) (61% of Google's datacenter NN workload)
  2. Recurrent Neural Networks (RNN) (29% of datacenter workload)
  3. Convolutional Neural Networks (CNN) (just 5% of datacenter workload, despite dominating the architecture literature!)

A quick introduction to Neural Networks

A neuron in a neural network is the unit that computes a weighted sum of its inputs and then applies a non-linear activation function (such as max(0, value), i.e. ReLU). Many neurons are arranged in layers, where the output of one layer becomes the input of the next; this forms a neural network. When many such layers are used to process large data sets, it becomes a deep neural network (DNN). Using extra and larger layers helps capture higher-level patterns or concepts in the data, but it is compute intensive.
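A minimal sketch of this in NumPy; the layer sizes and random values are illustrative, not from the paper:

```python
import numpy as np

def layer(x, W, b):
    """One NN layer: weighted sum of inputs, then the ReLU nonlinearity."""
    return np.maximum(0.0, W @ x + b)   # max(0, value), applied elementwise

# Hypothetical 2-layer network: 4 inputs -> 3 hidden units -> 2 outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

hidden = layer(x, W1, b1)      # output of layer 1 becomes input of layer 2
output = layer(hidden, W2, b2)
print(output)
```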

The two phases of NN are called training (or learning) and inference (or prediction). The developer chooses the number of layers and the type of NN, and training determines the weights. Virtually all training today is done in floating point, which is one reason GPUs have been so popular. A step called quantization transforms the floating-point numbers into narrow integers, often just 8 bits, which are usually good enough for inference.

Why Quantization? 8-bit integer multiplication can consume 6x less energy and take 6x less area than IEEE 754 16-bit floating-point multiplication, and the advantage for integer addition is 13x in energy and 38x in area.
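A minimal sketch of quantizing weights to 8-bit integers, assuming a simple symmetric single-scale scheme; the paper does not spell out its exact quantization recipe, so the functions here are illustrative:

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights onto int8 using one symmetric scale factor."""
    scale = np.max(np.abs(weights)) / 127.0   # largest magnitude maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal(5).astype(np.float32)
q, scale = quantize_int8(w)
print(w)
print(dequantize(q, scale))   # close to w, within quantization error
```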

TPU: Origin, Architecture and Implementation

Strengths

It sheds light on domain-specific accelerators by providing production data on the Tensor Processing Unit (TPU) in a real datacenter use case.

Weaknesses

Thoughts

Takeaways

Favourite bits

Suggested Reading (optional)

Case study (optional)

References



Last Updated: March 21, 2023