Tensor Processing Unit: Architecture, Working & Its Applications
Graphics processing units (GPUs) are much faster than central processing units (CPUs) for machine learning workloads. In 2016, Google introduced a new chip built specifically to accelerate machine learning tensor operations: the TPU, or Tensor Processing Unit. TPUs have been deployed across Google’s data centers to power applications such as Google Search, Google Photos, Google Translate, and Maps. Google’s tensor processing units are available in two forms: the Edge TPU and the Cloud TPU. The Edge TPU is offered as a purpose-built development kit for specific applications, whereas Cloud TPUs can be accessed from Google’s Colab notebooks, which give users access to TPU pods located in Google’s data centers. This article gives a brief overview of the TPU, or tensor processing unit.
What is Tensor Processing Unit?
A tensor processing unit is a special-purpose accelerator, or processing IC, used to speed up machine learning workloads. The chip was designed by Google to handle neural network processing with the TensorFlow software. Tensor processing units are ASICs used mainly to accelerate particular machine learning workloads, with processing elements that resemble small DSPs with local memory. TensorFlow is an open-source machine learning platform used for object detection, image classification, speech recognition, language modeling, and so on.
Tensor Processing Unit Architecture
A tensor processing unit is an application-specific integrated circuit (ASIC), a chip whose hardware circuits are built for one class of task. Its instruction set follows the CISC (Complex Instruction Set Computer) style: it executes high-level instructions that carry out the complex tasks required by deep neural networks. The heart of the tensor processing unit architecture is its systolic array, which is optimized for matrix operations. The tensor processing unit architecture is shown below.
Like a GPU or CPU, the tensor processing unit is programmable. Rather than being designed for just a single neural network model, it executes CISC instructions on several kinds of networks, such as LSTM models, convolutional models, and large fully connected models. It is therefore still programmable, although it uses a matrix as its primitive in place of a scalar or vector.
The tensor processing unit architecture includes three computational resources: the MXU, the UB, and the AU.
 The MXU is the Matrix Multiplier Unit, which includes 65,536 8-bit multiply-and-add units used for matrix operations.
 The UB is the Unified Buffer, which includes 24 MiB of SRAM that functions as registers.
 The AU is the Activation Unit, which implements hardwired activation functions.
In a tensor processing unit, five main high-level instructions control how these resources operate. They are described below.
 Read_Host_Memory reads data from the host memory.
 Read_Weights reads weights from the weight memory.
 MatrixMultiply/Convolve multiplies or convolves the data with the weights and accumulates the results.
 Activate applies activation functions.
 Write_Host_Memory writes results to the host memory.
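The way these five instructions cooperate on one neural network layer can be sketched with a toy model. This is only an illustration: the class `ToyTPU`, its method names, and the helper `run_layer` are hypothetical and do not reflect Google's actual instruction encoding or hardware behavior.

```python
# Toy model of the five high-level instructions described above.
# All names here are hypothetical; this is not the real TPU instruction set.

class ToyTPU:
    def __init__(self, host_memory, weight_memory):
        self.host_memory = host_memory      # stands in for the CPU host memory
        self.weight_memory = weight_memory  # stands in for the off-chip weight memory
        self.unified_buffer = None          # on-chip buffer for inputs and results
        self.weights = None                 # weights currently held by the matrix unit
        self.accumulators = None            # output of the matrix unit

    def read_host_memory(self, key):
        # Read_Host_Memory: copy input data from host memory into the unified buffer
        self.unified_buffer = self.host_memory[key]

    def read_weights(self, key):
        # Read_Weights: load weights from weight memory into the matrix unit
        self.weights = self.weight_memory[key]

    def matrix_multiply(self):
        # MatrixMultiply: multiply the buffered inputs by the loaded weights
        x, w = self.unified_buffer, self.weights
        self.accumulators = [
            [sum(x[i][k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for i in range(len(x))
        ]

    def activate(self, fn):
        # Activate: apply the activation function and store the result in the buffer
        self.unified_buffer = [[fn(v) for v in row] for row in self.accumulators]

    def write_host_memory(self, key):
        # Write_Host_Memory: copy results from the unified buffer back to host memory
        self.host_memory[key] = self.unified_buffer


def relu(v):
    return max(0, v)


def run_layer(inputs, weights):
    # One layer expressed as the five instructions, in order
    tpu = ToyTPU({"in": inputs}, {"layer0": weights})
    tpu.read_host_memory("in")
    tpu.read_weights("layer0")
    tpu.matrix_multiply()
    tpu.activate(relu)
    tpu.write_host_memory("out")
    return tpu.host_memory["out"]


print(run_layer([[1, 2]], [[1, -1], [1, -1]]))  # [[3, 0]]
```

The point of the sketch is the ordering: data and weights are read first, the matrix multiply produces accumulated sums, the activation is applied, and only then are results written back to the host.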
The instructions of the tensor processing unit are sent from the host over the PCIe (Peripheral Component Interconnect Express) Gen3 x16 bus into an instruction buffer. The internal blocks in this architecture are typically connected by 256-byte-wide paths.
In the upper-right corner of the architecture, the matrix multiply unit (MMU) is the heart of the tensor processing unit. It contains 256×256 MACs that can execute 8-bit multiply-and-adds on signed or unsigned integers.
Below the matrix unit, the 16-bit products are collected in 4 MiB of 32-bit accumulators. The 4 MiB corresponds to 4,096 accumulators, each holding 256 32-bit elements. The matrix unit produces one 256-element partial sum per cycle.
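The sizes quoted for the matrix unit and its accumulators can be checked with a few lines of arithmetic. The sketch below uses only the figures from the text; the per-cycle operation count assumes that each multiply and each add is counted as a separate operation, which is an assumption about the counting convention rather than a stated fact.

```python
# Sizing arithmetic for the matrix unit and its accumulators,
# using only the figures quoted in the text.

MXU_ROWS = 256
MXU_COLS = 256
ACCUMULATOR_COUNT = 4096        # number of 256-element accumulators
ELEMENTS_PER_ACCUMULATOR = 256
BYTES_PER_ELEMENT = 4           # each element is 32 bits wide

mac_units = MXU_ROWS * MXU_COLS
accumulator_bytes = ACCUMULATOR_COUNT * ELEMENTS_PER_ACCUMULATOR * BYTES_PER_ELEMENT
accumulator_mib = accumulator_bytes / (1024 * 1024)

# Assumption: one multiply plus one add per MAC per cycle, counted separately.
ops_per_cycle = 2 * mac_units

# Largest possible sum of 256 products of two unsigned 8-bit values.
# It fits comfortably in 32 bits, which is why 16-bit products need wider accumulators.
worst_case_sum = 255 * 255 * 256

print(mac_units)                 # 65536
print(accumulator_mib)           # 4.0
print(ops_per_cycle)             # 131072
print(worst_case_sum < 2 ** 32)  # True
```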
The weights for the matrix unit are staged through an on-chip Weight FIFO, which reads from an off-chip 8 GiB DRAM known as weight memory. For inference, weights are read-only, and 8 GiB is enough to support several active models simultaneously. The Weight FIFO, or weight fetcher, is four tiles deep. The intermediate results are held in the on-chip 24 MiB Unified Buffer, which can provide inputs to the Matrix Unit.
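The behavior of a four-tile-deep weight queue can be illustrated with a small sketch. The class `WeightFifo` and its methods are hypothetical, and the tile size is an assumption (one full 256×256 set of 8-bit weights); the text itself only states that the FIFO is four tiles deep.

```python
from collections import deque

TILE_BYTES = 256 * 256 * 1   # assumption: one tile = 256 x 256 weights of 8 bits each
FIFO_DEPTH = 4               # "four tiles deep" in the text


class WeightFifo:
    """Hypothetical model of a bounded first-in, first-out queue of weight tiles."""

    def __init__(self, depth=FIFO_DEPTH):
        self.depth = depth
        self.tiles = deque()

    def can_prefetch(self):
        return len(self.tiles) < self.depth

    def prefetch(self, tile_id):
        # Stage a tile from weight memory; refuse when the FIFO is already full
        if not self.can_prefetch():
            return False
        self.tiles.append(tile_id)
        return True

    def pop_for_matrix_unit(self):
        # The oldest staged tile is handed to the matrix unit first
        return self.tiles.popleft()


fifo = WeightFifo()
accepted = [fifo.prefetch(t) for t in range(6)]  # try to stage six tiles
first = fifo.pop_for_matrix_unit()
fifo_bytes = FIFO_DEPTH * TILE_BYTES

print(accepted)    # [True, True, True, True, False, False]
print(first)       # 0
print(fifo_bytes)  # 262144
```

Staging tiles ahead of time lets the next set of weights be fetched from the slower off-chip memory while the matrix unit is still working with the current one.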
A programmable DMA controller transfers data between the CPU’s host memory and the Unified Buffer. To allow dependable deployment at Google scale, the internal and external memory include built-in error detection and correction hardware.
The floor plan of the tensor processing unit die is shown below. The 24 MiB UB (Unified Buffer) takes up nearly a third of the die and the MMU (Matrix Multiply Unit) a quarter, so the data path accounts for almost two-thirds of the die. The 24 MiB size was chosen partly to match the pitch of the Matrix Unit on the die and, given the short development schedule, partly to simplify the compiler.
How Does Tensor Processing Unit Work?
The Tensor Processing Unit is an ASIC built specifically for machine learning and customized for TensorFlow. It handles the huge numbers of multiplications and additions required by neural networks (NNs) at high speed while limiting the floor space and power it consumes. A tensor processing unit mainly executes three steps.
 In the first step, the parameters are loaded from memory into the matrix of adders and multipliers.
 In the second step, the data is loaded from memory.
 In the third step, as each multiplication is executed, its result is passed on to the next multipliers while the summation is performed at the same time. The output is then the sum of all the products of the data and the parameters.
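The three steps above can be sketched in plain Python. This is a simplified dataflow model, not a cycle-accurate simulation: it keeps the weights fixed in a grid and lets a partial sum travel from cell to cell, but it ignores the staggered timing of a real systolic array. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(inputs, weights):
    """Multiply inputs (M x K) by weights (K x N) following the three steps."""
    k_dim, n_dim = len(weights), len(weights[0])

    # Step 1: load the parameters into the grid; cell (k, j) holds weights[k][j].
    grid = [[weights[k][j] for j in range(n_dim)] for k in range(k_dim)]

    outputs = []
    # Step 2: stream the data in, one input row at a time.
    for row in inputs:
        out_row = []
        for j in range(n_dim):
            partial_sum = 0
            # Step 3: each cell multiplies its input by its stored weight, adds the
            # partial sum handed over by the previous cell, and passes it on.
            for k in range(k_dim):
                partial_sum = partial_sum + row[k] * grid[k][j]
            out_row.append(partial_sum)
        outputs.append(out_row)
    return outputs


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)  # [[19, 22], [43, 50]]
```

Because the partial sums are handed directly from one cell to the next, no intermediate result has to be written to memory during the multiplication, which is the main saving of this arrangement.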
A typical Cloud TPU includes two systolic arrays of 128 × 128 each, for a total of 32,768 ALUs operating on 16-bit floating-point values within a single processor. Thousands of adders and multipliers are connected directly to each other to make a large physical matrix of operators, which forms the systolic array architecture.
The tensor processing unit is designed to tolerate reduced computational precision, which means it needs fewer transistors for each operation. Because of this, a single chip can handle considerably more operations per second.
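One common way to work with reduced precision is to quantize floating-point values to 8-bit integers. The sketch below shows a simple symmetric scheme with a single shared scale; it is a generic illustration, not necessarily the scheme used by TPU software, and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric scheme)."""
    q_max = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    largest = max(abs(v) for v in values)
    scale = largest / q_max if largest else 1.0  # avoid dividing by zero
    quantized = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    # Recover approximate floating-point values from the integers
    return [q * scale for q in quantized]


weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)  # [51, -127, 32, 0]
```

Each restored value differs from the original by at most half of the scale, which is the accuracy given up in exchange for arithmetic units that are much smaller than floating-point ones.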
Because tensor processing units are custom-built for specific operations, such as matrix multiplication and accelerating training, they are not well suited to other types of workloads.
TPU Vs GPU
The differences between a TPU and a GPU are listed below.
TPU | GPU
The term TPU stands for “Tensor Processing Unit”. | The term GPU stands for “Graphics Processing Unit”.
A TPU is an application-specific integrated circuit, or ASIC. | A GPU is a specialized electronic circuit.
TPUs can work faster than GPUs on the workloads they target while using fewer resources. | GPUs can break complex problems into separate tasks and work on them all at once.
According to Google, a single Cloud TPU chip includes two cores, which use MXUs to speed up programs dominated by dense matrix calculations. | A GPU includes hundreds to thousands of cores, in which the calculations are carried out, so the performance of a GPU depends mainly on the number of cores it has.
An Edge TPU performs 4 trillion operations per second using just 2 W of power. | A GPU can consume up to 350 W of power.
TPU Vs CPU
The differences between a TPU and a CPU are listed below.
TPU | CPU
A TPU is an optimized accelerator. | A CPU is a general-purpose processor.
TPUs are three times faster than CPUs. | CPUs are slower than TPUs.
TPUs are very efficient processors for running projects within the TensorFlow framework. | A CPU handles all of the computer’s input/output and logic calculations.
A powerful TPU speeds up algorithms and calculations for machine learning, artificial intelligence, and deep learning. | A powerful CPU ensures the smooth execution of programs and tasks.
Advantages
The advantages of the tensor processing unit include the following.
 The tensor processing unit offers many benefits, such as increased computation speed and efficiency.
 The TPU improves the performance of linear algebra computation, which is used most frequently in machine learning applications.
 It reduces the time to accuracy when training large and complex machine learning models.
 It allows you to scale computations across numerous machines without writing additional code.
 Present versions of the TPU can train models within hours.
 The power consumption per operation is extremely low, which makes the tensor processing unit extremely efficient for its use cases.
 It has a large memory capacity, which supports larger inputs than GPUs can handle.
Disadvantages
The disadvantages of the tensor processing units include the following.
 TPUs are very expensive as compared to CPUs & GPUs.
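 Framework support is limited: the TPU was built around TensorFlow, and other frameworks such as JAX and PyTorch reach it only through the XLA compiler.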
 Some TensorFlow operations are not supported, such as custom operations written in C++.
 The TPU does not support as wide a range of operations as the GPU.
 There are no substitutes for Google’s tensor processing unit.
 TPU calculations do not exactly match those of a GPU or CPU.
 TPUs are compatible only with Linux; the Edge TPU is compatible with a particular Debian-derivative OS.
Tensor Processing Unit Applications
The tensor processing unit applications are discussed below.
 Google’s ASICs, such as tensor processing units, are used for machine learning.
 They are used particularly in deep learning to handle complex matrix and vector operations.
 TPUs solve matrix and vector operations efficiently at ultra-high speeds; however, they must be paired with a CPU that supplies and executes instructions.
 They are used in different deep learning applications such as computer vision, e-commerce, fraud detection, self-driving cars, natural language processing, agriculture, voice AI, stock trading, virtual assistants, and a variety of social predictions.
Thus, this is an overview of the tensor processing unit, or TPU. Compared to a CPU or GPU, the TPU performs better on these workloads, and it can handle up to 128,000 operations per cycle. Here is a question for you: what is a GPU?