How analog in-memory computing can solve power challenges of edge AI inference
Machine learning and deep learning already are integral parts of our lives. Artificial Intelligence (AI) applications via Natural Language Processing (NLP), image classification and object detection are deeply embedded in many of the devices we use. Most AI applications are served via cloud-based engines that work well for what they are used for, such as getting word predictions when typing an email response in Gmail.
As much as we enjoy the benefits of these AI applications, this approach introduces privacy, power dissipation, latency and cost challenges. These challenges can be solved if there is a local processing engine capable of performing partial or full computation (inference) at the origin of the data itself. This has been difficult to do with traditional digital neural network implementations, in which memory becomes a power-hungry bottleneck. The problem can be solved with multi-level memory and an analog in-memory compute method that, together, enable processing engines to meet the much lower milliwatt (mW) to microwatt (µW) power requirements for performing AI inference at the network’s edge.
Challenges of Cloud Computing
When AI applications are served via cloud-based engines, the user must upload some data (willingly or unwillingly) to the cloud, where compute engines process the data, generate predictions, and send those predictions back downstream for the user to consume.
Figure 1: Data Transfer from Edge to Cloud. (Source: Microchip Technology)
The challenges associated with this process are outlined below:
- Privacy and security concerns: With always-on, always-aware devices, there is a concern about personal data (and/or confidential information) getting misused, either during uploads or during its shelf life at data centers.
- Unnecessary power dissipation: If every data bit goes to the cloud, it consumes power in hardware, radios and transmission, and potentially in unwanted computations in the cloud.
- Latency for small-batch inferences: Sometimes it may take a second or more to get a response from a cloud-based system if the data is originating at the edge. For the human senses, anything more than 100 milliseconds (ms) of latency is noticeable and may be annoying.
- Data economy needs to make sense: Sensors are everywhere, and they are very affordable; however, they produce a lot of data. It is not economical to upload every bit of data to the cloud and get it processed.
To solve these challenges using a local processing engine, the neural network model that will perform the inference operations must first be trained with a given dataset for the desired use case. Generally, this requires high computing (and memory) resources and floating-point arithmetic operations. As a result, the training part of a machine learning solution still needs to be done on public or private clouds (or a local GPU, CPU or FPGA farm) with a dataset to generate an optimal neural network model. Once the neural network model is ready, it can be further optimized for local hardware with a small computing engine, because the model does not need back-propagation for inference. An inference engine generally needs a sea of Multiply-Accumulate (MAC) engines, followed by an activation layer such as rectified linear unit (ReLU), sigmoid or tanh, depending on the neural network model complexity, and a pooling layer in between layers.
The majority of the neural network models require a vast amount of MAC operations. For example, even a comparatively small ‘1.0 MobileNet-224’ model has 4.2 million parameters (weights) and requires 569 million MAC operations to perform an inference. Since most of the models are dominated by MAC operations, the focus here will be on this part of the machine learning computation – and exploring the opportunity for creating a better solution. A simple, fully connected two-layer network is illustrated below in Figure 2.
Figure 2: Fully Connected Neural Network with Two Layers. (Source: Microchip Technology)
The input neurons (data) are processed with the first layer of weights. The output neurons from the first layer are then processed with the second layer of weights and provide predictions (let’s say, whether the model was able to find a cat face in a given image). These neural network models use a dot product to compute every neuron in every layer, illustrated by the following equation (omitting the ‘bias’ term for simplicity):

$$y_j = \sum_{i} w_{ij}\, x_i$$

Here x_i is the i-th input to the layer, w_ij is the weight connecting input i to neuron j, and y_j is the value of neuron j before the activation function is applied.
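The dot-product computation described above can be sketched in a few lines of plain Python. This is only an illustration: the layer sizes and weight values are made up, the bias term is omitted as in the text, and no activation function is applied between the layers.

```python
def dense_layer(inputs, weights):
    """Return one value per neuron: the dot product of the inputs with that neuron's weights."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs))
            for neuron_weights in weights]

# Hypothetical values chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
layer1_weights = [[0.5, -1.0, 0.25],   # weights feeding hidden neuron 0
                  [1.0, 0.0, -0.5]]    # weights feeding hidden neuron 1
layer2_weights = [[2.0, -1.0]]         # weights feeding the single output neuron

hidden = dense_layer(inputs, layer1_weights)   # [-0.75, -0.5]
output = dense_layer(hidden, layer2_weights)   # [-1.0]

# Every weight contributes exactly one multiply-accumulate (MAC) operation.
mac_count = len(inputs) * len(layer1_weights) + len(hidden) * len(layer2_weights)  # 8
```

Even this toy network needs eight MAC operations for a single inference, which is why the MAC count grows so quickly for realistic models.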
Memory Bottleneck in Digital Computing
In a digital neural network implementation, the weights and input data are stored in a DRAM/SRAM. The weights and input data need to be moved to a MAC engine for inference. As per Figure 3 below, this approach results in most of the power being dissipated in fetching model parameters and input data to the ALU where the actual MAC operation takes place.
Figure 3: Memory Bottleneck in Machine Learning Computation. (Source: Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in ISCA, 2016.)
To put things in an energy perspective: a typical MAC operation using digital logic gates consumes ~250 femtojoules (fJ, or 10⁻¹⁵ joules) of energy, but the energy dissipated during data transfer is more than two orders of magnitude greater than that of the computation itself, in the range of 50 picojoules (pJ, or 10⁻¹² joules) to 100 pJ. To be fair, there are many design techniques available to minimize the data transfer from memory to ALU; however, the whole digital scheme is still limited by the von Neumann architecture, so this presents a large opportunity to reduce wasted power. What if the energy to perform a MAC operation could be reduced from ~100 pJ to a fraction of a pJ?
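To get a feel for these numbers, the short calculation below applies the per-MAC figures quoted above to the 569 million MAC operations of the ‘1.0 MobileNet-224’ example. It assumes, as a deliberate simplification, that every MAC pays the full data-transfer cost; real designs reuse data and will land somewhere below this.

```python
FEMTO = 1e-15
PICO = 1e-12

macs_per_inference = 569e6               # '1.0 MobileNet-224' figure from the text
compute_energy_per_mac = 250 * FEMTO     # joules, digital MAC
transfer_energy_low = 50 * PICO          # joules, lower bound for data transfer
transfer_energy_high = 100 * PICO        # joules, upper bound for data transfer

# Simplifying assumption: one full data transfer per MAC, no data reuse.
compute_energy = macs_per_inference * compute_energy_per_mac      # about 0.14 mJ
transfer_energy_min = macs_per_inference * transfer_energy_low    # about 28 mJ
transfer_energy_max = macs_per_inference * transfer_energy_high   # about 57 mJ

# Ratio of data-transfer energy to compute energy per MAC: roughly 200x to 400x.
ratio_low = transfer_energy_low / compute_energy_per_mac
ratio_high = transfer_energy_high / compute_energy_per_mac
```

Under this assumption the arithmetic itself costs a fraction of a millijoule per inference, while moving the operands costs tens of millijoules.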
Removing Memory Bottleneck with Analog In-Memory Computing
Performing inference operations at the edge becomes power efficient when the memory itself can be used to reduce the power required for computation. An in-memory compute method minimizes the amount of data that must be moved, which in turn removes most of the energy wasted during data transfer. Energy dissipation is further minimized by using flash cells that operate with ultra-low active power dissipation and almost no energy dissipation during standby mode.
An example of this approach is the memBrain™ technology from Silicon Storage Technology (SST), a Microchip Technology company. Based on SST’s SuperFlash® memory technology, the solution includes an in-memory computing architecture that enables computation to be done where the inference model’s weights are stored. This eliminates the memory bottleneck in MAC computation since there is no data movement for the weights – only input data needs to move from an input sensor such as camera or microphone to the memory array.
This memory concept is based on two fundamentals: (a) Analog electric current response from a transistor is based on its threshold voltage (Vt) and the input data, and (b) Kirchhoff’s current law, which states that the algebraic sum of currents in a network of conductors meeting at a point is zero.
It is also important to understand the fundamental non-volatile memory (NVM) bitcell that is used in this multi-level memory architecture. The diagram below (Figure 4) is a cross section of two ESF3 (Embedded SuperFlash 3rd generation) bitcells with shared Erase Gate (EG) and Source Line (SL). Each bitcell has five terminals: Control Gate (CG), Word Line (WL), Erase Gate (EG), Source Line (SL) and Bitline (BL). The erase operation is done by applying a high voltage on EG. The programming operation is done by applying high/low voltage bias signals on WL, CG, BL and SL. The read operation is done by applying low-voltage bias signals on WL, CG, BL and SL.
Figure 4: SuperFlash ESF3 Cell. (Source: Microchip Technology)
With this memory architecture, the user can program the memory bitcells at various Vt levels by fine-grained programming operation. The memory technology utilizes a smart algorithm to tune the floating-gate (FG) Vt of the memory cell to achieve certain electric current response from an input voltage. Depending on the requirement of the end application, the cells can be programmed in either linear or subthreshold operating region.
Figure 5 illustrates the capability of storing and reading multiple levels on the memory cell. Let’s say we are trying to store a 2-bit integer value in a memory cell. For this scenario, we need to program each cell in a memory array with one of four possible values of the 2-bit integer values (00, 01, 10, 11). The four curves below are an IV curve for each of the four possible states, and the electric current response from the cell would depend on the voltage applied on CG.
Figure 5: Programming Vt levels in ESF3 cell. (Source: Microchip Technology)
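The idea of storing one of four 2-bit states in a single cell can be illustrated with a small read-back sketch. The target currents below are hypothetical round numbers, not device data; the point is only that a measured current is mapped to the nearest programmed level.

```python
# Hypothetical target read currents (in microamps) for the four 2-bit states.
# Real values depend on the cell and bias conditions and are not given in the text.
TARGET_CURRENT_UA = {0b00: 0.0, 0b01: 1.0, 0b10: 2.0, 0b11: 3.0}

def decode_level(read_current_ua):
    """Return the 2-bit state whose target current is closest to the measured current."""
    return min(TARGET_CURRENT_UA,
               key=lambda state: abs(TARGET_CURRENT_UA[state] - read_current_ua))

# Slightly noisy readings still decode to the intended states.
decoded = [decode_level(current) for current in (0.1, 0.9, 2.2, 2.8)]  # [0, 1, 2, 3]
```

The further apart the levels are relative to the read noise, the more reliably each state can be distinguished, which is why fine-grained programming of Vt matters.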
Multiply-Accumulate Operation with in-memory computing
Each ESF3 cell can be modeled as a variable conductance (gm). The conductance of an ESF3 cell depends on the floating gate Vt of the programmed cell. A weight from a trained model is programmed as the floating gate Vt of the memory cell, so the gm of the cell represents a weight of the trained model. When an input voltage (Vin) is applied to the ESF3 cell, the output current (Iout) is given by the equation Iout = gm * Vin, which is the multiply operation between the input voltage and the weight stored on the ESF3 cell.
Figure 6 below illustrates the multiply-accumulate concept in a small array configuration (2×2 array), in which the accumulate operation is performed by adding the output currents of the multiply operations from the cells connected to the same column (for example, I1 = I11 + I21). Depending on the application, the activation function can either be performed within the ADC block or be done with a digital implementation outside the memory block.
Figure 6: multiply-accumulate operation with ESF3 array (2×2). (Source: Microchip Technology)
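The 2×2 multiply-accumulate can be written out numerically as a minimal sketch. It uses the idealized linear relation Iout = gm * Vin from the text; the conductance and voltage values are hypothetical, and units are chosen (microsiemens and volts) so that currents come out in microamps.

```python
def analog_mac(conductances, input_voltages):
    """Column currents of an array in which conductances[row][col] multiplies input_voltages[row].

    Each cell contributes I = gm * Vin (multiply); currents sharing a column
    add up, as Kirchhoff's current law dictates (accumulate).
    """
    n_rows = len(input_voltages)
    n_cols = len(conductances[0])
    return [sum(conductances[r][c] * input_voltages[r] for r in range(n_rows))
            for c in range(n_cols)]

# Hypothetical cell conductances in microsiemens (these stand in for the weights).
g = [[2.0, 1.0],
     [3.0, 4.0]]
v_in = [0.5, 1.0]   # input voltages in volts, one per row

i_out = analog_mac(g, v_in)   # [4.0, 4.5] microamps
# Column 1: I1 = I11 + I21 = 2.0 * 0.5 + 3.0 * 1.0 = 4.0
```

In the physical array no loop runs: all cell currents flow at once and sum on the shared column wire, which is where the parallelism comes from.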
To illustrate the concept at a higher level: individual weights from a trained model are programmed as the floating gate Vt of the memory cells, so all the weights from each layer of the trained model (let’s say a fully connected layer) can be programmed on a memory array that physically looks like a weight matrix, as illustrated in Figure 7.
Figure 7: Weight Matrix Memory Array for Inference. (Source: Microchip Technology)
For an inference operation, a digital input, let’s say image pixels, is first converted into an analog signal using a digital-to-analog converter (DAC) and applied to the memory array. The array then performs thousands of MAC operations in parallel for the given input vector and produces output that can go to the activation stage of respective neurons, which can then be converted back to digital signals using an analog-to-digital converter (ADC). The digital signals then are processed for pooling before going to the next layer.
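The signal chain described above (DAC, array MAC, activation, ADC, pooling) can be sketched as a sequence of small functions. Everything here is an assumption made for illustration: the bit widths, reference voltage, full-scale value and weights are hypothetical, ReLU and max pooling are picked as examples, and the signed numbers stand in for effective weights without modeling how negative weights are realized in hardware.

```python
def dac(codes, bits=8, v_ref=1.0):
    """Map integer input codes (e.g. pixel values) to voltages in [0, v_ref]."""
    full_scale = (1 << bits) - 1
    return [v_ref * code / full_scale for code in codes]

def array_mac(weights, voltages):
    """Idealized array: weights[row][col] * voltages[row], summed per column."""
    return [sum(weights[r][c] * voltages[r] for r in range(len(voltages)))
            for c in range(len(weights[0]))]

def relu(values):
    """Example activation stage."""
    return [max(0.0, v) for v in values]

def adc(values, bits=4, full_scale=1.0):
    """Quantize analog values to integer codes, clipping to the ADC range."""
    levels = (1 << bits) - 1
    return [min(levels, max(0, round(v / full_scale * levels))) for v in values]

def max_pool(codes, window=2):
    """Digital pooling over non-overlapping windows."""
    return [max(codes[i:i + window]) for i in range(0, len(codes), window)]

pixels = [0, 255, 255, 0]                 # hypothetical 8-bit inputs
weights = [[1.0, 1.0, 1.0, 1.0],          # hypothetical effective weights
           [0.5, -1.0, 0.25, 0.0],
           [0.25, 0.5, 0.0, -0.5],
           [1.0, 1.0, 1.0, 1.0]]

voltages = dac(pixels)                    # [0.0, 1.0, 1.0, 0.0]
sums = array_mac(weights, voltages)       # [0.75, -0.5, 0.25, -0.5]
codes = adc(relu(sums))                   # [11, 0, 4, 0]
pooled = max_pool(codes)                  # [11, 4]
```

The order of the stages follows the text; as noted earlier, the activation could equally be folded into the ADC block or done digitally after conversion.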
This type of memory architecture is very modular and flexible. Many memBrain tiles can be stitched together to build a variety of large models with a mix of weight matrices and neurons, as illustrated in Figure 8. In this example, a 3×4 tile configuration is stitched together with an analog and digital fabric between the tiles, and data can be moved from one tile to another via a shared bus.
Figure 8: memBrain™ is Modular. (Source: Microchip Technology)
So far we have primarily discussed the silicon implementation of this architecture. Alongside the silicon, a Software Development Kit (SDK) (Figure 9) facilitates deployment of the inference engine.
Figure 9: memBrain™ SDK Flow. (Source: Microchip Technology)
The SDK flow is training-framework agnostic. The user can create a neural network model in any of the available frameworks such as TensorFlow, PyTorch or others, using floating point computation as desired. Once a model is created, the SDK helps quantize the trained neural network model and map it to the memory array where the vector-matrix multiplication can be performed with the input vector coming from a sensor or computer.
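The text does not describe how the SDK quantizes a model, so the following is not the SDK's method. It is a generic uniform quantization sketch that maps floating-point weights to 4-bit integer levels, the kind of step any such flow needs before weights can be stored as discrete cell states. The function names and weight values are hypothetical.

```python
def quantize_weights(weights, bits=4):
    """Uniformly map float weights to integer levels 0 .. 2**bits - 1.

    Returns the integer codes plus the scale and offset needed to map back.
    """
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min

def dequantize(codes, scale, offset):
    """Recover approximate float weights from integer codes."""
    return [offset + code * scale for code in codes]

# Hypothetical trained weights from one layer.
weights = [-1.0, -0.5, 0.0, 0.3, 0.875]
codes, scale, offset = quantize_weights(weights)   # codes: [0, 4, 8, 10, 15]
restored = dequantize(codes, scale, offset)

# The rounding error of each weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With 16 levels spanning the weight range, each weight lands within half a step of its original value; whether that is accurate enough depends on the model and is typically checked by re-running validation data through the quantized network.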
Conclusion
Advantages of this multi-level memory approach with its in-memory compute capabilities include:
- Extremely low power: The technology is designed for low-power applications. The first power advantage comes from the fact that the solution uses in-memory computing, so energy is not wasted transferring data and weights from SRAM/DRAM during computation. The second advantage stems from the fact that the flash cells are operated in subthreshold mode with very low current values, so active power dissipation is very low. The third advantage is that there is almost no energy dissipation during standby mode, since the non-volatile memory cell doesn’t need any power to retain its data in an always-on device. The approach is also well suited to exploiting sparsity in weights and input data: the memory bitcell does not get activated if the input data or weight is zero.
- Lower package footprint: The technology uses a split-gate (1.5T) cell architecture, whereas an SRAM cell in a digital implementation is based on a 6T architecture. In addition, the cell is a much smaller bitcell than a 6T SRAM cell. Plus, one cell can store a whole 4-bit integer value, unlike SRAM, which needs 4*6 = 24 transistors to do so. This provides a substantially smaller on-chip footprint.
- Lower development cost: Because of memory performance bottlenecks and von Neumann architecture limitations, many purpose-built devices (such as Nvidia’s Jetson or Google’s TPU) tend to use smaller geometries to gain performance per watt, which is an expensive way to solve the edge AI computing challenge. With the multi-level memory approach using analog in-memory compute methods, computation is done on-chip in flash cells, so one can use bigger geometries and reduce mask costs and lead times.
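The footprint argument in the list above can be made concrete with a rough transistor count. This sketch combines two figures from the text (4.2 million weights for ‘1.0 MobileNet-224’ and 4-bit weights) and treats ‘1.5T’ as 1.5 transistor equivalents per cell; both the pairing and that reading are assumptions, and transistor count is only a proxy for silicon area.

```python
params = 4_200_000          # weights in '1.0 MobileNet-224' (from the text)
bits_per_weight = 4         # one 4-bit integer per weight

# 6T SRAM: six transistors per stored bit, so 4 * 6 = 24 per weight.
sram_transistors = params * bits_per_weight * 6     # 100_800_000

# Split-gate flash: one multi-level cell per weight, counted as 1.5 transistors.
flash_transistors = params * 3 // 2                 # 6_300_000

reduction = sram_transistors / flash_transistors    # 16.0
```

Under these assumptions weight storage needs about 16 times fewer transistor equivalents, before accounting for the difference in bitcell size.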
Edge computing applications show great promise. Yet there are power and cost challenges to solve before edge computing can take off. A major hurdle can be removed by using a memory approach that performs computation on-chip in flash cells. This approach takes advantage of a production-proven, de facto standard type of multi-level memory technology solution that is optimized for machine learning applications.
Vipin Tiwari has more than 20 years of extensive experience in product development, product marketing, business development, technology licensing, engineering management and memory design. Currently Mr. Tiwari is Director, Embedded Memory Product Development, at Silicon Storage Technology, Inc., a Microchip Technology company.