This is a sample assignment for international computer science students, on the topic "Artificial Intelligence Chips & Comparison with CPUs and GPUs".
Abstract
With the rapid growth of artificial intelligence and machine learning algorithms, there is an ever-increasing demand for processors with high performance and low power consumption. Executing machine learning (ML) algorithms quickly requires high computing power that can perform the basic operations of ML algorithms efficiently. Since machine learning algorithms involve complex mathematical calculations, processors are being designed to perform these computations in a single clock cycle in order to train models quickly.
This paper discusses the various processors that have been used to implement machine-learning algorithms. It further discusses the need for application-specific processors, followed by an example of an Artificial Intelligence (AI) accelerator, the Tensor Processing Unit (TPU). The paper concludes with a brief comparison between the central processing unit (CPU), the graphics processing unit (GPU), and the AI accelerator.
1. INTRODUCTION
In view of the increasing volume and range of available data, statistical analysis is becoming essential for delivering in-depth information that is cheap and readily available. Using artificial intelligence and machine learning, algorithms that can process larger, more complex data can be programmed to produce faster and more accurate results. By identifying specific patterns to avoid unknown risks, companies are finding lucrative opportunities to grow their businesses. These algorithms help companies bridge the gap between their services and their customers with better decisions and less human intervention [1].
Artificial intelligence can be defined as a wide-ranging branch of computer science concerned with building smart machines capable of performing tasks that typically require human intelligence [2]. Building expert systems that can advise their users, and developing systems that can exhibit human-like behavior, are the main goals of artificial intelligence systems. Figure 1 lists the different ways to develop artificial intelligence systems. The fields described in Figure 1 perform a large number of computations to train a system with large data sets, and to execute these computations the system requires high computing power. Therefore, when developing AI systems, computing power becomes a limiting factor. For example, in order to recognize a cat in a photo, a deep learning algorithm may need to see millions of images before its artificial neurons "remember" how to recognize a cat. Therefore, chips designed explicitly for the artificial intelligence field accelerate the development of AI systems by providing higher performance on complex computations while consuming less power.
2. HISTORY OF AI ACCELERATION
With the increasing demand for and popularity of artificial intelligence and machine learning after the second AI winter [4], various processors and microcontrollers were used to accelerate the development of AI systems and machine learning models. As deep learning and machine learning workloads rose in prominence in the 2010s, specialized hardware units were developed, or tailored from existing hardware, to accelerate these tasks.
In the 1990s, digital signal processors were used as neural network accelerators, and FPGA-based (field-programmable gate array) accelerators were developed to explore their use for training and inference [3]. In the 2000s, central processing units (CPUs), with features beneficial for developing artificial intelligence systems such as fast access to memory and arithmetic and logical computing, replaced digital signal processors as neural network accelerators. Graphics Processing Units (GPUs) are electronic circuits used for processing images, video, and animations. The mathematical foundation for neural networks and image processing is essentially the same, resulting in the growing use of GPUs for machine learning tasks. Because of the GPU's increasing popularity in machine learning and artificial intelligence, GPUs continue to evolve towards supporting machine learning and deep learning operations, both in training and inference [3].
To provide increased programmability and to develop dedicated inference accelerators with short latencies, fast code porting, and support for major deep learning frameworks, reconfigurable devices like field-programmable gate arrays (FPGAs) and chips like application-specific integrated circuits (ASICs) can be used. Reconfigurable devices like FPGAs make it easier to evolve the hardware based on the needs of the AI system. Although the performance of GPUs and FPGAs is much better than that of CPUs for machine learning, an ASIC can achieve significantly higher efficiency with a more specific design. Developing chips dedicated to deep learning or machine learning, i.e., AI chips, will further boost the efficiency of developing AI systems.
2.1 Computational Paradigms in AI
Though developing AI systems has much in common with traditional computing, it also requires advanced computing approaches such as:
Unstructured Data: The datasets used for developing AI systems or machine-learning models usually contain unstructured data such as images, video, and voice. Therefore, models must first be trained by fitting them to sample data, and the trained models must then be used to process new data.
Parallel Processing: A large amount of computation is usually required for processing data and training the model. The basic calculation is essentially linear algebra, such as large matrix multiplication (see the sketch after this list). Massively parallel computing hardware is better suited to such operations than traditional general-purpose processors.
Near-memory Computation: Model parameters are huge, requiring enormous storage capacity, high bandwidth, and low-latency memory access. Data locality is prominent, making the reuse of data worthwhile.
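To make the parallel-processing point concrete, here is a minimal Python/NumPy sketch (with illustrative, made-up sizes, not figures from this paper) of the large matrix multiplication that dominates ML workloads; every multiply-accumulate in it is independent of the others, which is exactly what massively parallel hardware exploits.

```python
import numpy as np

# Illustrative sizes (not from the paper): one dense layer's worth of work.
batch_size, in_features, out_features = 256, 1024, 512

inputs = np.random.randn(batch_size, in_features).astype(np.float32)
weights = np.random.randn(in_features, out_features).astype(np.float32)

# 256 * 1024 * 512 ~ 134 million multiply-accumulates, all independent
# of one another -- ideal work for massively parallel hardware.
pre_activations = inputs @ weights
print(pre_activations.shape)  # (256, 512)
```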
3. ARTIFICIAL INTELLIGENCE CHIPS
AI chips, or AI accelerators, are application-specific processors designed for training machine learning models and for using these trained models for inference. AI accelerators are less general-purpose than GPUs. However, they excel at computing matrix multiplications, at computing a neural network layer's output error, and at propagating the computed error to the neighboring layers. Furthermore, an AI accelerator reduces the time required to develop an AI system as compared to CPUs and GPUs. The majority of AI accelerators are designed using Field-Programmable Gate Arrays (FPGAs) or Application-Specific Integrated Circuits (ASICs).
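As a rough illustration of the error computations mentioned above, the following Python sketch (hypothetical shapes, with a sigmoid activation and squared-error loss chosen purely for simplicity) computes an output layer's error and propagates it back to the previous layer; both steps reduce to a matrix multiplication plus elementwise products, which is why accelerators optimize for them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z_prev = rng.standard_normal((64, 128))   # previous layer pre-activations
W = rng.standard_normal((128, 10))        # weights: previous layer -> output
a_prev = sigmoid(z_prev)                  # previous layer activations
y_hat = sigmoid(a_prev @ W)               # network output
y = rng.random((64, 10))                  # made-up targets

# Output layer error for squared-error loss with sigmoid activation:
# delta_out = (y_hat - y) * sigmoid'(z_out), where sigmoid' = s * (1 - s).
delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)

# Propagating the error to the neighboring (previous) layer is again a
# matrix multiplication followed by an elementwise product.
delta_prev = (delta_out @ W.T) * a_prev * (1.0 - a_prev)
print(delta_prev.shape)  # (64, 128)
```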
In an environment transitioning to workload-optimized AI systems, companies will be more inclined towards the fastest, most flexible, most power-efficient, and lowest-cost hardware technology to execute their artificial intelligence and machine learning tasks, including development, training, and inference. The numerous AI chipset architectures available in the market reflect the diversity of machine learning, deep learning, natural language processing, and other AI workloads, ranging from storage-intensive training to compute-intensive inference. To address this variety of workloads, manufacturers combine a wide range of technologies in their product portfolios and even in embedded AI implementations.
Examples of AI Chips are Google Tensor Processing Unit (TPU), Intel Nervana, Qualcomm AI Chip, LG Neural Engine and AWS Inferentia.
4. TYPES OF AI CHIPS
Artificial intelligence chips, being highly specific in their operations, require a more specialized and customized architecture that can perform complex computations and meet the computational paradigms of AI. Thus, the two circuits most used to develop AI chips are Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). Both of these circuits can be used to build different types of AI chips, depending on the application and the system specifications.
4.1 ASIC as an AI Accelerator
An Application-Specific Integrated Circuit, or ASIC, is an integrated circuit designed to serve a highly specific purpose rather than to perform several general-purpose operations [6]. Despite their high cost, ASICs can be tailored to match the exact requirements of a product and reduce the need to integrate additional components.
The major advantages of ASICs are that they are small and hence reduce the need for excess components. ASICs consume less power than other circuits and provide higher performance. Because a large number of circuits are built on a single chip, ASICs aid in building high-speed applications. They are also highly efficient in their specific application.
A major disadvantage of ASICs is that, since they are customized, their flexibility for programming is low. Since the chips are designed from the ground up, their cost per unit is high. ASIC chips also have a longer time to market.
4.2 FPGA as an AI Accelerator
All processors are integrated circuits, i.e., electronic circuits integrated on a silicon chip. Usually, the circuit is fixed when the chip is designed. A field-programmable gate array, or FPGA, is a type of chip that allows the end user to reconfigure the circuit after it has been manufactured, by programming it as required. The FPGA constructs a logical circuit, which can be reconfigured by connecting or disconnecting different parts of the circuit engraved on the silicon chip. FPGAs consist of a number of programmable circuits that can independently perform a small amount of computation, together with a programmable interconnect that connects these circuits. Due to this array of programmable circuits, the FPGA can perform a large number of parallel operations [7].
The advantages of using an FPGA as an AI accelerator are that it consumes less power than CPUs and GPUs, and that programming an FPGA is cheaper than designing an ASIC.
5. TENSOR PROCESSING UNIT
The Tensor Processing Unit, or TPU, is an application-specific integrated circuit developed by Google for the TensorFlow machine-learning library and for the training of neural networks.
5.1 Model of Neural Network
Neural networks are parallel computing models that attempt to build a computer model of the brain. The main objective is to develop a system that performs various computational tasks faster than traditional systems. These tasks include pattern recognition and classification, approximation, optimization, and data clustering [9]. Figure 2 below depicts the model of a neural network that can be trained using the tensor processing unit.
A neural network consists of an input layer, an output layer, and one or more hidden layers. The input layer consists of inputs (x1, x2, … xm) and randomly initialized weights for the corresponding inputs. During training, the inputs remain constant throughout the network and hence need to be read only once, whereas the weights are updated every cycle and for every layer, so they need to be constantly read and updated. The inputs and their corresponding weights are multiplied and added to obtain the sum of products of inputs and weights. This result is then normalized using an activation function, and based on the normalized result the output of the neural network is generated at the output layer. The summation of products of inputs and their corresponding weights is effectively a matrix multiplication. Hence, AI accelerators designed for training neural networks require high matrix-multiplication computing power, storing the weights in a memory buffer from which they can be read and updated quickly, while the inputs can be stored in a buffer and read once. During inference, the activation values are updated for every layer while the weights remain constant for a batch. Therefore, the activation values need to be stored in a unified buffer, while the weights determined during training can be stored in a slower off-chip memory.
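The layer computation described above can be sketched in a few lines of Python (hypothetical layer sizes; ReLU stands in for whichever activation function the model uses): each layer multiplies the incoming activations by its weight matrix, then applies the activation function.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)  # stand-in activation function

rng = np.random.default_rng(42)
x = rng.random(784)                         # inputs x1..xm, read once
layer_sizes = [784, 256, 64, 10]            # made-up layer sizes
# Randomly initialized weights, one matrix per layer; during training
# these are the values that are constantly read and updated.
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

activation = x
for W in weights:
    # Sum of products of inputs and weights == a matrix multiplication,
    # normalized by the activation function.
    activation = relu(activation @ W)
print(activation.shape)  # (10,): the output layer
```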
5.2 Architecture
The TPUv1 was designed for a high volume of low-precision computations. It was developed only for neural network inference, since its architecture was designed to perform a large number of low-precision computations rather than a small number of high-precision computations.
The Host Interface in a TPU is used to communicate with the host system. In this arrangement, the TPU accelerates the matrix multiplications while the host system takes care of the other general-purpose operations needed to run the model. Three different forms of data are present in the TPU: the weights are stored in DDR3 memory, the activations are stored in the Unified Buffer from where they can be read and updated quickly, and the control instructions are stored in the Control Unit.
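As a rough sketch of what "low precision" means in practice, the following Python example quantizes float32 activations and weights to 8-bit integers using a simple symmetric scheme (real quantizers are more elaborate; sizes and names here are made up), performs the multiply-accumulates in integer arithmetic, and rescales the result back to floating point.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric linear quantization to signed num_bits integers."""
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(1)
acts = rng.standard_normal((128, 256)).astype(np.float32)
wts = rng.standard_normal((256, 64)).astype(np.float32)

q_acts, s_a = quantize(acts)
q_wts, s_w = quantize(wts)

# Multiply-accumulate in cheap integer arithmetic (int32 accumulators),
# then rescale back to floating point once per output.
result = (q_acts.astype(np.int32) @ q_wts.astype(np.int32)) * (s_a * s_w)
err = np.abs(result - acts @ wts).max()
print(f"max abs error vs float32 matmul: {err:.3f}")
```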
The host needs to access the Unified Buffer quickly to read the output of inference and to write new inputs for computation. As seen in figure 3, a major part of the chip's space (53%) is consumed by the unified buffer and Matrix Multiplication Unit (MXU) [10] [11].
5.3 Workflow of TPU
During start-up, the Unified Buffer and DDR3 storage are empty. The host machine loads the trained neural network model onto the TPU; the weights of the model are placed in the DDR3 memory.
The host system fills the input values (activations) into the Unified Buffer. The control unit sends a signal to fetch the weights and load them into the Matrix Multiply Unit (MXU). Before computation of the next batch, the weights are pre-fetched into the Weight FIFO; this ensures that the next set of weights is being prefetched while the current batch is computed.
When the host system triggers the execution of the inference engine, the input values and weights are loaded into the MXU, and the output of the matrix multiplication is propagated to the Accumulators. The MXU writes the updated activations back to the Unified Buffer via the Accumulators and then through the Activation Pipeline. The activation function of the neural network is applied in the Activation module: the output of the MXU is accumulated, and the normalized activation value for the input values is computed. These updated activation values replace the old values in the Unified Buffer.
The above steps are repeated for all the hidden layers present in the trained neural network model. The activation values obtained in the last layer are sent back to the host system via the Host Interface.
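The whole loop can be mirrored in a toy Python simulation (purely illustrative names and sizes; the real chip pipelines these stages in hardware rather than running them sequentially): weights sit in a "DDR3" list, are prefetched through a FIFO into the matrix unit, and the activations cycle through the unified buffer layer by layer.

```python
import numpy as np
from collections import deque

def relu(z):
    return np.maximum(z, 0.0)  # stand-in for the Activation module

rng = np.random.default_rng(7)
# "DDR3": per-layer weight matrices of the loaded model (made-up sizes).
ddr3 = [rng.standard_normal((256, 256)) * 0.05 for _ in range(3)]
weight_fifo = deque(ddr3)                 # weights prefetched in order
unified_buffer = rng.random((8, 256))     # host writes input activations

for _ in range(len(ddr3)):
    W = weight_fifo.popleft()             # next weights enter the MXU
    accumulators = unified_buffer @ W     # MXU matmul -> Accumulators
    unified_buffer = relu(accumulators)   # Activation Pipeline, write back

# Host reads the final activations back through the Host Interface.
print(unified_buffer.shape)  # (8, 256)
```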
The control flow in the above diagram is marked in red. The control unit takes the instructions from the host and ensures that they are executed in the correct order. The control unit takes care of operations such as when the MXU should perform a matrix multiplication, which weights need to be prefetched, when the weights should be fetched into the Weight FIFO, and which operations the Activation Pipeline needs to perform based on the activation function. Hence, the control flow is responsible for the proper execution of operations on the chip; it acts as a captain for the rest of the chip. The TPU outperforms the CPU and GPU in performing linear algebra computations.
6. CPU vs GPU vs AI Chip
This section discusses the major differences between the various processors and it further explains the most suitable processor for developing AI systems based on the size of the system that needs to be designed.
There is thus a tradeoff between flexibility and efficiency in different processors. The CPU is a general-purpose processor that enables the user to perform a wide variety of operations, though its efficiency is lower. On the contrary, AI chips developed using FPGAs or ASICs can only perform the specific operations for which they are designed, which reduces the flexibility of the processor, but these chips are highly efficient at performing complex machine learning computations [12].
Apart from flexibility and efficiency, another major factor is the performance of processors while developing machine-learning models. As seen in the diagram below, the TPU or AI Chip outperforms the CPU and GPU while performing predictions using a trained neural network model.
Another performance metric for comparing the processors is operations per cycle: the CPU can handle tens of operations per cycle and the GPU tens of thousands, while the TPU can handle up to 128,000 operations per cycle.
Choosing the ideal processor for developing an AI system is an important step, since a large number of factors need to be taken into consideration. Performance, cost, dataset size, and model size are a few of the parameters that need to be considered when selecting the ideal processor for developing a machine learning or AI system.
As seen in this paper, AI chips have huge potential to revolutionize the development of AI systems and deep learning models. AI chips provide higher throughput for developing machine-learning models as compared to other processors. With the increasing demand for machine learning and deep learning, chip manufacturers can attract more customers by developing chips that perform compute-intensive operations in less time and with higher efficiency. Thus, AI chips will be in huge demand in the near future as the complexity of deep learning models keeps increasing.