The following organizations were selected for HIVE project collaborations: Intel Corporation (Santa Clara, California), Qualcomm Intelligent Solutions (San Diego, California), Pacific Northwest National Laboratory (Richland, Washington), Georgia Tech (Atlanta, Georgia), and Northrop Grumman (Falls Church, Virginia). “The HIVE program is an exemplary prototype to forge new R&D pathways that can deliver unprecedented levels of hardware specialization and to build the software infrastructure.” The new software must be compatible not only with the new hardware but also with existing CPUs and GPUs. In addition, the new software must support a large amount of existing software used in the data science world. Part of the goal is to be able to connect existing software and libraries to the HIVE software framework to make it easier to use.
Keywords: Mass-Multithreaded Architecture; Hybrid Microprocessor; Data Flow Control; Graphical Representation of Programs; Specialized Accelerators; Massively Parallel Systems
According to DARPA, the main goal of HIVE is to create a “graph analytics processor” that handles graph representations of the relationships in a network more efficiently than traditional data formats and processing techniques. Such a processor would work in combination with emerging machine learning and other artificial intelligence techniques that can categorize raw data elements. By updating the elements in the graph as new data becomes available, a powerful graph analytics processor could discern otherwise hidden causal relationships among the data elements in the graph representations. DARPA suggests such a graph analytics processor might achieve a “thousand-fold improvement” in processing efficiency over today’s best processors, enabling the real-time identification of strategically important relationships as they unfold in the field rather than relying on after-the-fact analyses in data centers. The current software includes algorithms exposed via an API, an internal graph representation of the data, and hardware “backends” (GPU, CPU, ASIC). Chinese specialist Wang explained that current AI platforms force major trade-offs, whether that means optimizing for a certain type of hardware or for certain algorithms. He said: “If you specialize in graph representation, then you will be disconnected from the data science ecosystem because you are ‘cut off’ from some important libraries.” As part of the second phase of HIVE, the developers are building a modular architecture framework. Existing software is being broken down into components so that each can be inserted into the framework according to its best features.
The structure includes a Workflow Scheduler and a Dispatch Engine, which use the Dask task scheduler to perform scheduling and dispatch operations. They target a wide range of hardware, so that data scientists can immediately span CPUs, GPUs and ASICs and take full advantage of the same software infrastructure. The PUMA graph processor, for example, can be used to accelerate the load processes. The framework must also include a set of converters that translate data between different formats [2,3].
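The dispatch idea described above can be illustrated with a minimal sketch in plain Python. It does not use Dask or any actual HIVE interface: the `Dispatcher` class, the backend labels, the `degree` algorithm and the edge-list converter are all hypothetical names chosen for this illustration. The sketch only shows how one algorithm name can be routed to whichever registered backend is preferred, after a converter has brought the data into the internal format.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical graph formats used only for this illustration.
EdgeList = List[Tuple[int, int]]
Adjacency = Dict[int, List[int]]


def edge_list_to_adjacency(edges: EdgeList) -> Adjacency:
    """Converter: turn an undirected edge list into an adjacency dictionary."""
    adj: Adjacency = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


class Dispatcher:
    """Maps (algorithm, backend) pairs to implementations and picks one per call."""

    def __init__(self, preference: List[str]) -> None:
        self.preference = preference  # backend labels, most preferred first
        self.impls: Dict[Tuple[str, str], Callable] = {}

    def register(self, algorithm: str, backend: str, fn: Callable) -> None:
        self.impls[(algorithm, backend)] = fn

    def run(self, algorithm: str, *args):
        # Try backends in order of preference and use the first one available.
        for backend in self.preference:
            fn = self.impls.get((algorithm, backend))
            if fn is not None:
                return backend, fn(*args)
        raise LookupError(f"no backend implements {algorithm!r}")


def degree_cpu(adj: Adjacency) -> Dict[int, int]:
    """Plain CPU implementation of a trivial graph algorithm (node degree)."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


dispatcher = Dispatcher(preference=["asic", "gpu", "cpu"])
dispatcher.register("degree", "cpu", degree_cpu)  # only a CPU version exists here

graph = edge_list_to_adjacency([(0, 1), (1, 2), (2, 0), (2, 3)])
backend_used, degrees = dispatcher.run("degree", graph)
```

Because only a CPU implementation is registered, the call falls through the preferred ASIC and GPU entries and runs on the CPU. Registering an accelerator version later would change the backend used without changing the calling code.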
In Russia, we also promote the proactive development of domestic
massively parallel processors. The objectives of the project also include
information and analytical work and the development of technical
solutions for creating a high-speed element and design base. The basic
component of the processor is a tile formed by a 64-thread core
with attached specialized accelerators (SFU). The massively parallel
processor must include a hundred tiles connected by an on-chip
network, several on-chip interaction links, and a PCI-e interface to the
host processor. The ideology of this massively parallel architecture is
similar to that of the “Colossus” processor (from Graphcore), which is
focused on machine learning tasks. The domestic massively parallel
systems, however, are hybrid and reconfigurable platforms. We now
look at the basic principles for the development of innovative AI
systems. The DARPA project includes five key technologies:
a) Intelligent scheduling to manage the set of domain resources in the context of specific applications. This should be a technical focus of the project, and it is also the basis for realizing the project’s goals. Only runtime intelligent scheduling allows software and hardware to be decoupled. The performance of the intelligent scheduler directly determines the efficiency of the final system. The technical description discusses implementation options for this part, such as adding a processor dedicated to scheduling and allowing each processing element (PE) to analyze and report its own running status.
b) Software tools to enable a development ecosystem that exercises the full capability of the highly programmable system. Software tools include development tools such as compilers and debuggers, as well as algorithm libraries, applications and examples. Software tools are likely the deciding factor in whether specific computations can actually be implemented. A basic requirement is that they can also support new designs in the field. DARPA strongly encourages the use of existing software tools and also supports open-source work.
c) Forming domain representations as ontologies. How to describe a “domain” is a question worthy of attention. For domain-specific software and hardware design, a good description of the characteristics of the field is a prerequisite. However, these definitions are still too simple to reflect the relationship between different groups.
d) Medium access control (MAC) to interconnect the PEs and to provide the required data throughput, taking into consideration latency, power, and other domain constraints. In the HIVE project, the MAC layer is the medium for exchanging data structures between PEs; it covers the physical structure, the topology and a standard programming interface.
e) Hardware integration of the right set of PEs on the MAC layer with the operating scheduler and software into a fabricated socket. The final result of the project also includes heterogeneous SoC hardware, but this does not seem to be its focus. According to the project description, a PE is a relatively coarse-grained concept, such as a CPU, GPU, TPU, neuromorphic unit, DSP or hardware accelerator.
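The runtime scheduling idea in item a) can be sketched as follows. This is only an illustration under stated assumptions, not the scheduler of the DARPA project: the `PE` class, the per-task cost table and the greedy earliest-finish-time policy are all hypothetical choices. Each PE reports when it will next be free, and the scheduler assigns every task to the PE that is expected to finish it first.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PE:
    """A processing element that reports its own status to the scheduler."""
    name: str
    cost_ms: Dict[str, float]   # hypothetical run time (ms) per task kind
    busy_until_ms: float = 0.0  # self-reported time at which the PE is free

    def estimate_finish(self, kind: str, now_ms: float) -> float:
        # A task starts when the PE is free, but never before "now".
        start = max(now_ms, self.busy_until_ms)
        return start + self.cost_ms[kind]


def schedule(tasks: List[str], pes: List[PE],
             now_ms: float = 0.0) -> List[Tuple[str, str, float]]:
    """Greedy earliest-finish-time assignment.

    Returns a list of (task kind, PE name, estimated finish time in ms).
    """
    plan = []
    for kind in tasks:
        # Only PEs that support this kind of task are candidates.
        candidates = [pe for pe in pes if kind in pe.cost_ms]
        if not candidates:
            raise ValueError(f"no PE can run task kind {kind!r}")
        best = min(candidates, key=lambda pe: pe.estimate_finish(kind, now_ms))
        finish = best.estimate_finish(kind, now_ms)
        best.busy_until_ms = finish  # the chosen PE updates its reported status
        plan.append((kind, best.name, finish))
    return plan


# Made-up cost figures for three coarse-grained PEs.
pes = [
    PE("cpu", {"fft": 5.0, "conv": 20.0}),
    PE("dsp", {"fft": 2.0}),
    PE("accel", {"conv": 4.0}),
]
plan = schedule(["fft", "conv", "fft", "fft"], pes)
```

With these made-up costs, the first two FFT tasks go to the DSP, while the third goes to the slower CPU because the DSP is still busy. The application code only names the task kind, which is the sense in which such scheduling decouples software from hardware.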
In the field of dedicated processor and accelerator design,
the performance competition is fierce. It has even become an
international-level scoring contest. High-level international
venues (such as ISSCC) regularly publish the latest “scoring results”,
such as the performance comparison of CNN acceleration
chips shown in Figure 1. A good score does not represent
an absolute advantage, but it has become a measure of whether
a design is recognized as a good one. The benchmarks
obtained from the references are not completely comparable, however,
because each design has its own special features. In the article,
the author hopes that readers can step out of the
“mutual scoring” mindset and aim for the best design possible in theory
(from the point of view of extreme design).
The analysis is traditionally divided into four parts:
a) Performance indicators of digital logic chips.
b) Checks and balances between indicators and trade-off methods.
c) Limit quantitative design method.
d) Investigation and analysis of the reasons for not reaching the optimal performance.
When analyzing a chip architecture, we mainly need to consider the following indicators, as shown in Figure 2:
Single Standard (Orange Part)
a) Power consumption: static power consumption, dynamic power consumption.
b) Energy consumption: the energy required to process one task.
c) Throughput: for example, communication speed (bps) or images per second.
d) Latency (delay): the time difference between input and output. For real-time systems, latency is critical.
e) Effective computing capacity: the number of operations per second actually achieved when performing a particular computation.
f) Peak computing capacity: the total number of operations per second that the chip structure can perform.
g) Clock frequency: adjustable clock frequency/voltage, peak clock frequency.
h) Quantization accuracy: the data quantization width at the input, output and intermediate levels.
Normalization Standard (Purple Part)
Because of the checks and balances between indicators, improving a desired indicator may also increase other, undesired ones, so designs need to be measured on a normalized scale. The appropriate normalization often depends on the application of the chip, for example: computing power normalized to power consumption, throughput normalized to energy, or computing power normalized to clock frequency.
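A short sketch may help show how the single and normalized indicators relate to each other. The class name `ChipRun`, its field names and all numbers below are assumptions made up for illustration; they do not describe any real chip or the data behind Figure 1 or Figure 2.

```python
from dataclasses import dataclass


@dataclass
class ChipRun:
    """Measurements for one chip on one workload (all values are made up)."""
    name: str
    peak_gops: float     # peak computing capacity, 10^9 operations per second
    ops_per_task: float  # operations needed for one task, e.g. one image
    task_time_s: float   # time to process one task, in seconds
    power_w: float       # average power while processing, in watts
    clock_ghz: float     # clock frequency, in GHz

    @property
    def effective_gops(self) -> float:
        """Effective computing capacity actually achieved on this workload."""
        return self.ops_per_task / self.task_time_s / 1e9

    @property
    def utilization(self) -> float:
        """Fraction of the peak computing capacity that is actually used."""
        return self.effective_gops / self.peak_gops

    @property
    def energy_per_task_j(self) -> float:
        """Energy consumption for one task: average power times time."""
        return self.power_w * self.task_time_s

    @property
    def gops_per_watt(self) -> float:
        """Computing power normalized to power consumption."""
        return self.effective_gops / self.power_w

    @property
    def ops_per_cycle(self) -> float:
        """Computing power normalized to clock frequency."""
        return self.effective_gops / self.clock_ghz


chip_a = ChipRun("A", peak_gops=1000, ops_per_task=2e9,
                 task_time_s=0.004, power_w=2.0, clock_ghz=0.5)
chip_b = ChipRun("B", peak_gops=4000, ops_per_task=2e9,
                 task_time_s=0.002, power_w=10.0, clock_ghz=1.0)

for chip in (chip_a, chip_b):
    print(f"{chip.name}: {chip.effective_gops:.0f} GOPS effective, "
          f"{chip.utilization:.0%} of peak, "
          f"{chip.energy_per_task_j * 1e3:.1f} mJ/task, "
          f"{chip.gops_per_watt:.0f} GOPS/W")
```

In this made-up comparison, chip B finishes each task in half the time, yet chip A uses less energy per task and delivers more operations per watt. Which chip "wins" therefore depends on the normalization chosen for the application.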
Synthesis Results (Light Blue Part)
In a preliminary evaluation or sub-module assessment, some key indicators are given by the synthesis tool. These indicators are based on netlists, so they do not include wiring and can only be used for an initial estimate of chip cost. They include the combinational logic resource cost, the sequential (non-combinational) logic resource cost, and the device cost (such as RAM). The combinational logic cost can also be measured by the number of equivalent gates; the difference is that the combinational logic overhead depends on the process library, while the number of equivalent gates is independent of it.
Memory Related (Dark Blue Part)
Memory is often represented as a black box, described by the number of memory blocks, the capacity, depth and width of each block, and whether it is dual-port or single-port (a dual-port memory is about twice as large as a single-port one). Small memories of less than 1 KB, or memories with many read and write ports, are often implemented as a register file. When an even smaller memory or a storage device with a complex read-write interface is required, it is generally built directly from a register array. The difference between the two is that a register file usually has to be generated by a dedicated compiler, while a register array is described directly in a hardware description language. The area of a register file is much smaller than that of a register array of equivalent function.
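The rules of thumb above can be written down as a small selection helper. This is a sketch under explicit assumptions: the names `MemorySpec`, `choose_implementation` and `relative_macro_area` are hypothetical, "many ports" is assumed to mean three or more, and the `complex_interface` flag stands in for the "complex read-write interface" case because the text gives no size threshold for choosing a register array.

```python
from dataclasses import dataclass

SMALL_MEMORY_BYTES = 1024  # the "less than 1 KB" threshold mentioned in the text
MANY_PORTS = 3             # assumption: three or more ports counts as "many"


@dataclass
class MemorySpec:
    depth: int                       # number of words
    width: int                       # bits per word
    ports: int = 1                   # total number of read/write ports
    complex_interface: bool = False  # hypothetical flag for irregular access

    @property
    def capacity_bytes(self) -> float:
        return self.depth * self.width / 8


def choose_implementation(mem: MemorySpec) -> str:
    """Pick an implementation style using the rules of thumb from the text."""
    if mem.complex_interface:
        return "register array"  # written directly in an HDL
    if mem.capacity_bytes < SMALL_MEMORY_BYTES or mem.ports >= MANY_PORTS:
        return "register file"   # generated by a dedicated compiler
    return "memory macro"        # treated as a black box


def relative_macro_area(mem: MemorySpec) -> float:
    """Area relative to a single-port macro of equal capacity (dual-port is ~2x)."""
    if mem.ports not in (1, 2):
        raise ValueError("rule of thumb only covers single- and dual-port macros")
    return float(mem.ports)


frame_buffer = MemorySpec(depth=4096, width=32, ports=2)  # 16 KB, dual-port
small_fifo = MemorySpec(depth=64, width=32)               # 256 bytes
multi_port = MemorySpec(depth=1024, width=64, ports=4)    # 8 KB, four ports
scratch = MemorySpec(depth=16, width=8, complex_interface=True)

choices = [choose_implementation(m)
           for m in (frame_buffer, small_fifo, multi_port, scratch)]
```

For a real design, the thresholds would come from the memory compiler and process library in use rather than from fixed constants like these.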
Interface and Bus Related (Grey Part)
This includes the process library used, the number of wiring layers, the operating voltage, whether a low-power or high-performance process is used, etc.
Design Cost and Special Features (Pink Part)
a) For engineering research and development, it is also
necessary to consider the software/application development
cost and the ability to map new algorithms and parameters,
as well as time to market (TTM), non-recurring engineering
(NRE) costs, etc.
b) Some chips have advantages in other areas, such as boot time optimization, non-volatility, fanless operation (less than 4 W), dynamic voltage and frequency scaling, clock gating and other functions.
Similarly, integrating a new user API only requires adding an interface to one of the hardware backends and using at least one algorithm. Ultimately, the overall goal of HIVE is to unify and simplify the process of “optimizing the communication between graph software and hardware”. Reconfigurable hardware often needs to support multiple operating modes or operating parameters. Configurability and programmability ultimately mean adding some redundancy to the original dedicated circuit to improve flexibility. The efficiency of “ordinary” processors, however, is not consistent. With proper ASIC design, good indicators can be demonstrated for a limited class of “hard” logic algorithms. For a successful “breakthrough in the field of microelectronics”, it is necessary to start developing a massively parallel processor based on multithreaded cores with specialized accelerators.
- Joel Emer, Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang (2021) Tutorial on Hardware Architectures for Deep Neural Networks.
- Benchmarking DNN Processors.
- Samuel Williams, Andrew Waterman, David Patterson (2009) Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun ACM 52(4): 65-76.
- Wu Zhenzhi. The chip design method in pursuit of extreme performance.
- Introducing the Colossus™ MK2 GC200 IPU.