Figure 5 shows the layout (after place-and-route) of an EIE processing element. The power/area breakdown is shown in Table II. We brought the critical path delay down to 1.15 ns by introducing 4 pipeline stages to update one activation: codebook lookup and address accumulation (in parallel), output activation read and input activation multiply (in parallel), shift and add, and output activation write. Activation read and write access a local register, and activation bypassing is employed to avoid a pipeline hazard. Using 64 PEs running at 800 MHz yields a performance of 102 GOP/s. Considering 10× weight sparsity and 3× activation sparsity, a dense DNN accelerator would need about 3 TOP/s to deliver equivalent application throughput.
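The throughput figures above can be checked with a short calculation. The sketch below is illustrative only: the function and parameter names are ours, and it assumes that each PE completes one multiply-accumulate per cycle and that a multiply-accumulate counts as two operations, which is the reading under which 64 PEs at 800 MHz give roughly 102 GOP/s.

```python
def max_clock_hz(critical_path_s: float) -> float:
    """Upper bound on clock frequency implied by a critical path delay."""
    return 1.0 / critical_path_s


def peak_ops_per_s(num_pes: int = 64, clock_hz: float = 800e6,
                   ops_per_mac: int = 2) -> float:
    """Peak throughput on the compressed (sparse) workload.

    Assumption: one multiply-accumulate per PE per cycle, counted as
    ops_per_mac operations.
    """
    return num_pes * clock_hz * ops_per_mac


def dense_equivalent_ops_per_s(sparse_ops_per_s: float,
                               weight_sparsity: float = 10.0,
                               activation_sparsity: float = 3.0) -> float:
    """Throughput a dense accelerator would need for the same application rate.

    A dense design performs the work that sparsity lets the PEs skip, so the
    two sparsity factors multiply.
    """
    return sparse_ops_per_s * weight_sparsity * activation_sparsity


if __name__ == "__main__":
    fmax = max_clock_hz(1.15e-9)
    sparse = peak_ops_per_s()
    dense = dense_equivalent_ops_per_s(sparse)
    print(f"max clock from critical path: {fmax / 1e6:.1f} MHz")
    print(f"peak throughput:              {sparse / 1e9:.1f} GOP/s")
    print(f"dense-equivalent throughput:  {dense / 1e12:.2f} TOP/s")
```

Under these assumptions the 1.15 ns critical path allows a clock of roughly 870 MHz, so the 800 MHz operating point leaves some timing margin, and the 30× combined sparsity factor turns about 102 GOP/s into about 3 TOP/s of dense-equivalent work.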