We distribute the matrix and parallelize the matrix-vector computation by interleaving the rows of the matrix W over multiple processing elements (PEs). With N PEs, PE_k holds all rows W_i, output activations b_i, and input activations a_i for which i mod N = k. The portion of column W_j in PE_k is stored in the CSC format described in Section III-B, but with the zero counts referring only to zeros in the subset of the column held by this PE. Each PE has its own v, x, and p arrays that encode its fraction of the sparse matrix.
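
The row interleaving can be illustrated with a minimal Python sketch. It rests on several assumptions that the text above does not spell out: v holds the non-zero weights, x holds the number of zeros preceding each entry within the PE's own slice of the column, and p holds the start offset of each column in v. The function names `partition_rows` and `matvec` are hypothetical, the PEs are simulated sequentially rather than in parallel, the full input vector is passed to every PE, and any bit-width limit on the zero counts is ignored.

```python
def partition_rows(W, n_pes):
    """Encode a dense matrix W (list of rows) as per-PE (v, x, p) arrays.

    Assumed encoding: v = non-zero values, x = zeros preceding each value
    within this PE's slice of the column, p = column start offsets into v.
    """
    n_rows = len(W)
    n_cols = len(W[0]) if W else 0
    pes = []
    for k in range(n_pes):
        local_rows = range(k, n_rows, n_pes)  # rows i with i mod N == k
        v, x, p = [], [], [0]
        for j in range(n_cols):
            zeros = 0  # zero count restarts for each column
            for i in local_rows:
                w = W[i][j]
                if w == 0:
                    zeros += 1
                else:
                    v.append(w)
                    x.append(zeros)
                    zeros = 0
            p.append(len(v))  # end of column j, start of column j + 1
        pes.append({"v": v, "x": x, "p": p})
    return pes


def matvec(pes, a, n_rows):
    """Compute b = W a from the per-PE encodings (PEs simulated in turn)."""
    n_pes = len(pes)
    b = [0] * n_rows
    for k, pe in enumerate(pes):
        for j, a_j in enumerate(a):
            if a_j == 0:
                continue  # a zero input activation contributes nothing
            local = -1  # index of the current row within this PE's slice
            for idx in range(pe["p"][j], pe["p"][j + 1]):
                local += pe["x"][idx] + 1
                # Local row `local` of PE k is global row local * N + k.
                b[local * n_pes + k] += pe["v"][idx] * a_j
    return b


W = [
    [0, 2, 0],
    [1, 0, 0],
    [0, 0, 3],
    [4, 0, 5],
]
pes = partition_rows(W, n_pes=2)
print(pes[0])                      # PE 0 holds rows 0 and 2
print(pes[1])                      # PE 1 holds rows 1 and 3
print(matvec(pes, [1, 1, 1], 4))   # [2, 1, 3, 9]
```

In this example the entry 3 in column 2 is preceded by two zeros in the full column but by only one zero within PE 0's slice (rows 0 and 2), so its x entry is 1. Because each PE writes only to the output rows it owns, the per-PE loops in the sketch never touch the same element of b.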