Improving density in circuit design is an ongoing challenge. One solution is to reconsider circuit layouts from the perspective of bandwidth optimization.
This case uses a 14-core parallel run. You can change the number of cores by editing the system/decomposeParDict file.
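The core count can also be changed programmatically, as in the minimal sketch below. It assumes that system/decomposeParDict is an OpenFOAM-style dictionary whose core count is stored in a `numberOfSubdomains` entry; the text above does not name the entry, so treat that key as an assumption. The helper `set_subdomains` and the `SAMPLE` text are hypothetical names used only for illustration.

```python
import re


def set_subdomains(dict_text: str, n_cores: int) -> str:
    """Return dict_text with its numberOfSubdomains entry set to n_cores.

    Assumes an OpenFOAM-style entry of the form `numberOfSubdomains 14;`.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be a positive integer")
    pattern = re.compile(r"^(\s*numberOfSubdomains\s+)\d+(\s*;)", re.MULTILINE)
    # A function replacement avoids ambiguity between group references and digits.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{n_cores}{m.group(2)}", dict_text
    )
    if count != 1:
        raise ValueError(
            f"expected exactly one numberOfSubdomains entry, found {count}"
        )
    return new_text


# Hypothetical dictionary contents, used only to demonstrate the helper.
SAMPLE = """numberOfSubdomains 14;
method          scotch;
"""

if __name__ == "__main__":
    print(set_subdomains(SAMPLE, 8))
```

To apply it to a real case, read system/decomposeParDict, pass its text through `set_subdomains`, and write the result back. If the case is an OpenFOAM case, the domain would then need to be decomposed again, and the process count given to the MPI launcher should match the new value.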
Abstract: In-network aggregation (INA) accelerates gradient aggregation in distributed machine learning (DML) by alleviating communication bottlenecks, but its effectiveness crucially depends on two ...