LBANN: Livermore Big Artificial Neural Network Toolkit

The Livermore Big Artificial Neural Network toolkit (LBANN) is an open-source, HPC-centric, deep learning training framework that is optimized to compose multiple levels of parallelism.

LBANN provides model-parallel acceleration through domain decomposition to optimize for strong scaling of network training. It also allows for composition of model parallelism with both data parallelism and ensemble training methods for training large neural networks with massive amounts of data. LBANN is able to take advantage of tightly-coupled accelerators, low-latency high-bandwidth networking, and high-bandwidth parallel file systems.

DistConv Repository

The DistConv repository contains a rewrite of the original DistConv algorithm, which was published as part of the LBANN C++ core, reimplemented using PyTorch 2.x DTensor objects.
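
The following is a minimal, hypothetical sketch (not code from this repository) of the data decomposition idea behind a DTensor-based DistConv: shard a convolution's input along a spatial dimension so that each rank owns a contiguous slab of the domain. The script name, mesh layout, and tensor sizes are illustrative, and the public `torch.distributed.tensor` import path assumes PyTorch 2.4 or newer (earlier 2.x releases expose the same API as `torch.distributed._tensor`).

    # Illustrative sketch only; run with, e.g.:
    #   torchrun --nproc_per_node=2 dtensor_spatial_shard_sketch.py
    import torch
    import torch.distributed as dist
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor


    def main():
        dist.init_process_group(backend="gloo")  # use "nccl" on GPU systems
        world_size = dist.get_world_size()

        # One-dimensional mesh over all ranks; spatial decompositions can also
        # use multi-dimensional meshes (e.g. height x width).
        mesh = init_device_mesh("cpu", (world_size,))

        # Global activation tensor in NCHW layout.
        x = torch.randn(8, 3, 64, 64)

        # Shard along dim 2 (height): each rank holds a contiguous spatial slab.
        x_dist = distribute_tensor(x, mesh, placements=[Shard(2)])

        # Each rank's local shard covers 64 / world_size rows of the height dim.
        print(f"rank {dist.get_rank()}: local shard shape {x_dist.to_local().shape}")

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Note that this sketch stops at the data decomposition; a full distributed convolution additionally requires exchanging halo regions between neighboring spatial shards, which is a central part of the DistConv algorithm described in the publications below.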

Publications

  • Yosuke Oyama, Naoya Maruyama, Nikoli Dryden, Erin McCarthy, Peter Harrington, Jan Balewski, Satoshi Matsuoka, Peter Nugent, Brian Van Essen. "The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism", under review for Special Session on Parallel and Distributed Computing Techniques for AI, ML and DL in Transactions on Parallel and Distributed Systems, July 2020.

  • Nikoli Dryden, Naoya Maruyama, Tom Benson, Tim Moon, Marc Snir, Brian Van Essen. "Channel and Filter Parallelism for Large-Scale CNN Training", in SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019 Article No. 10, Pages 1-20, DOI: 10.1145/3295500.3356207.

    @INPROCEEDINGS{8820780,
      author={N. {Dryden} and N. {Maruyama} and T. {Benson} and T. {Moon} and M. {Snir} and B. {Van Essen}},
      booktitle={2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
      title={Improving Strong-Scaling of {CNN} Training by Exploiting Finer-Grained Parallelism},
      year={2019},
      volume={},
      number={},
      pages={210-220},
      doi={10.1109/IPDPS.2019.00031}}

A complete list of LBANN-related publications, presentations, and posters is available here.

Reporting issues

Issues, questions, and bugs can be raised on the GitHub issue tracker.
