This is a very simplified implementation of a convolutional neural network (CNN) in C++ and CUDA, using the GPU for acceleration. It is intended for educational purposes and is not tuned for performance or accuracy. Some functionality was also omitted for simplicity: proper data loading, a CLI and metrics, inference, and model persistence.
I tried to squeeze out as much performance as possible without resorting to complex optimizations, but limited time forced some trade-offs. For example, the memory footprint is large because data loading lacks a proper iterator and data is copied inside the layers.
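One way to reduce the host-side copying mentioned above is to hand out batches as non-owning views into a single contiguous dataset buffer. The sketch below assumes the whole dataset already sits in one `std::vector`; `BatchView` and `BatchIterator` are hypothetical names, not the project's actual loader.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view of one mini-batch inside a contiguous dataset buffer.
template <typename T>
struct BatchView {
    const T* data;          // first element of the batch (points into the dataset, no copy)
    std::size_t count;      // number of samples in this batch
    std::size_t sample_len; // elements per sample, e.g. 28 * 28 for MNIST
};

// Hypothetical iterator that produces batches by pointer arithmetic
// instead of copying samples into a fresh std::vector for every batch.
template <typename T>
class BatchIterator {
public:
    BatchIterator(const std::vector<T>& dataset, std::size_t sample_len, std::size_t batch_size)
        : dataset_(dataset), sample_len_(sample_len), batch_size_(batch_size) {
        assert(sample_len > 0 && batch_size > 0);
        assert(dataset.size() % sample_len == 0); // buffer must hold whole samples only
    }

    // Writes the next batch into `out`; returns false once every sample has been handed out.
    // The last batch may be smaller than batch_size.
    bool next(BatchView<T>& out) {
        const std::size_t total = dataset_.size() / sample_len_;
        if (cursor_ >= total) return false;
        const std::size_t remaining = total - cursor_;
        const std::size_t n = remaining < batch_size_ ? remaining : batch_size_;
        out = BatchView<T>{dataset_.data() + cursor_ * sample_len_, n, sample_len_};
        cursor_ += n;
        return true;
    }

    // Starts a new epoch from the first sample.
    void reset() { cursor_ = 0; }

private:
    const std::vector<T>& dataset_; // the iterator never owns or copies the data
    std::size_t sample_len_;
    std::size_t batch_size_;
    std::size_t cursor_ = 0; // index of the next sample to hand out
};
```

Each view can then be uploaded with a single `cudaMemcpy` of `count * sample_len` elements. Uploading the whole dataset to the GPU once and slicing device pointers the same way would also remove the per-batch transfers.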
To simplify data handling, I used templates for most of the CUDA code, which should make it much easier to support different data types out of the box.
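As an illustration of what templating buys, here is a minimal sketch using a hypothetical element-wise ReLU. The names `relu`, `relu_kernel`, `relu_host` and the `HOST_DEVICE` macro are illustrative and not taken from the project. The logic is written once and instantiated for `float`, `double` or `int` at the call site. The macro expands to `__host__ __device__` under nvcc and to nothing otherwise, so the host reference compiles and can be tested without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>

// Under nvcc the helper can be called from both host and device code;
// with a plain C++ compiler the qualifiers expand to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical element-wise ReLU, written once for any arithmetic type T.
template <typename T>
HOST_DEVICE inline T relu(T x) {
    return x > T(0) ? x : T(0);
}

#ifdef __CUDACC__
// Hypothetical kernel, one thread per element, instantiated per type at launch,
// e.g. relu_kernel<float><<<grid, block>>>(d_in, d_out, n);
template <typename T>
__global__ void relu_kernel(const T* in, T* out, std::size_t n) {
    const std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < n) out[i] = relu(in[i]); // guard: the last block may run past n
}
#endif

// Host reference that reuses the same templated helper, handy for checking GPU results.
template <typename T>
void relu_host(const T* in, T* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i) out[i] = relu(in[i]);
}
```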
To compile and run the code, you need a CUDA-capable GPU and the CUDA toolkit installed. Then follow these steps:
- Clone the repo. Modify main.cu to change hyperparameters or dataset paths if needed (note that I may use slightly different file names for the unpacked MNIST data).
- Compile using CMake. I recommend opening the project in CLion, since it handles most of the setup automatically; just make sure to select a compiler toolchain with CUDA support. From the command line, configure the build directory once and then build:
cmake -S . -B ./cmake-build-<build-type>
cmake --build ./cmake-build-<build-type> --target CNN_CUDA -j $(nproc)
You can most likely compile with nvcc directly instead of CMake, but I haven't tested it.
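- Run the executable from the build directory (CMake names it after the target, CNN_CUDA, unless the project sets a different output name):
./CNN_CUDA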
Place the MNIST dataset files in the directory you run the executable from (normally the same directory as the executable), or modify the paths in main.cu accordingly.
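The MNIST files use the IDX format: a big-endian header followed by raw bytes (the standard distribution names them like `train-images-idx3-ubyte`). The sketch below is a minimal, hypothetical reader for that format, not the project's own loader; `IdxImages`, `read_idx_images`, `read_idx_labels`, `image_at` and `load_idx` are illustrative names. Scaling pixels to [0, 1] is an assumption about the preprocessing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <sstream> // std::istringstream lets the parsers run on in-memory bytes, e.g. in tests
#include <stdexcept>
#include <string>
#include <vector>

// IDX header fields are big-endian 32-bit unsigned integers.
inline std::uint32_t read_be_u32(std::istream& in) {
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) throw std::runtime_error("truncated IDX header");
    return (std::uint32_t(b[0]) << 24) | (std::uint32_t(b[1]) << 16) |
           (std::uint32_t(b[2]) << 8) | std::uint32_t(b[3]);
}

// Hypothetical container for a parsed image file.
struct IdxImages {
    std::uint32_t count = 0, rows = 0, cols = 0;
    std::vector<float> pixels; // count * rows * cols values, scaled from [0, 255] to [0, 1]
};

// Image files: magic 0x00000803, then count, rows, cols, then one unsigned byte per pixel.
inline IdxImages read_idx_images(std::istream& in) {
    if (read_be_u32(in) != 0x00000803) throw std::runtime_error("not an IDX image file");
    IdxImages img;
    img.count = read_be_u32(in);
    img.rows = read_be_u32(in);
    img.cols = read_be_u32(in);
    const std::size_t n = std::size_t(img.count) * img.rows * img.cols;
    std::vector<unsigned char> raw(n);
    if (!in.read(reinterpret_cast<char*>(raw.data()), static_cast<std::streamsize>(n)))
        throw std::runtime_error("truncated IDX pixel data");
    img.pixels.resize(n);
    for (std::size_t i = 0; i < n; ++i) img.pixels[i] = raw[i] / 255.0f;
    return img;
}

// Label files: magic 0x00000801, then count, then one byte (0-9) per label.
inline std::vector<std::uint8_t> read_idx_labels(std::istream& in) {
    if (read_be_u32(in) != 0x00000801) throw std::runtime_error("not an IDX label file");
    std::vector<std::uint8_t> labels(read_be_u32(in));
    if (!in.read(reinterpret_cast<char*>(labels.data()), static_cast<std::streamsize>(labels.size())))
        throw std::runtime_error("truncated IDX label data");
    return labels;
}

// Pointer to the first pixel of image i (rows * cols contiguous floats).
inline const float* image_at(const IdxImages& img, std::size_t i) {
    assert(i < img.count);
    return img.pixels.data() + i * std::size_t(img.rows) * img.cols;
}

// Opens a file in binary mode and applies a parser; relative paths resolve
// against the current working directory, not the executable's location.
template <typename Parser>
auto load_idx(const std::string& path, Parser parse) {
    std::ifstream f(path, std::ios::binary);
    if (!f) throw std::runtime_error("cannot open " + path);
    return parse(f);
}
```

For example, `load_idx("train-images-idx3-ubyte", read_idx_images)` would parse the training images, assuming that file name.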
I haven't added CLI arguments or proper inference because I wanted to implement the core CNN in CUDA as a small challenge, and I don't think they matter much for this demo. The same goes for block_size and grid_size: they are hardcoded in most places, which is a subject for future improvement.
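A common alternative to hardcoding grid_size is to derive it from the problem size with ceiling division, keeping only the block size as a tunable constant. The sketch below is hypothetical: `LaunchConfig`, `LaunchConfig2D`, `launch_config_1d` and `launch_config_2d` are illustrative names, and the plain structs stand in for CUDA's `dim3` so the code compiles without the toolkit. The CUDA runtime's `cudaOccupancyMaxPotentialBlockSize` can additionally suggest a block size per kernel.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for dim3 so this compiles without the CUDA toolkit.
struct LaunchConfig {
    unsigned int grid;  // number of blocks
    unsigned int block; // threads per block
};

struct LaunchConfig2D {
    unsigned int grid_x, grid_y;   // blocks along width / height
    unsigned int block_x, block_y; // threads per block along width / height
};

// Ceiling division: smallest number of blocks of size d that covers n elements.
constexpr std::size_t ceil_div(std::size_t n, std::size_t d) { return (n + d - 1) / d; }

// 1-D configuration for element-wise kernels; kernels must still guard with
// `if (i < n)` because the last block may be only partially used.
inline LaunchConfig launch_config_1d(std::size_t n, unsigned int block = 256) {
    assert(block > 0 && block <= 1024); // 1024 threads per block is the hardware limit
    std::size_t blocks = ceil_div(n, block);
    if (blocks == 0) blocks = 1; // a zero-sized grid is an invalid launch configuration
    return {static_cast<unsigned int>(blocks), block};
}

// 2-D configuration with square tiles, e.g. one thread per output pixel of a convolution.
inline LaunchConfig2D launch_config_2d(std::size_t width, std::size_t height, unsigned int tile = 16) {
    assert(tile > 0 && tile * tile <= 1024);
    const std::size_t gx = ceil_div(width, tile), gy = ceil_div(height, tile);
    return {static_cast<unsigned int>(gx == 0 ? 1 : gx),
            static_cast<unsigned int>(gy == 0 ? 1 : gy), tile, tile};
}
```

Under nvcc, a launch would then look like `kernel<<<cfg.grid, cfg.block>>>(...)`, or use `dim3(c.grid_x, c.grid_y)` and `dim3(c.block_x, c.block_y)` for the 2-D case.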