Add Metal backend for Apple Silicon inference #528
Open
binoculars wants to merge 1 commit into microsoft:main from
This commit adds a complete Metal (Apple GPU) backend for BitNet inference, enabling high-performance 1.58-bit quantized neural network execution on Apple Silicon (M1, M2, M3 series).

Key features:
- Metal compute shaders for int8×int2 matrix multiplication
- 256-thread configuration (32×8) for optimal GPU utilization
- PyTorch model wrapper with MPS fallback
- Comprehensive profiling and testing utilities
- Support for real BitNet models (tested with bitnet_b1_58-large)
- Updated README.md with Metal support in model tables

Performance results (bitnet_b1_58-large, 435M params):
- Up to 24x faster than CPU
- 900+ tokens/sec throughput on Metal
- Verified with 24-layer real model

Files added:
- gpu/metal_kernels/: Metal implementation with README.md
- utils/: Testing and profiling tools
- Updated README.md with Metal support information

Tested configurations:
- Thread layout: 32×8 (256 threads)
- Models: bitnet_b1_58-large, custom test models
- Batch sizes: 1-16
- Sequence lengths: 1-512
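As a rough illustration of what "int8×int2 matrix multiplication" means, the following is a minimal pure-Python reference, not the Metal shader from this PR. It assumes ternary weights (-1, 0, +1) stored as 2-bit codes `w + 1`, four codes per byte with the first weight in the lowest bits; the actual packing layout used by the kernels is not shown here and may differ. The helper names `pack_ternary`, `unpack_ternary`, and `matvec_int8_int2` are hypothetical.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) into bytes, four 2-bit codes per byte.

    Assumed layout (illustrative only): code = w + 1, first weight in bits 0-1.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            if w not in (-1, 0, 1):
                raise ValueError(f"weight {w} is not ternary")
            byte |= (w + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed):
    """Inverse of pack_ternary: recover the list of ternary weights."""
    weights = []
    for byte in packed:
        for j in range(4):
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights


def matvec_int8_int2(activations, packed_rows):
    """Compute y[r] = sum_k activations[k] * W[r][k] for int8 activations
    and 2-bit packed ternary weight rows, accumulating in a wide integer."""
    result = []
    for row in packed_rows:
        row_weights = unpack_ternary(row)
        result.append(sum(a * w for a, w in zip(activations, row_weights)))
    return result


rows = [pack_ternary([1, 0, -1, 1]), pack_ternary([0, 0, 1, -1])]
print(matvec_int8_int2([10, -20, 30, 40], rows))  # prints [20, -10]
```

Because every weight is -1, 0, or +1, the inner product reduces to additions and subtractions of activations, which is what makes a packed 2-bit representation attractive for a GPU kernel.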
@binoculars please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Add Metal Backend for Apple Silicon Inference
Overview
This PR introduces a complete Metal (Apple GPU) backend for BitNet, enabling high-performance 1.58-bit quantized LLM inference on Apple Silicon devices (M1, M2, M3 series). This significantly expands BitNet's platform support to include macOS and provides substantial performance improvements over CPU-only inference.
Motivation
BitNet currently provides excellent performance on x86_64 (AVX2), ARM64 (NEON), and CUDA platforms. However, there's no optimized backend for Apple Silicon Macs, which represent a significant portion of the developer and research community. This PR fills that gap with a production-ready Metal implementation.
Key Features
🚀 Performance
🔧 Implementation
📊 Testing & Validation
Performance Benchmarks
Real Model: bitnet_b1_58-large (435M parameters)
Technical Optimizations
Architecture
New Components
Integration
The implementation integrates seamlessly with existing BitNet infrastructure:
gpu/model.py for CUDA
Usage
Basic Usage
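The commit message mentions a PyTorch model wrapper with MPS fallback. A minimal sketch of how a caller might choose a device under that scheme is shown below, assuming PyTorch may or may not be installed. `pick_device` is a hypothetical helper for illustration and is not part of this PR's API.

```python
def pick_device(prefer_mps: bool = True) -> str:
    """Return "mps" when PyTorch reports a usable Metal device, else "cpu".

    Hypothetical helper: falls back to "cpu" if PyTorch is missing or
    the installed build has no MPS support.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    if prefer_mps and mps_backend is not None and mps_backend.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"running on {device}")
```

A model and its inputs would then be moved to the chosen device in the usual PyTorch way (for example with `.to(device)`).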
Profiling
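Throughput figures such as tokens per second can be measured with a simple wall-clock harness. The sketch below is an assumption about how such a measurement could be taken, not the profiling utility added by this PR; `tokens_per_second` and the `generate_fn` callable are hypothetical names.

```python
import time


def tokens_per_second(generate_fn, num_tokens, warmup=1, repeats=3):
    """Time generate_fn(num_tokens) and return the best observed tokens/sec.

    generate_fn is any callable that produces num_tokens tokens and blocks
    until the work has finished.
    """
    for _ in range(warmup):
        generate_fn(num_tokens)  # warm-up runs are excluded from timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(num_tokens)
        best = min(best, time.perf_counter() - start)
    return num_tokens / max(best, 1e-9)  # guard against a zero-length interval


def fake_generate(n):
    # Stand-in workload so the sketch runs without a model.
    sum(range(1000 * n))


print(f"{tokens_per_second(fake_generate, 16):.1f} tokens/sec")
```

On an asynchronous GPU backend the callable needs to wait for queued work to complete before returning; otherwise the timer only measures how long it takes to enqueue the work.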
Compatibility
Documentation
- docs/METAL_QUICKSTART.md
- docs/inference_analysis.md
- gpu/metal_kernels/README.md
- docs/TEST_RESULTS.md
Testing
All tests pass with real model configurations:
Impact
For Microsoft/BitNet
For Users
Checklist
Future Work
Potential enhancements for follow-up PRs:
.metallib) for faster startup
References
Ready for Review: This PR is production-ready and fully tested. All 22 files have been validated with real BitNet models.
Performance Verification: Tested with bitnet_b1_58-large (435M parameters, 24 layers) achieving up to 24x speedup over CPU inference.