Over the past few months, I've been working on a new side project: Honeycrisp, a deep learning framework written in Swift and designed for the Mac. My goal with Honeycrisp was to build a framework that felt truly different from the status quo—something more Apple.
For me, part of the fun of building Honeycrisp was being re-introduced to the Apple ecosystem. Earlier this year, I decided to learn Swift, the programming language of modern Mac and iPhone development. Meanwhile, after doing some research on Apple's specialized deep learning hardware, I couldn't resist buying myself a Mac Studio. This was the first Apple computer I've purchased in over a decade.
After buying a Mac, I knew what I was in for. I had mentally signed myself up to build a framework in Swift that I could use on this Mac for all of my future deep learning hobby projects.
The best way to test and improve something is to use it yourself, and that's exactly what I've been doing with Honeycrisp. For example, I recently trained a small text-to-image model on my Mac using Honeycrisp. With each new project I build on top of this framework, I expect to discover more missing features and bugs that I would have missed otherwise.
Why not MLX?
I haven't forgotten about MLX, Apple's own deep learning framework. In general, the MLX team has built something great, with well-optimized Metal kernels and a really clean codebase.
However, I don't feel that MLX is substantially different from the frameworks I've used on other platforms, and it doesn't feel like a real Apple framework. Just like frameworks built for other platforms, MLX is written in C++ and focuses on its Python front-end. It doesn't even support specialized hardware like the Apple Neural Engine.
While MLX does include Swift bindings, they are feature-incomplete[1] and don't take full advantage of modern Swift features like Tasks, the Codable protocol, property wrappers[2], the ... operator in subscripts, and stack tracing macros like #file and #line.
In my opinion, the Swift front-end for MLX feels like any other Swift bindings for a framework that was not built with Apple developers in mind.
In some of the following sections, I will compare Honeycrisp to MLX. I don't want my criticisms of MLX to be hurtful to the MLX team. Instead, I'd love to see these criticisms lead to useful discussions and tangible improvements.
A quick overview: tensors and operations
Before diving into how Honeycrisp works and what makes it unique, let's give a few examples of Honeycrisp code.
The fundamental data type in Honeycrisp is a Tensor, which is a multi-dimensional array of data. As in other array libraries, every tensor has a shape that defines its size along each dimension. For example, a 1-D array of 100 elements would have shape [100]. A 2-D array could have shape [2, 3] to represent 2 as the outer (row) dimension and 3 as the inner (column) dimension.
When we create a tensor, we can do so with a combination of raw data and a shape:
    // Create the following matrix:
    //   1 2 3
    //   4 5 6
    let matrix = Tensor(
        data: [1, 2, 3, 4, 5, 6],
        shape: [2, 3]
    )
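To make the relationship between the flat data array and the shape concrete, here is a minimal sketch of row-major indexing. The `flatOffset` helper is hypothetical and written only for illustration; it assumes the row-major layout implied by the matrix comment above and is not taken from Honeycrisp's source.

```swift
// Hypothetical helper (not part of Honeycrisp): map a multi-dimensional
// index to an offset into a flat, row-major data array.
func flatOffset(of index: [Int], inShape shape: [Int]) -> Int {
    precondition(index.count == shape.count, "need one index per dimension")
    var offset = 0
    for (i, size) in zip(index, shape) {
        precondition((0..<size).contains(i), "index out of bounds")
        // Moving one step along an outer dimension skips `size` inner elements.
        offset = offset * size + i
    }
    return offset
}

let flatData = [1, 2, 3, 4, 5, 6]
let matrixShape = [2, 3]

print(flatData[flatOffset(of: [1, 2], inShape: matrixShape)]) // 6
print(flatData[flatOffset(of: [0, 2], inShape: matrixShape)]) // 3
```

The inner (column) dimension varies fastest, which is why the first three values of the data array form the first row.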
We can perform operations on a tensor using operators or methods:
    let matrixPlus1 = matrix + 1           // Adds 1 to every element
    let sumOfColumns = matrix.sum(axis: 1) // Result: [6 15]
Every Tensor also has a data type, i.e. the type of elements that it stores. In the above examples, we created tensors of integers, so the data type defaults to .int64. We can explicitly specify the data type when creating a tensor, and we can convert data types as well:
    let xFloat = Tensor(
        data: [1, 2, 3],
        shape: [3],
        dtype: .float32
    )
    let xInt = xFloat.cast(.int64)

    // Would fail with an error, due to mismatched data types:
    let sum = xFloat + xInt
Everything as an asynchronous task
Like in many other deep learning frameworks, all computations in Honeycrisp are performed asynchronously. In particular, when we perform an operation on Tensor objects, we get a new Tensor immediately. When we want an actual result, we need to use try await to wait for the computation to finish and handle any errors encountered along the way:
    try await tensor.data     // Get the raw data
    try await tensor.floats() // Get a [Float]
    try await tensor.item()   // Get a Float
While asynchronous computation is a popular design decision amongst deep learning frameworks, it is uniquely necessary for a Swift-first framework due to how Swift implements error handling. In Swift, exceptions cannot be raised from arithmetic operators like +. Furthermore, every call to a method which raises an exception requires an explicit try keyword. Honeycrisp would be at a grave disadvantage compared to Python frameworks if code looked like this:
    // Hypothetical bad code without async computation
    let x = try add(a, b) // can't use + operator
    let y = try x.sum(axis: 1)
    let z = try layer2(y).gelu()
    let out = try z.item()
In comparison, we avoid this with asynchronous computation, by pushing error handling until the end:
    let x = a + b
    let y = x.sum(axis: 1)
    let z = layer2(y).gelu()
    let out = try await z.item()
Asynchronous computation in Honeycrisp heavily leans on the Swift concurrency model. The data of a Tensor object is stored as a Task<Tensor.Data, Error> instance variable. When requesting the result of an operation, we literally just await the result of this data task.
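The following sketch illustrates the general pattern of storing a value as a Task and deferring errors until the result is awaited. `LazyVector` and `DivisionByZero` are hypothetical names invented for this example; this is a simplified illustration of the idea, not Honeycrisp's actual Tensor implementation.

```swift
// Hypothetical stand-in for a tensor: the values live in a Task that may fail.
struct LazyVector {
    let task: Task<[Float], Error>

    init(_ values: [Float]) {
        task = Task { values }
    }

    init(task: Task<[Float], Error>) {
        self.task = task
    }

    // Await the underlying task; any deferred error surfaces here.
    func floats() async throws -> [Float] {
        try await task.value
    }
}

struct DivisionByZero: Error {}

// The operator itself never throws: it returns immediately with a new task.
func / (lhs: LazyVector, rhs: Float) -> LazyVector {
    LazyVector(task: Task {
        guard rhs != 0 else { throw DivisionByZero() }
        return try await lhs.task.value.map { $0 / rhs }
    })
}

let halved = LazyVector([2, 4, 6]) / 2       // no `try` needed here
let broken = LazyVector([2, 4, 6]) / 0 / 2   // still no `try`; the error is deferred

print(try await halved.floats())             // [1.0, 2.0, 3.0]
do {
    _ = try await broken.floats()
} catch {
    print("deferred error:", error)
}
```

Because each new task awaits the task of its input, a failure anywhere in the chain propagates to whoever finally awaits the result.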
One final thing to note is that some operations fail synchronously and abort program execution. These are errors like shape or type mismatches that can be caught right away and are considered user mistakes rather than runtime errors. This is also necessary because every Tensor knows its shape and data type immediately, even before computation completes, and Honeycrisp does not provide a way to represent a tensor with an invalid shape or type.
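As an illustration of why shape errors can be detected before any data exists, here is a sketch of a broadcast shape check that only looks at the two shapes. It assumes NumPy-style broadcasting rules (dimensions are compared from the right, and a size of 1 stretches to match); the post does not spell out Honeycrisp's exact rules, and `broadcastShape` here is a hypothetical helper that returns nil instead of aborting.

```swift
// Hypothetical helper: returns the broadcast shape, or nil if the shapes
// are incompatible. Assumes NumPy-style rules, which is an assumption.
func broadcastShape(_ a: [Int], _ b: [Int]) -> [Int]? {
    let rank = max(a.count, b.count)
    // Left-pad the shorter shape with 1s so both have the same rank.
    let paddedA = Array(repeating: 1, count: rank - a.count) + a
    let paddedB = Array(repeating: 1, count: rank - b.count) + b
    var result = [Int]()
    for (x, y) in zip(paddedA, paddedB) {
        if x == y || y == 1 {
            result.append(x)
        } else if x == 1 {
            result.append(y)
        } else {
            return nil // sizes differ and neither is 1
        }
    }
    return result
}

print(broadcastShape([2, 3], [3]) as Any)             // Optional([2, 3])
print(broadcastShape([50000, 2], [1, 100000]) as Any) // nil
```

Nothing in this check touches tensor data, so it can run at the moment the operation is requested rather than when the result is awaited.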
Modular backends
Each Mac and iPhone has at least three different types of hardware acceleration for matrix multiplication: the GPU, the Apple Neural Engine (ANE), and special AMX instructions on the CPU. A deep learning framework should be able to leverage all three of these accelerators, possibly even concurrently in the same training job.
In Honeycrisp, operations are implemented inside of a Backend, and different backends can provide interfaces to different hardware. At runtime, we can swap out the backend even at the granularity of individual lines of code in our model.
For example, we could mix the CPU and GPU like so:
    Backend.defaultBackend = try MPSBackend() // Use the GPU by default
    let cpuBackend = CPUBackend()
    let x = Tensor(rand: [128, 128])  // Performed on GPU
    let y = cpuBackend.use { x + 3 }  // Performed on CPU
    let z = y - 3                     // Performed on GPU
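One way a scoped backend override like this could be built is with Swift's task-local values. The sketch below uses a hypothetical `SketchBackend` class and a `use` method written for this example; it shows the scoping behavior only and makes no claim about how Honeycrisp's Backend is implemented internally.

```swift
// Hypothetical backend type; a real backend would expose operations too.
final class SketchBackend: Sendable {
    let name: String
    init(name: String) { self.name = name }

    // The backend that operations should use in the current task.
    @TaskLocal static var current = SketchBackend(name: "gpu")

    // Run `body` with this backend as the current one, then restore the old one.
    func use<T>(_ body: () throws -> T) rethrows -> T {
        try SketchBackend.$current.withValue(self, operation: body)
    }
}

// Stand-in for an operation: it just reports which backend it would run on.
func currentBackendName() -> String {
    SketchBackend.current.name
}

let cpu = SketchBackend(name: "cpu")
let before = currentBackendName()
let inside = cpu.use { currentBackendName() }
let after = currentBackendName()
print(before, inside, after) // gpu cpu gpu
```

The override is bound only for the duration of the closure, so code outside the closure keeps using the default without any explicit restore step.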
Unlike in other frameworks, a Backend can be implemented natively in Swift, and Honeycrisp already includes the following backends:
- A backend which performs matrix multiplication on the ANE using CoreML.
- A backend for automatically counting the FLOPs of the operations performed on it. This backend can wrap any existing backend.
- A backend for running operations on the GPU with Metal Performance Shaders.
In theory, it is even possible to implement a backend for other GPU hardware, such as a CUDA backend that we could use on Linux.
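To illustrate how one backend can wrap another, as the FLOP-counting backend in the list above does, here is a reduced sketch. The `MatmulBackend` protocol, `NaiveCPUBackend`, and `FLOPCountingBackend` are hypothetical and cover a single operation; the convention of counting each multiply-add as 2 FLOPs is also an assumption of this example.

```swift
// Hypothetical, heavily reduced backend interface with a single operation.
protocol MatmulBackend {
    // Multiply a row-major [rows, inner] matrix by a row-major [inner, cols] matrix.
    func matmul(_ a: [Float], _ b: [Float], rows: Int, inner: Int, cols: Int) -> [Float]
}

struct NaiveCPUBackend: MatmulBackend {
    func matmul(_ a: [Float], _ b: [Float], rows: Int, inner: Int, cols: Int) -> [Float] {
        var out = [Float](repeating: 0, count: rows * cols)
        for i in 0..<rows {
            for k in 0..<inner {
                for j in 0..<cols {
                    out[i * cols + j] += a[i * inner + k] * b[k * cols + j]
                }
            }
        }
        return out
    }
}

// Wraps any other backend and counts each multiply-add as 2 FLOPs.
final class FLOPCountingBackend: MatmulBackend {
    let wrapped: any MatmulBackend
    private(set) var flopCount = 0

    init(wrapping wrapped: any MatmulBackend) { self.wrapped = wrapped }

    func matmul(_ a: [Float], _ b: [Float], rows: Int, inner: Int, cols: Int) -> [Float] {
        flopCount += 2 * rows * inner * cols
        return wrapped.matmul(a, b, rows: rows, inner: inner, cols: cols)
    }
}

let counter = FLOPCountingBackend(wrapping: NaiveCPUBackend())
let product = counter.matmul([1, 2, 3, 4], [5, 6, 7, 8], rows: 2, inner: 2, cols: 2)
print(product, counter.flopCount) // [19.0, 22.0, 43.0, 50.0] 16
```

Because the wrapper only depends on the protocol, the same counting logic would apply to any backend that conforms to it.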
In contrast to Honeycrisp, MLX only supports the CPU and GPU, and this choice is deeply ingrained in the codebase. In particular, all operations (and there are many of them) have an eval() and eval_gpu() method. As you can imagine, this design makes it easy to add new operations but difficult to add new backends (now you must add a new eval_XYZ method to every operation). The thing is, new operations don't come along very often in deep learning, but in the Apple ecosystem it seems that new accelerators come along fairly frequently. As a consequence, I'd argue that MLX's design is the "transpose" of what it should be.
Automatic differentiation
When training neural networks, we typically rely on a deep learning framework to compute derivatives for us; this is called automatic differentiation.
Honeycrisp's low-level API for automatic differentiation looks a bit different than that found in other frameworks. In the next section, you will see that it's typically not necessary to use this API directly, but its implementation is still noteworthy.
During the backward pass, gradients are passed to each tensor via a callback. We can register a callback using the Tensor.onGrad method:
    // The raw parameter data that we want gradients for.
    let xData = Tensor(data: [1.0, 2.0, 3.0])

    // This is where we will store the gradients.
    var grad: Tensor?

    // Create a callback to set `grad` to the gradient.
    // The new tensor `x` has the same data as `xData` but
    // now does something during the backward pass.
    let x = xData.onGrad { g in grad = g }

    // Compute a loss function.
    let loss = x.pow(2).sum().sqrt()

    // Perform backpropagation to compute gradients.
    loss.backward()

    // We can now see the gradient:
    print(try await grad!.floats())
In most deep learning frameworks, automatic differentiation is implemented by explicitly tracking a graph of operations and then computing gradients through this graph. In Honeycrisp, there is no explicit graph tracking; instead, the graph is implicitly tracked by capturing variables in Swift blocks.
We can see how this works by looking at the internal implementation of a differentiable operation. First, if our operation depends on an input, we create a "handle" to this input by calling input.saveForBackward(). We then create a block (Swift's version of an anonymous function) that implements the backward pass for this operation. This block captures the input handle by calling handle.backward() somewhere inside. Here is a simplified example:
    public static func - (lhs: Tensor, rhs: Tensor) -> Tensor {
        let backend = Backend.current // The task-local backend that is being used
        let newData = createDataTask(lhs, rhs) { lhs, rhs in
            // Simplified, but basically true: we use the backend to compute resulting tensor data.
            try await backend.someMethod(lhs, rhs)
        }
        if !Tensor.isGradEnabled || (!lhs.needsGrad && !rhs.needsGrad) {
            // Our result does not require gradients.
            return Tensor(dataTask: newData, shape: outputShape, dtype: lhs.dtype)
        } else {
            // Create references to the arguments.
            let lhsHandle = lhs.saveForBackward()
            let rhsHandle = rhs.saveForBackward()
            return Tensor(dataTask: newData, shape: outputShape, dtype: lhs.dtype) { grad in
                // This block captures lhsHandle and rhsHandle and will release them
                // when the resulting tensor is released.
                lhsHandle.backward(backend) { grad }
                rhsHandle.backward(backend) { -grad }
            }
        }
    }
During the backward pass, gradients for a tensor are accumulated each time a handle's backward() method is called. Once all of a tensor's handles have been released, the tensor runs its own backward block on the accumulated gradient. If we create a tensor that is not used, then the handles that this tensor created for its backward implementation will automatically be released when the tensor itself is released (thanks to the automatic reference counter).
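The handle-and-closure mechanism can be sketched with scalars instead of tensors. The `Value` class below is hypothetical and makes one deliberate simplification: it counts calls to a handle's `backward` method to decide when a node's gradient is complete, rather than relying on handles being released by reference counting, so it does not cover the case of a result that is created but never used.

```swift
// Hypothetical scalar analogue of a tensor with closure-based backprop.
final class Value {
    let data: Double
    var accumulated = 0.0
    var pendingHandles = 0
    // Called once with the total gradient after every handle has reported.
    let backwardFn: ((Double) -> Void)?

    init(_ data: Double, backwardFn: ((Double) -> Void)? = nil) {
        self.data = data
        self.backwardFn = backwardFn
    }

    // A reference that a downstream operation holds until its backward pass runs.
    final class Handle {
        let owner: Value
        init(owner: Value) { self.owner = owner }

        func backward(_ grad: Double) {
            owner.accumulated += grad
            owner.pendingHandles -= 1
            if owner.pendingHandles == 0 { owner.backwardFn?(owner.accumulated) }
        }
    }

    func saveForBackward() -> Handle {
        pendingHandles += 1
        return Handle(owner: self)
    }

    // Seed the backward pass from this value (typically a loss).
    func backward() { backwardFn?(1.0) }

    static func * (lhs: Value, rhs: Value) -> Value {
        let (l, r) = (lhs.saveForBackward(), rhs.saveForBackward())
        let (a, b) = (lhs.data, rhs.data)
        // The closure captures both handles; that capture is the graph.
        return Value(a * b) { grad in
            l.backward(grad * b)
            r.backward(grad * a)
        }
    }
}

var xGrad = 0.0
let x = Value(3.0) { xGrad = $0 }
let y = x * x      // x hands out two handles here
let loss = y * x   // and a third one here, so loss = x^3
loss.backward()
print(xGrad)       // 27.0
```

Note that `x` only runs its own callback after all three of its handles have reported, which is what lets the three partial gradients be summed before being delivered.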
Declaring neural networks
In frameworks like PyTorch, it is typical to define our neural network as a nested hierarchy of modules (which are themselves just classes). Among other benefits, this makes it easy to keep track of all the neural network's learnable parameters in one place.
In particular, we want each class to automatically keep track of its own parameters, as well as all of its children with their own nested parameters. In Honeycrisp, we can do this by subclassing Trainable and using property wrappers:
    class MyModel: Trainable {
        // A parameter which will be tracked automatically
        @Param var someParameter: Tensor

        // We can also give parameters custom names
        @Param(name: "customName") var otherParameter: Tensor

        // A sub-module whose parameters will also be tracked
        @Child var someLayer: Linear

        override init() {
            super.init()
            self.someParameter = Tensor(data: [1.0])
            self.otherParameter = Tensor(zeros: [7])
            self.someLayer = Linear(inCount: 3, outCount: 7)
        }

        func callAsFunction(_ input: Tensor) -> Tensor {
            // We can access properties like normal
            return someParameter * (someLayer(input) + otherParameter)
        }
    }
One really nice thing about this example is that our model code can use someParameter as if it were a regular Tensor. However, it's not! In reality, it's actually a property wrapper of type Trainable.Parameter which accumulates gradients automatically during the backward pass. If we want the gradient explicitly, we can use projected values, like $someParameter.grad.
In practice, we typically won't access the parameters directly this way. Rather, every Trainable automatically implements a parameters property that returns a mapping from parameter names to parameter objects. We can pass this mapping directly to an optimizer and it will manage gradients for us:
    let model = MyModel()

    // Create an optimizer that references the model's parameters
    let optimizer = Adam(model.parameters, lr: 0.001)

    let loss = someLossFunction(model)
    loss.backward()
    optimizer.step()       // Automatically updates the data inside of each parameter
    optimizer.clearGrads() // Clears out the grad of each parameter
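For readers curious how a base class might discover wrapped properties automatically, here is one possible approach using Swift's Mirror reflection API. This is a guess at a workable technique, not a description of Honeycrisp's internals: `SketchParam`, `SketchTrainable`, and `TinyModel` are hypothetical, parameters are plain float arrays, and nested child modules are not handled.

```swift
// Hypothetical parameter wrapper; a class so that the model and the
// parameter dictionary share the same storage.
@propertyWrapper
final class SketchParam {
    var wrappedValue: [Float]
    var grad: [Float]?
    // `$name` exposes the wrapper itself, e.g. `$weight.grad`.
    var projectedValue: SketchParam { self }

    init(wrappedValue: [Float]) { self.wrappedValue = wrappedValue }
}

class SketchTrainable {
    // Discover parameters by reflecting over the stored properties.
    var parameters: [String: SketchParam] {
        var found = [String: SketchParam]()
        for child in Mirror(reflecting: self).children {
            guard let label = child.label, let param = child.value as? SketchParam else { continue }
            // Wrapper storage is named with a leading underscore, e.g. `_weight`.
            let name = label.hasPrefix("_") ? String(label.dropFirst()) : label
            found[name] = param
        }
        return found
    }
}

final class TinyModel: SketchTrainable {
    @SketchParam var weight: [Float] = [1, 2]
    @SketchParam var bias: [Float] = [0]
}

let tinyModel = TinyModel()
tinyModel.weight[0] = 5            // reads and writes look like a plain array
tinyModel.$bias.grad = [0.5]       // the projected value reaches the wrapper
print(tinyModel.parameters.keys.sorted())             // ["bias", "weight"]
print(tinyModel.parameters["weight"]!.wrappedValue)   // [5.0, 2.0]
```

Because the wrapper is a reference type, an optimizer holding the dictionary would see and update the same storage that the model reads from.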
Serializing models
The Swift standard library makes it easy to encode objects into various formats; all you need to do is implement the Codable protocol. In Honeycrisp, we can save and restore tensors by using the TensorState class, which implements Codable for us.
    // Example model state which implements Codable.
    // Note how we use TensorState instead of Tensor.
    struct MyModelState: Codable {
        let stepIndex: Int
        let myTensor: TensorState
    }

    // Example of a Tensor that we'd like to save/restore.
    let myTensor = Tensor(data: [1, 2, 3], shape: [3])

    // Create the Codable object to save.
    let state = MyModelState(
        stepIndex: 1000,
        myTensor: try await myTensor.state() // Get a TensorState
    )

    // Encode as a binary property list
    let data = try PropertyListEncoder().encode(state)

    // Decode the binary property list
    let decoder = PropertyListDecoder()
    let loadedState = try decoder.decode(MyModelState.self, from: data)

    // Turn the TensorState back into a Tensor
    let myTensorLoaded = Tensor(state: loadedState.myTensor)
In the example above, we used Swift's built-in support for Property List files. A really nice feature of this particular format is that Xcode provides a GUI to view and edit it.
Want to change the name of some parameter? Just double-click to rename. Want to remove the optimizer state from a final checkpoint before shipping the weights? Simply delete a row with the click of a button. This might seem like a pretty silly thing to be happy about, but it really does save some otherwise annoying scripting.
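For anyone who wants to try the property list round trip without Honeycrisp installed, here is a self-contained sketch using only Foundation. `SketchTensorState` and `SketchCheckpoint` are hypothetical stand-ins; the real TensorState type certainly stores its data differently.

```swift
import Foundation

// Hypothetical stand-in for a serializable tensor snapshot.
struct SketchTensorState: Codable, Equatable {
    let shape: [Int]
    let floats: [Float]
}

struct SketchCheckpoint: Codable, Equatable {
    let stepIndex: Int
    let params: [String: SketchTensorState]
}

let checkpoint = SketchCheckpoint(
    stepIndex: 1000,
    params: ["weight": SketchTensorState(shape: [2], floats: [0.5, -1.5])]
)

// Binary property lists are compact; `.xml` would be human-readable instead.
let encoder = PropertyListEncoder()
encoder.outputFormat = .binary
let encoded = try encoder.encode(checkpoint)

let restored = try PropertyListDecoder().decode(SketchCheckpoint.self, from: encoded)
print(restored == checkpoint) // true
```

Writing `encoded` to a file with a .plist extension should produce something that can be opened in Xcode's property list editor.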

Errors with stack traces
I've used Python for the majority of my deep learning career. When a Python program crashes due to an exception, it will automatically print a stack trace. This is very helpful for debugging large experiments, especially ones that are running across multiple machines. In this scenario, looking at logs is often the easiest (and most effective) first step of debugging.
With this in mind, I was very surprised to find out what happens when you run into an error in a framework like MLX for Swift. Instead of a helpful stack trace, your program prints a small, out-of-context error message before terminating. If you want to dig into the cause, you'll have to rerun the program with a debugger attached.
To give an example, here's the output from a shape error:
    MLX error: Shapes (500000,2,1) and (1,1000000,1) cannot be broadcast. at /Users/alex/Library/Developer/Xcode/DerivedData/TryMLX-aoumphwiplvicwcvxxqqursdoazy/SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/ops.cpp:24
    Program ended with exit code: 255
Similarly, if you run into an out-of-memory error:
libc++abi: terminating due to uncaught exception of type std::runtime_error: Attempting to allocate 4000000000000 bytes which is greater than the maximum allowed buffer size of 17179869184 bytes.
In Honeycrisp, we can do much better. Even though Swift doesn't have native support for printing stack traces, it provides a nifty trick. Using default arguments and built-in macros, a function can observe the code location that called it:
    func myFunction(..., file: StaticString = #file, line: UInt = #line) {
        // Push the caller (file, line) to the stack trace.
        ...
        // Pop the last caller from the stack trace
    }
It would be pretty tedious to manually apply this trick to every function and method throughout a codebase. Luckily, we can use macros to do most of the work. Every method in Honeycrisp is wrapped with a @recordCaller macro, which automatically adds arguments to the method and wraps the function body in a stack trace push/pop.
The end result is that any error in Honeycrisp includes a stack trace. This stack trace may be incomplete for a few reasons, but it will typically at least indicate the external code location that triggered an error within Honeycrisp. This is often exactly what we need for debugging.
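Here is a hand-written sketch of the push/pop pattern, without any macro. Everything in it is hypothetical and exists only to show the mechanism: the `Tracer` class, its `record` helper, and the `TracedError` type are invented for this example, and the frames are stored on a single object rather than in whatever task-aware storage a real implementation would need.

```swift
struct TracedError: Error {
    let message: String
    let trace: [String]
}

// Hypothetical recorder that keeps a manual stack of call sites.
final class Tracer {
    private(set) var frames = [String]()

    // Run `body` with the caller's location pushed; pop it on every exit path.
    private func record<T>(_ name: String, file: String, line: UInt,
                           _ body: () throws -> T) rethrows -> T {
        frames.append("\(name) at \(file):\(line)")
        defer { frames.removeLast() }
        return try body()
    }

    // The default arguments are evaluated at the call site, not here.
    func checkedAdd(_ a: [Int], _ b: [Int],
                    file: String = #fileID, line: UInt = #line) throws -> [Int] {
        try record("checkedAdd(_:_:)", file: file, line: line) {
            guard a.count == b.count else {
                throw TracedError(message: "lengths \(a.count) and \(b.count) differ",
                                  trace: frames)
            }
            return zip(a, b).map { $0 + $1 }
        }
    }

    func trainingStep(file: String = #fileID, line: UInt = #line) throws -> [Int] {
        try record("trainingStep()", file: file, line: line) {
            try checkedAdd([1, 2, 3], [4, 5])
        }
    }
}

let tracer = Tracer()
do {
    _ = try tracer.trainingStep()
} catch let error as TracedError {
    print("Traceback:")
    for frame in error.trace { print("  " + frame) }
    print("Error:", error.message)
} catch {
    print("unexpected error:", error)
}
```

The error carries a snapshot of the frames taken at the moment it was thrown, so the trace survives even though each frame is popped as the call unwinds.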
For example, here's a shape error in Honeycrisp. Note that the stack trace begins with some user code in Entrypoint.swift:37, which is the line where a Honeycrisp method was called with badly shaped tensors.
    HCBacktrace/Backtrace.swift:182: Fatal error:
    Traceback:
    run() at Entrypoint.swift:37
    _add(_:thenMul:) at .../.build/checkouts/honeycrisp/Sources/Honeycrisp/FusedOps.swift:44
    _lazyBroadcast(_:) at .../.build/checkouts/honeycrisp/Sources/Honeycrisp/Broadcast.swift:205
    _broadcastShape(_:) at .../.build/checkouts/honeycrisp/Sources/Honeycrisp/Broadcast.swift:304
    Fatal error: shapes [50000, 2] and [1, 100000] do not support broadcasting
Even when an asynchronous task raises an error, for example due to an allocation failure in a backend, this error will include the context of the original Tensor operation which created the task.
    Error: Error at:
    run() at Entrypoint.swift:38
    _add(_:thenMul:) at .../.build/checkouts/honeycrisp/Sources/Honeycrisp/FusedOps.swift:51
    _createDataTask(_:_:_:_:) at .../.build/checkouts/honeycrisp/Sources/Honeycrisp/Tensor.swift:519
    Error: allocationFailed(4000000000)
Unfortunately, there is one annoying limitation of stack traces in Honeycrisp. At the moment, overloaded arithmetic operators like + do not support tracking the caller's code location. As a result, stack traces will miss the caller of these operators. I truly hope this hole in the Swift language gets filled soon!
Indexing operations
We can use the subscript operator on a tensor to select subsets of it. For example, we can look up a given coordinate in a matrix like so:
    let matrix = Tensor(data: [1, 2, 3, 4, 5, 6], shape: [2, 3])
    matrix[1, 2]  // value is 6
    matrix[0, -1] // value is 3
    matrix[-1, 0] // value is 4
Note that these indexing operations actually return another Tensor, in this case of shape [] (i.e. a single scalar value).
Indexing operations need not return a 0-dimensional tensor. We can also use them to slice out a sub-tensor:
    let firstRow = matrix[0]         // shape: [3]
    let firstColumn = matrix[..., 0] // shape: [2]
    let lastColumn = matrix[..., -1] // shape: [2]
We have seen that we can index a tensor using tensor[index], but I never elaborated on the type that index can have. In Honeycrisp, you can use any index that implements the TensorIndex protocol. In practice, this protocol is implemented for various helpful types already. For example, Range<Int> can be used to slice a range of an axis, and helper types like PermuteAxes can be used to rearrange entire chunks of data in a tensor. Furthermore, we can combine multiple different indices along different axes using commas within the brackets:
    // Take the first row, and the first three columns
    x[0, ...2]

    // Swap the second and third dimension
    x[PermuteAxes(0, 2, 1)]

    // Skip the first two rows, and swap the second and third dimension:
    x[2..., PermuteAxes(0, 1)]
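The idea of a protocol that several index types conform to can be sketched in one dimension. The `AxisIndex` protocol and `select` function below are hypothetical simplifications made up for this example: they operate on a plain Swift array and are not the real TensorIndex protocol, whose requirements are not shown in this post.

```swift
// Hypothetical 1-D analogue of an index protocol: each index type knows how
// to pick positions out of an axis of a given length.
protocol AxisIndex {
    func positions(forAxisOfLength length: Int) -> [Int]
}

extension Int: AxisIndex {
    func positions(forAxisOfLength length: Int) -> [Int] {
        // Negative indices count from the end, so -1 is the last element.
        let resolved = self < 0 ? self + length : self
        precondition((0..<length).contains(resolved), "index out of bounds")
        return [resolved]
    }
}

extension Range: AxisIndex where Bound == Int {
    func positions(forAxisOfLength length: Int) -> [Int] {
        Array(clamped(to: 0..<length))
    }
}

extension PartialRangeFrom: AxisIndex where Bound == Int {
    func positions(forAxisOfLength length: Int) -> [Int] {
        Array(Swift.min(lowerBound, length)..<length)
    }
}

// Accepts any index type that conforms to the protocol.
func select<T>(_ values: [T], _ index: any AxisIndex) -> [T] {
    index.positions(forAxisOfLength: values.count).map { values[$0] }
}

let letters = ["a", "b", "c", "d"]
print(select(letters, -1))    // ["d"]
print(select(letters, 1..<3)) // ["b", "c"]
print(select(letters, 2...))  // ["c", "d"]
```

New index types can be added by conforming them to the protocol, without touching the `select` function itself.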
Performance
While I spent most of my time implementing features to make Honeycrisp pleasant to use, I did spend a bit of time tuning performance. For example, I implemented Metal kernels for fused log-softmax, normalization, Adam updates, and a few other operations. However, I am not a kernel connoisseur, and I leaned heavily on Metal Performance Shaders for more difficult operations like matrix multiplication and convolution.
As a consequence of leaning on MPS, Honeycrisp will typically underperform MLX on workloads that are heavy in GPU matmuls, as MLX includes optimized kernels for this. In contrast, when compared to PyTorch, Honeycrisp can sometimes be faster, especially when using a Backend that supports offloading some computation to the Apple Neural Engine.
However, one thing to keep in mind is that Honeycrisp allows joint execution on the ANE and the GPU. I've been benchmarking it on a Mac Studio with a good GPU, but many devices have slower GPUs. On these devices, the ANE is still just as fast, so it's quite possible that Honeycrisp is already the fastest option for some workloads on these devices.
During development, I focused primarily on model training, and for the most part neglected inference performance. If you want to sample from large language models or diffusion models, you will probably be better off using a framework with support for quantization and other inference-time optimizations. However, I do plan to implement some backends for Honeycrisp that convert the operations to other frameworks like MLX or CoreML, and this could be a potential avenue for writing models in Honeycrisp and then shipping them for faster inference.
Using some new tools
In principle, I don't like using Apple products. I wanted to do all of my development on a Linux laptop, while still being able to test and run code that only runs on macOS. I ended up using Tunnels in Visual Studio Code, which allows me to use VSCode on my laptop as a front-end for a server running on my Mac Studio. All VSCode extensions run directly on the Mac, allowing them to compile code, provide error highlighting, etc., while still working seamlessly on my laptop. Of course, this developer experience has some flaws, the biggest of which is that everything is terrible on high-latency networks (e.g. on an airplane).
Another essential piece of this puzzle was setting up VNC (for remote desktop access) and WireGuard.
It was super useful to be able to VNC directly to 10.9.0.4 from anywhere on any of my devices, and know that I will be met with the Mac desktop. Without the ability to VNC into the Mac, certain things were completely broken. Want to run some code in lldb over SSH? Well, it will just hang. But if you actually look at the GUI of the Mac, you will notice that a popup has appeared asking the user for permissions to run developer tools. Similar things happen for accessing files on the Desktop, or on external drives.
Conclusion
I want to emphasize that I am one person, and everything I just talked about is part of a solo hobby project. By no means am I claiming to have created the best—or even a good—deep learning framework. This project was all about learning, both about Apple development and about deep learning frameworks in general.
With that being said, I am truly proud of what I've built so far, and I don't intend to stop and jump to some other framework any time soon. I'm excited to see if anybody else wants to give this framework a try. If you do, be sure to check out the honeycrisp-examples repo, which might even spark some inspiration for a first project.