While watching Apple’s Arm laptop event, I couldn’t help but wonder: where is all that performance coming from? Two points made during the event (not obvious ones, IMHO) go a long way toward explaining it.

But first, let’s mention the obvious factors, which plenty of analysts will anchor on and which are valid:

  • More transistors is better: the 5nm process fits more transistors on a chip. More transistors means more room for cores, caches, and execution units, which generally means more instructions processed per second.
  • Small means low power consumption: the 5nm process allows for lower power consumption, because at that size a transistor can operate at 0.7V - 1.2V[1] (instead of 1V - 1.35V)[2]. With billions of transistors on a chip, a small voltage difference adds up to a big difference in battery life.
  • Specialized hardware gives an edge: accelerators (like the Neural Engine) let the Mac offload specific tasks to dedicated hardware, which leaves the M1’s CPU cores free for less specialized work (like compiling your code in Xcode 😉)
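
To put a rough number on the voltage point in the second bullet, here is a minimal sketch using the usual first-order CMOS model, in which dynamic switching power scales with the square of the supply voltage (P ≈ α·C·V²·f). The model and the helper name relative_dynamic_power are my assumptions, not something from the keynote, and the sketch pretends capacitance and clock frequency stay the same, which real chips will not do.

```cpp
#include <cassert>

// Hypothetical helper for illustration: ratio of dynamic switching power at
// two supply voltages, assuming capacitance, activity, and clock frequency
// are identical (first-order model: P is proportional to V squared).
double relative_dynamic_power(double v_new, double v_old) {
    assert(v_old > 0.0);  // a zero or negative reference voltage makes no sense
    const double ratio = v_new / v_old;
    return ratio * ratio;
}
```

Comparing the low ends of the two ranges above, relative_dynamic_power(0.7, 1.0) comes out at about 0.49, so under these assumptions each switching transistor would burn roughly half the dynamic power.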

Unified Memory Architecture (or Making Friends with Physics)

When I was transitioning careers to become a software developer, I came across a post about the latency numbers of common computer components (Peter Norvig has a post about it). It occurred to me that all these numbers were roughly proportional to the distance between the CPU and the location of the data: the further the electrical signal needs to travel, the more time it takes. This is a massive oversimplification, but it illustrates the point well.

There are other details to take into account as well. The one that comes to mind is that keeping an electrical signal inside the SoC, instead of sending it across the printed circuit board, avoids a lot of the signal conditioning otherwise needed to maintain signal integrity. Apple brought all their chips closer together, which reduced the travel time for every critical signal and reduced the number of medium transitions (from an integrated circuit to a printed circuit board and back to an integrated circuit), as shown in the figure below. This results in faster data access for the CPU and the GPU, which translates to a faster experience for the end user.

Showing the reduced distance of signal traveling for a logic board vs. a SoC :-)
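
To get a feel for why distance matters at these speeds, here is a back-of-the-envelope sketch of how far a signal can travel during a single clock cycle. The 3.2 GHz clock and the velocity factor of 0.5 (signals in copper traces moving at roughly half the speed of light) are my assumptions for illustration, and cm_per_cycle is a made-up helper name.

```cpp
#include <cassert>

// Speed of light in vacuum, in metres per second (exact by definition).
const double kSpeedOfLight = 299792458.0;

// Hypothetical helper for illustration: distance in centimetres that a signal
// covers during one clock cycle.
// velocity_factor is the fraction of the speed of light at which the signal
// propagates (1.0 = vacuum; about 0.5 is an assumed ballpark for board traces).
double cm_per_cycle(double clock_hz, double velocity_factor) {
    assert(clock_hz > 0.0);
    const double metres = kSpeedOfLight * velocity_factor / clock_hz;
    return metres * 100.0;
}
```

With those assumed numbers, cm_per_cycle(3.2e9, 0.5) is a bit under 5 cm. In other words, a round trip across a laptop logic board can cost more than one clock cycle in travel time alone, before any signal conditioning is counted.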


Avoid Unnecessary Copies (or the gospel by a C++ performance engineer)

In my year of learning and using C++, I’ve learned that (unnecessary) copies are bad, and Craig Federighi also knows this. During the presentation, he says:

“We built macOS on Apple Silicon to use the same data formats for things like video decode, GPU and display, so there’s no need for expensive copying or translation”

Why are copies bad? It has to do with what happens when an object is copied. Here’s a simple way to look at the process. Whenever a program issues a copy, the CPU needs to:

  • read the memory location of the data,
  • then reserve memory where the data will be copied
  • then copy the data
  • then decide what to do with the original data
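
The steps above can be sketched as a hand-written deep copy. Buffer and deep_copy are hypothetical names I made up for illustration; real code would normally just use std::vector, which does the same work behind the scenes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical type for illustration: a heap-allocated block of bytes.
struct Buffer {
    std::unique_ptr<unsigned char[]> data;
    std::size_t size = 0;
};

Buffer deep_copy(const Buffer& source) {
    assert(source.size == 0 || source.data != nullptr);

    Buffer copy;
    // Step 1: read where the source data lives and how big it is.
    const unsigned char* first = source.data.get();
    copy.size = source.size;
    // Step 2: reserve new memory for the copy.
    copy.data = std::make_unique<unsigned char[]>(source.size);
    // Step 3: copy the data, byte by byte.
    std::copy(first, first + source.size, copy.data.get());
    // Step 4 is up to the caller: keep the source, or release it.
    return copy;
}
```

The cost of steps 2 and 3 grows with the size of the data, which is why copying something like a video frame on every hand-off gets expensive.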

When the source data is kept, the CPU is done. However, when the source data is no longer needed, there’s the additional step of de-allocating its memory so that other parts of the program can use it. Compilers sometimes catch this: whenever it is obvious that a copy is unnecessary, they perform an optimization called copy elision. However, while compilers are awesome, they can’t undo a programmer’s lack of optimization knowledge. One solution the C++ community found useful is to help the compiler by specifying when copies are not needed. Since C++11, programmers can use std::move to tell the compiler that an object’s contents may be transferred instead of deep-copied, which avoids creating a duplicate only to destroy the original. This saves CPU cycles and improves performance.
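
Here is a minimal sketch of the difference between copying and moving, from my limited C++ experience. Frame, copy_frame, take_frame, and move_reuses_buffer are hypothetical names for illustration; the only standard pieces are std::vector and std::move.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical type for illustration: a video frame that owns its pixels.
struct Frame {
    std::vector<unsigned char> pixels;
};

// Copying: allocates a second buffer and duplicates every byte.
Frame copy_frame(const Frame& source) {
    return source;
}

// Moving: the new Frame takes over the source's existing buffer.
// The source is left in a valid but unspecified state.
Frame take_frame(Frame&& source) {
    return std::move(source);
}

// Returns true when the move handed over the original buffer
// instead of duplicating it.
bool move_reuses_buffer() {
    Frame original{std::vector<unsigned char>(1024, 7)};
    const unsigned char* address = original.pixels.data();

    Frame copied = copy_frame(original);            // new allocation, bytes copied
    assert(copied.pixels.data() != address);

    Frame moved = take_frame(std::move(original));  // no new allocation
    return moved.pixels.data() == address;
}
```

On the standard library implementations I know of, the moved frame ends up pointing at the very same memory the original used, so the cost of the hand-off no longer depends on how many pixels there are.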

Judging from the Apple M1 MacBook presentation, Apple makes the above behavior the default: instead of having the programmer choose between copying and reusing data, the data is simply reused. This makes the efficient behavior the default and saves programs from being slowed down by unnecessary copies.

Things I May Have Missed (and Things I Don’t Know About)

The above is a simplified view of the programming model of Apple’s new M1 chip. Since I haven’t programmed for one, it is my best guess at how the items mentioned in the keynote would translate into hardware and software. It’s possible that Apple simplified the CPU vs. GPU programming paradigm by hiding calls to GPU instructions behind the macOS APIs. If all the hardware is on the same chip, the programmer doesn’t need to know which sub-chip (CPU, GPU, Neural Engine, etc.) is doing the job, just that the chip (M1) is doing it. From there, it is up to compiler optimization to take over and make the best decision as to which sub-chip should process the instruction.

PS: if you know the nitty-gritty details of this, hit me up @kwiromeo on Twitter. I would love to learn more about this.

[1] AnandTech - Early TSMC 5nm Test Chip Yields 80%, HVM Coming in H1 2020

[2] AnandTech - What Products Use Intel 10nm? SuperFin and 10++ Demystified