Mac + ARM: one more thing

11 Aug

I’ve written appreciatively of Apple’s vertical integration and also about their Ax architecture, noting that the Imagination graphics processor could be readily boosted within a plugged-in device like the Apple TV to deliver console-quality games. M. Gassée’s MacIntel: The End Is Nigh and Mr. Richman’s Apple and ARM, Sitting In A Tree suggest many potential benefits in supply-chain control, overall cost, and battery life to an even deeper “Mac+ARM” vertical integration strategy, one which would shift Macs to use ARM processors instead of Intel’s more expensive, more power-hungry x86 architecture. There are plenty of arguments to be made against such a shift to ARM, but if you are a programmer there is a subtle trend at play that, it seems to me, could make Mac+ARM a compelling and strongly differentiated position. It’s an argument I haven’t heard from anybody else. It’s about how we improve the performance of this software stuff that’s eating the world.

Game performance as an example

To understand the situation it’s worth taking a quick look at game software as an example. Today a great deal of game software is built against engines like Source, Unity, and Unreal that can detect different graphics hardware and adjust how a game runs, and that also ease cross-targeting different operating systems (Windows, Mac), mobile devices, and consoles from the same source code and game artwork. The way game software runs atop game engines often allows existing games to “look better for free”, or with minimal added effort, on newer devices with better graphics processors (GPUs) and to degrade gracefully on older devices. The same software can run at faster frame rates, so animations are more pleasing and buttery. The same software renders more detailed, higher-resolution content with finer textures and richer special effects in lighting or smoke or fire or fog. The same software may support more game-controlled opponents behaving more realistically, adding complexity and realism to the game. Just by updating the graphics processor, the GPU.

Two reasons this happens so strikingly with games and GPU hardware are that (1) there is a natural parallelism to graphics rendering itself, and, just as importantly, (2) the natural way to organize game software and its engines around that rendering parallelism is a very approachable concurrent programming model for programmers. Upgrade a GPU with more parallel “core” elements which can render more triangles, more complex textures, and more lighting effects and perform more physical simulation every 30th or 60th of a second, and the “logic chunks” of the game software have already been split by programmers into small pieces that these GPU cores and the game engine can parallelize to do more each frame. Newer GPUs can often take the same source artwork and render more detailed characters, draw texturing images with more fidelity on more of the screen or to a bigger screen, or realize a more realistic lighting or fog effect which was already part of the software’s definition. (A similar type of free-speedup effect seems afoot with iOS 8’s Metal.)
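
To make the “work chunks” idea concrete, here is a minimal sketch – written with today’s Swift Dispatch API, not borrowed from any real engine – of per-frame work that has already been split into independent pieces; the Particle type and the update math are invented for illustration. Because each chunk is independent, the identical code simply keeps more cores busy on a machine that has more of them.

```swift
import Foundation

// A made-up per-frame workload: each particle updates independently of the
// others, so the iterations below are natural "work chunks".
struct Particle {
    var position: Double
    var velocity: Double
}

var particles = (0..<100_000).map { _ in
    Particle(position: .random(in: 0...1), velocity: .random(in: -1...1))
}
let dt = 1.0 / 60.0 // one frame at 60 fps

// Fan the independent iterations out across however many cores exist; the
// game-side code neither knows nor cares how many that is.
particles.withUnsafeMutableBufferPointer { buffer in
    DispatchQueue.concurrentPerform(iterations: buffer.count) { i in
        buffer[i].position += buffer[i].velocity * dt
    }
}
```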

“Free speedup” isn’t a new thing in software; it’s a natural consequence of Moore’s Law, after all. Free speedup happened for gaming and non-gaming software on the PC/Wintel platform through about 2002, almost entirely due to increases in clock speeds (what you see as the gigahertz, or GHz, of your computer – a measure of how many billions of individual operations, like adding, subtracting, or moving data around, your processor can do every second). From the birth of PCs until the early 2000s, newer machines arrived with speedier processors as well as more and faster memory and other faster system components. New PCs automatically improved the performance of existing software like your operating system and the few apps you were using: a browser, Word, Excel, Adobe Photoshop, etc. Each year a new PC felt like a dramatic performance improvement for you as a user, and the easily perceived productivity improvements helped drive rapid PC replacement cycles.

In the early 2000s we began reaching the “thermal limit” of the processor speed race – we couldn’t make a single processor run at faster speeds without literally melting your laptop. Intel began adding additional processors to a single chip and using techniques such as “hyperthreading” to add even more “virtual” processors. Instead of a single processor chip running at 4GHz, we have a single chip with 2, 4, or 8 processors running at 2GHz each. Having more real and virtual processors made the operating system more responsive and also gave “free speedup” benefits to PCs being used in server environments. Database servers and web servers often run identical software fragments for every user connected to them, every time a request for a web page or a piece of data comes in; this is the same kind of naturally parallelizable software that gets a free ride from having more processors even when the speed of any single processor is not much faster. Unchanged, plenty of server software can handle more simultaneous users or web connections or database queries on hardware with more processors, without most high-level programmers having to do a complex rewrite to accommodate the change. Their software was already written for concurrency.
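
As a toy illustration of that free ride – not anything from a real server – here is a Swift sketch where each simulated request gets its own worker thread. The handle(request:) function is invented; the point is only that identical, independent fragments of work spread themselves across however many processors the machine has.

```swift
import Foundation

// Each "request" below is an independent piece of work, so giving every one
// its own thread lets the OS spread them across all available processors.
// handle(request:) is a made-up stand-in for real request handling.
func handle(request id: Int) -> String {
    return "response for request \(id)"
}

// Pretend 100 requests arrived at once: one worker per request.
let workers = (0..<100).map { id in
    Thread { _ = handle(request: id) }
}
workers.forEach { $0.start() }

// Crude join for this command-line sketch: wait until every worker finishes.
while workers.contains(where: { !$0.isFinished }) {
    Thread.sleep(forTimeInterval: 0.01)
}
```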

Alas, most desktop software hasn’t gotten as much of a free ride in the last decade from multiple processors or multiple cores. You experience some speed improvements when you buy a new machine due to faster graphics, more memory, faster solid-state disks, or a faster network, but not like you used to in the ’90s. It’s telling that people are more excited these days by a faster internet connection than by a brand-new laptop! It turns out that under the covers our apps and their user interfaces are built using not-very-concurrent programming techniques, so those extra processors alone don’t make your favorite apps feel much faster. Unless you are a programmer running many tools at once (like me!) or you work with specific high-end media software which has been painstakingly rewritten to take advantage of all those processors, there’s not much free speedup for you. Personally I think this lack of perceived speedup may account for some part of the decline in the PC industry – slower replacement cycles and less consumer desire to upgrade because there is no obvious benefit to a new machine.

Here’s a big part of why this happened:

The traditional software programming technique for concurrency – for taking advantage of multiple processors – is called multithreading, where the work of your software is manually broken into smaller pieces which are given to different processors to work on. In school every programmer is taught about threading and suffers through logic tests about semaphores, mutexes, and other mind-numbing locking and synchronization techniques. It turns out that although low-level developers of operating system kernels, database engines, web servers, and some games and game engines can pull off this form of concurrent programming to get the most out of multiple processors, most programmers (the ones building all your apps) are easily confused by threading and either can’t get it to work properly or can’t get it to work well when there are many actual threads running on many actual processors. Programmers don’t do the multithreading work, or don’t do it well, and so most apps don’t feel much faster. Edward A. Lee’s famous 2006 paper “The Problem with Threads” pointed out that basic threads “discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism”, and he suggested new programming-language techniques to facilitate concurrent programming, to make it easier for programmers to do well.
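
For readers who haven’t had to write it, here is a minimal Swift sketch (not from the post) of that traditional style: shared state guarded by a lock, a handful of worker threads, and a semaphore to wait for them. The Counter type is just a stand-in for whatever state an app shares, and even this tiny example shows why the style is fragile.

```swift
import Foundation

// Classic manual multithreading: shared state, an explicit lock, and
// hand-rolled synchronization.
final class Counter {
    private var value = 0
    private let lock = NSLock()

    // Every access must take the lock. Forget it in even one code path and
    // you get silent data corruption; take two locks in different orders in
    // two places and you can deadlock.
    func increment() {
        lock.lock()
        value += 1
        lock.unlock()
    }

    func current() -> Int {
        lock.lock()
        defer { lock.unlock() }
        return value
    }
}

let counter = Counter()
let done = DispatchSemaphore(value: 0)

// Four worker threads hammer the shared counter.
let threads = (0..<4).map { _ in
    Thread {
        for _ in 0..<100_000 { counter.increment() }
        done.signal()
    }
}
threads.forEach { $0.start() }

// Wait for all four workers to signal completion.
(0..<4).forEach { _ in done.wait() }
print(counter.current()) // 400000 – but only because every path locked correctly
```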

As a response to these trends, and then in response to the very pronounced performance impact of long-running and slow-network-constrained apps on mobile devices – things like laggy touchscreens and unresponsive buttons and lists when software is not concurrent enough – various platforms introduced new concurrent programming features around this time in an attempt to push new software into taking advantage of a future with many processors. Some of them seem to have taken Mr. Lee’s insights about simplifying concurrency for programmers to heart. Others did not.

Java introduced java.util.concurrent (with Java 5, in 2004) with some useful queuing and “futures” features, but also with many simplistic and mostly not-very-useful wrappers around the traditional, complex threading model. As part of its response to sluggish UI compared to iOS, Android followed up in 2010 with the addition of AsyncTask as well as guidelines for programmers to “do more work” in separate threads. In my opinion, Java and Android have not taken Mr. Lee’s insight very deeply to heart: programmers can use some concurrent programming techniques, but concurrent programming is not the norm.

In mid-2010, Microsoft introduced Parallel Extensions to its .NET platform and runtime, then more “completion-based” and quasi-asynchronous APIs for Windows Phone through 2011 to prevent long-running operations from causing UI stutter and hangs, and finally new await/async keywords in the mid-2012 C# 5.0 update. I think Microsoft folks definitely took the global trends and Mr. Lee’s insights to heart when building PLINQ and TPL, but their lack of platform focus & consistent messaging in the past few years, plus their troubles with Windows Phone market share, have meant that their concurrent programming model has not caught on deeply with developers. Also, although many developers love and use C#, the concurrent programming model does not permeate the many disjointed Microsoft APIs, so software is not yet being broken up to take strong advantage of a future with many processors.
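
To make that contrast concrete, here is a sketch of the two styles in Swift rather than C# (Swift gained its own async/await keywords years after this post was written); the loadProfile functions and the URL are invented for illustration.

```swift
import Foundation

// Style 1: completion-based. The long-running network call is handed a block
// to invoke when it finishes, so the UI thread is never blocked waiting.
func loadProfile(completion: @escaping (Data?) -> Void) {
    let url = URL(string: "https://example.com/profile")!   // hypothetical endpoint
    URLSession.shared.dataTask(with: url) { data, _, _ in
        completion(data)
    }.resume()
}

// Style 2: async/await. Same non-blocking behavior, but it reads like plain
// sequential code – which is exactly what the keywords buy you.
func loadProfile() async throws -> Data {
    let url = URL(string: "https://example.com/profile")!   // hypothetical endpoint
    let (data, _) = try await URLSession.shared.data(from: url)
    return data
}
// Usage from an async context would look like: let data = try await loadProfile()
```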

Apple’s iOS launched with the iPhone in 2007, then came to developers as an SDK and platform in early 2008. It arrived with a great deal of natural concurrency over its entire API surface: not just guidelines for which APIs to use when, or admonishments to add threads for long-running tasks (though it had these aplenty), but also some fundamental structure (delegates, delayed message sending, and asynchronous APIs) which prioritized UI responsiveness and assumed slow network and input/output operations of all kinds. Soon after, in 2009, Grand Central Dispatch (GCD) was introduced: a technique for creating and scheduling multiple queues of work-chunks independent of the number of processors or threads (effectively hiding thread management from programmers). GCD and Blocks – a technique for writing the work-chunks to put into those GCD queues and for creating reusable work-chunks more succinctly than the delegate and callback mechanism – made their way to iOS and Mac OS X by early 2011.

GCD and blocks have meant that Apple’s own software like iMovie/Final Cut Pro and iWork can actually use all available processors “for free” without overthinking threading and concurrency. Over the past few years blocks and queues have come to permeate Apple’s APIs – we create graphical animations with blocks, we handle data loading and saving with blocks, we handle synchronizing input and UI with blocks; they are everywhere. And they feel pretty natural to the developers I’ve talked with, and way less error prone than traditional multithreading. On the Apple platform developers have, for several years now, been actively breaking up their applications into smaller work-chunks, encouraged by example code and the APIs to re-organize around a concurrent programming model which is simpler, less error prone, and more scalable to a future with many, many processors.
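
Here is what that looks like in practice – a minimal sketch written with today’s Swift Dispatch API (the 2009-era C API spelled this dispatch_async); renderThumbnail and imageData are placeholders for real work.

```swift
import Foundation

// Stand-ins for real data and real, slow image work.
func renderThumbnail(from data: Data) -> Data { return data }
let imageData = Data()

// Hand a block (a "work chunk") to a background queue; GCD decides which
// thread and which core actually runs it.
DispatchQueue.global(qos: .userInitiated).async {
    let thumbnail = renderThumbnail(from: imageData)

    // Hop back to the main queue with another block before touching the UI.
    DispatchQueue.main.async {
        print("thumbnail ready: \(thumbnail.count) bytes")
        exit(0) // end this command-line sketch once the main-queue block has run
    }
}

// An app's main run loop does this for you; a command-line sketch needs it so
// blocks submitted to the main queue actually execute.
dispatchMain()
```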

That’s the lead-up. Here’s the point.

The subtle but major benefit to a Mac+ARM strategy might be the ability to add many, many more processors to Macs and sell them as the fastest computers that consume the least power – not just matching Intel’s GHz performance or number of processors, but radically leapfrogging performance and power, because only Apple’s software and its app developers are positioned to take advantage of so many processors, thanks to how this long game of shifting to a simpler concurrent programming model has been playing out. And only the Ax / ARM architecture can fit 16 or 32 cores into a smaller power profile than the ~8-core top-of-the-line mobile Intel processors, or 48 or 64 cores into the power profile of the top-of-the-line desktop and server Intel processors. Mac laptops could be lighter, run cooler, last longer on the same battery, and feel dramatically faster running concurrency-aware apps than any Intel-based laptop. Mac desktop systems – already targeting high-end developers and media professionals who use concurrency-capable software – could be smaller, use much less energy, and also feel dramatically faster.

Fighting this performance battle would be very difficult for Intel and PC OEMs in the laptop and tablet space given their continuing struggles around price and power consumption – it’s unlikely they could match that combination of processor count and power consumption for 3–5 years, and then only under pressure. It would also be an uphill battle for Microsoft and PC OEMs without competitive Intel parts. Although they might try to shift to ARM, and a provider like Qualcomm might create a 64-bit, highly multi-processor ARM part, they simply lack the software. Microsoft’s operating system, web server, and database server are extremely multi-processor capable, but as yet not fully ported to ARM. In addition, their APIs are not only in a disjointed state but also not solidly founded in concurrency – legacy apps, originally their strong advantage, become a disadvantage, feeling old and sluggish and consuming more power. Nor does Microsoft have the strong developer following and loyalty they once had, due to their ongoing product, platform, and API disarray and their consumer market-share woes. Microsoft’s response to ARM in cloud and backend enterprise apps is pretty straightforward; it’s harder to picture how they could react, or how quickly, to ARM in the consumer space.

Will Mac+ARM happen? I really don’t know – these are just my thoughts about advantages to Mac+ARM that I hadn’t seen anybody else notice. It’s worth thinking and talking about.

4 Responses to “Mac + ARM: one more thing”

  1. dr.one August 11, 2014 at 9:49 pm #

    When looking at Apple, always look at how the price umbrella fits: if you have a Mac
    with a $40 processor as opposed to a $300 Intel 15W-TDP part, then all the
    savings go to Apple’s pocket, not the consumer, unless Apple wants to compete with Chromebooks at $300 – which is why they do three-year leases for education customers.

    There is no way Apple will sell Macs at a lower price than the iPad, and there is no way the iPad
    is going lower than $100.

    If you ask the functional programming guys, Blocks and GCD are child’s play –
    that is why they’re going so hog wild over Swift.
    GCD is just a fancy thread pool at the kernel level; hasn’t Microsoft had that for 10 years?
    Apple extended the C language to add Blocks, but the standards committee is blocking it
    from becoming standard. I wonder who is objecting.

    iOS 8 Core Image is now using the GPU to render images that in the past just used the CPU.
    There is your multi-core processor: performance went from 17 seconds to 1 second
    on a 100 MB image.

    Don’t forget that Thunderbolt, USB, and PCIx all come from Intel chips.

    The real issue is that people are happy keeping PCs for 5+ years and not upgrading OSs.
    This is the reason the cloud is so sexy to Apple: even a little trouble with privacy is not going
    to deter iCloud. Apple will force people to iCloud whether they like it or not.

    On the Mac side I see 10-bit processing as the next challenge (color, JPEG, H.265, monitors).

    • natbro August 12, 2014 at 3:53 am #

      Agree that the laptop/desktop prices likely wouldn’t come down the full difference when a $300 Intel part is replaced with $40 or $80 worth of Ax CPUs/GPUs. If it differentiated Macs to make them feel faster and last longer, it would actually be something they would raise the price for.
      I hear your point about GCD being a thread-pooler and Blocks being less functional (ha) than “real” functional programming techniques – I’m a functional programming wonk, too. What’s different about GCD+Blocks is that these were introduced to Apple’s entire developer community and into the API at many levels. They have gotten picked up and used and understood for day-to-day tasks by lots of developers. It has been a very practical introduction for many programmers to concurrency. Many more day-to-day programmers are now ready for more functional programming and for Swift (which is pretty damn ideal for the many-processor future) than if they had suddenly rewritten their APIs around Haskell, Clojure, Erlang, or F#.
      It’s a good point that Thunderbolt, USB, PCIx, and others come from Intel, and it’s worth noting that in most Intel/OEM deals the companion chipsets get a discount when purchased with the CPU.
      I’m not saying everything fits perfectly, but that this is something to think about.

      • Dr.one August 12, 2014 at 8:19 pm #

        Another thing to think about regarding 40 cores:
        it is quite wasteful in a 15W-TDP laptop.
        ARM will have to do exactly what Intel has done, and
        I think the efficiency of ARM goes away as well.

        Another example is HMC: it promises 15x performance compared to DDR3,
        but the price is that a single 2 MB module is 10 W TDP.
        So now your memory requires a heat-sink and fan to cool it.

        The best bang for the buck is actually speeding up RAM and SSD to match the CPU so
        all the waiting and caching will go away.
        Both RAM and SSD are going through a painful transition to 3D just for another incremental
        shift.

        I am sure you know that OpenCL is just an even fancier version of GCD,
        hooked up to an LLVM JIT compiler. But it hasn’t been revolutionary, because
        algorithms have to be highly tuned and even a little copying of data between host and node halves performance.
        Apple not open-sourcing the runtime means everyone is on their own, and
        you would think that multi-core CPUs would use it, but so far nothing has
        come up except the next version of Xeon Phi, which basically matches GPUs in capability
        and performance.

        The thing about functional programming that I am negative on is its promise
        to relieve complexity and produce fewer bugs just because you are using pure mathematics
        and read-only data in JSON being passed all over the place.

        The computer industry is very fickle: neither success nor failure in projects gets any kind of retrospective, knowledge is not passed down, and
        the new generation forgets the past reasons for success. NeXTStep and Obj-C
        being the only really successful OOP API should be one such lesson. Not many remember why
        C++, Java, and .NET failed to come up with a good GUI API. Some of them copied some
        idioms and functionality in the hope that it would all turn into the right solution;
        if not, move on to the next project.

Trackbacks/Pingbacks

  1. Apple TV + games: redux | iLike.code - March 23, 2015

    […] UPDATE #2: 3/23/2015 – if you want to compare Apple products and consoles based on GPU performance (great article) then it’s also worth thinking about price. The CPU ($100) and RAM ($85-110) in Sony’s Playstation 4 and Microsoft’s XBox One are a $190-$210 contributor to the cost of building those consoles, with an additional $20 for a high-wattage power supply to drive them and a supporting $30-40 HDD for storage. Contrast this with Apple’s $22 CPU+GPU, $8-16 for 2-4GB of RAM, and $20-$40 for SSD storage (albeit much less storage). It’s pretty great to utterly own your IP and supply chain. […]
