Archive | August, 2014

Mac + ARM: one more thing

11 Aug

I’ve written appreciatively of Apple’s vertical integration and also about their Ax architecture, noting that the Imagination graphics processor could be readily boosted within a plugged-in device like the Apple TV to deliver console-quality games. M. Gassée’s MacIntel: The End Is Nigh and Mr. Richman’s Apple and ARM, Sitting In A Tree suggest many potential benefits in supply-chain control, overall cost, and battery life to an even deeper “Mac+ARM” vertical integration strategy which would shift Macs to use ARM processors instead of Intel’s more expensive, power-consuming x86 architecture. There are plenty of arguments to be made against such a shift to ARM, but there is a subtle trend at play to Mac+ARM if you are a programmer that it seems to me could make such a shift a compelling and strongly differentiable position. It’s an argument I haven’t heard from anybody else. It’s about how we improve the performance of this software stuff that’s eating the world.

Game performance as an example

To understand the situation it’s worth taking a quick look at game software as an example. Today a great deal of game software is built against software engines like Source, Unity and Unreal that can detect different graphics hardware and adjust how a game runs and also ease cross-targeting different operating systems (Windows, Mac, PC), mobile devices, and consoles from the same source code and game artwork. The way game software runs atop game engines often allow existing games to “look better for free” or with minimal added effort on newer devices with better graphics processors (GPUs) and to degrade gracefully on older devices. The same software can run at faster frame-rates, so animations are more pleasing and buttery. The same software renders more detailed, higher-resolution content with finer textures and richer special effects in lighting or smoke or fire or fog. The same software may support more game-controlled opponents behaving more realistically, adding complexity and realism to the game. Just by updating the graphics processor, the GPU.

Two reasons this happens so strikingly with games and GPU hardware are (1) there is a natural parallelism to graphics rendering itself, but also importantly (2) the natural way to organize game software and the engines around that natural rendering parallelism is a very approachable concurrent programming model for programmers. Upgrade a GPU with more parallel “core” elements which can render more triangles, more complex textures, more lighting effects and perform more physical simulation every 30th or 60th of a second… the software “logic chunks” of the game software have already been split by programmers into small pieces that these GPU cores and the game engine can parallelize to do more each frame. Newer GPUs can often take the same source artwork and render more detailed characters or draw texturing images with more fidelity on more of the screen or to a bigger screen, or realize a more realistic lighting or fog effect which was already part of the software’s definition. (A similar type of free speedup effect seems afoot with iOS8’s Metal.)

“Free speedup” isn’t a new thing in software; it’s a natural consequence of Moore’s Law, after all. Free speedup happened for gaming and non-gaming software on the PC/Wintel platform through about 2002, almost entirely due to increases in clock speeds (what you see as the Gigahertz, or GHz, of your computer – it’s a measure of how many billions of individual operations like adding, subtracting, or moving data around your processor can do every second). From the birth of PCs until the early 2000’s newer machines arrived with speedier processors as well as more and faster memory and other faster system components.  New PCs automatically improved the performance of existing software like your operating system and the few apps you were using: a browser, Word, Excel, Adobe Photoshop, etc. Each year a new PC felt like a dramatic performance improvement for you as a user, and the easily perceived productivity improvements helped drive rapid PC replacement cycles.

In the early 2000’s we began reaching the “thermal limit” of the processor speed race – we couldn’t make a single processor run at faster speeds without literally melting your laptop. Intel began adding additional processors to a single chip and using techniques such as “hyperthreading” to add even more “virtual” processors. Instead of having a single processor chip running at 4GHz we have a single chip which has 2, 4 or 8 2GHz processors. Having more real and virtual processors made the operating system more responsive and also gave “free speedup” benefits to PCs being used in server environments. Database servers and web servers often run identical software fragments for every user connected to them, every time a request for a web page or a piece of data comes in; this is the same kind of naturally parallelizable software that gets a free ride from having more processors even when the speed of any single processor is not much faster.  Unchanged, plenty of server software can handle more simultaneous users or web-connections or database queries on hardware with more processors, without most high-level programmers having to do a complex rewriting to accommodate the change. Their software was already written for concurrency.

Alas, most desktop software hasn’t gotten as much of a free ride in the last decade from multiple processors or multiple cores. You experience some speed improvements when you buy a new machine due to faster graphics, more memory, faster solid-state disks, or a faster network, but not like you used to in the 90’s. It’s telling that people are more excited these days by a faster internet connection than by a brand new laptop! It turns out that under the covers our apps and their user-interfaces are built using not-very-concurrent software programming techniques, so those extra processors alone don’t make your favorite apps feel much faster. Unless you are a programmer running many tools at once (like me!) or you work with specific high-end media software which has been painstakingly re-written to take advantage of all those processors, not much free speedup for you. Personally I think this lack of perceived speedup may account for some part of the decline in the PC industry – slower replacement cycles and less consumer desire to upgrade because there is no obvious benefit to a new machine.

Here’s a big part of why this happened:

The traditional software programming technique for concurrency, for taking advantage of multiple processors, is called multithreading, where the work of your software is manually broken into smaller pieces which are given to different processors to work on. In school every programmer is taught about threading and suffers through logic tests about semaphores, mutexes and other mind-numbing locking and synchronization techniques. It turns out that although low-level developers of operating system kernels, database engines, web-servers and some games and game-engines can pull off this form of concurrent programming to get the most out of multiple processors, most programmers (the ones building all your apps) are easily confused by threading and either can’t get it to work properly or can’t get it to work well when there are many actual threads running on many actual processors. Programmers don’t do the multithreading work or don’t do it well; most apps don’t feel much faster. Edward E. Lee’s famous 2006 “The Problem With Threads” pointed out that basic threads “discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism” and he suggested new programming language techniques to facilitate concurrent programming, to make it easier for programmers to do well.

As a response to these trends and then in response to the very pronounced performance impact of long-running and slow-network-constrained apps on mobile devices – things like the lag of touchscreens and unresponsiveness of buttons and lists if software is not concurrent enough -various platforms introduced new concurrent programming features around this time in an attempt to push new software into taking advantage of a future with many processors. Some of them seem to have taken Mr. Lee’s insights to heart about simplifying concurrency for programmers. Others did not.

Java introduced java.util.concurrent in 2006 with some useful queuing and “futures” features, but also with many simplistic and mostly not useful wrappers around the traditional complex threading models. As part of its response to sluggish UI compared to iOS, Android followed up in 2010 with the addition of AsyncTask as well as guidelines for programmers to “do more work” in separate threads. In my opinion, Java and Android have not taken Mr. Lee’s insight very deeply to heart. Programmers can use some concurrent programming techniques, but concurrent programming is not the norm.

In mid-2010, Microsoft introduced Parallel Extensions to its .NET platform and runtime, then more “completion-based” and quasi-asynchronous APIs for Windows Phone through 2011 to prevent long-running operations from causing UI stutter and hangs, and finally introduced new await/async keywords in the mid-2012 C# 5.0 update. I think Microsoft folks definitely took the global trends and Mr. Lee’s insights to heart when building PLINQ and TPL, but their lack of platform focus & consistent messaging in the past few years and troubles with Windows Phone market share has meant that their concurrent programming model has not caught on deeply with developers. Also, although many developers love and use C#, the concurrent programming model does not permeate all of the many disjointed Microsoft APIs and so software is not yet being broken up to take strong advantage of the future with many processors.

Apple’s iOS launched with the iPhone in 2007, then to developers as an SDK and platform in early 2008. It arrived with a great deal of natural concurrency over its entire API surface. Not just guidelines for which APIs to use when or an admonishment to add threads for long-running tasks (though it had these aplenty), but also with some fundamental structure (delegates and delayed message sending and asynchronous APIs) which prioritized UI responsiveness and assumed slow network and input/output operations of all kinds. Soon after, in 2009, Grand Central Dispatch (GCD) was introduce: a technique for creating and scheduling multiple queues of work-chunks independent of the number of processors or threads (effectively hiding thread management from programmers). GCD and Blocks, a technique for writing the work-chunks to put into those GCD queues and to create reusable work-chunks more succinctly than the delegate and callback mechanism, made their way to iOS and Mac OS X by early 2011. GCD and blocks have meant that Apple’s own software like iMovie/Final Cut Pro and iWork can actually use all available processor “for free” without overthinking threading and concurrency. Over the past few years blocks and queues have come to permeate Apple’s APIs – we create graphical animations with blocks, we handle data loading and saving with blocks, we handle synchronizing input and UI with blocks, they are everywhere. And they feel pretty natural to developers I’ve talked with. And way less error prone than traditional multithreading. On the Apple platform developers have, for several years now, been actively breaking up their applications into smaller work-chunks and being encouraged by example code and the APIs to re-organizing around a concurrent programming model which is simpler, less error prone, and more scalable to a future with many, many processors.

That’s the lead-up. Here’s the point.

The subtle but major benefit to a Mac+ARM strategy might be the ability to add many many many more processors to Macs and sell them as the fastest computers that consume the least power – not just matching Intel’s GHz performance or number of processors but radically leapfrogging performance and power because only Apple software and its app developers are positioned to take advantage of so many processors due to how this long-game of shifting to a simpler concurrent programming model has been playing out. And only the Ax / ARM architecture can fit 16 or 32 cores into a smaller power profile than the ~8 core top-of-the-line mobile Intel processors, or 48 or 64 cores into the power profile of the top-of-the-line desktop and server Intel processors. Mac laptops could be lighter, run cooler, last longer on the same battery, and feel dramatically faster running concurrency-aware apps than any Intel-based laptop. Mac Desktop systems, already targeting high-end developers and media professionals who use concurrency-capable software – could be smaller and use much less energy, and would also feel dramatically faster.

Fighting this performance battle would be very difficult for Intel and PC OEMs in the laptop and tablet space given their continuing struggles around price and power consumption – it’s unlikely they can match the number of processors and power consumption combination for 3-5 years, and only then if they were under pressure.  It would also be an uphill battle for Microsoft and PC OEMs without competitive Intel parts. Although they might try to shift to ARM and a provider like Qualcomm might create a 64-bit highly multi-processor ARM part they simply lack the software. Microsoft’s operating system, web-server, and database server are extremely multi-processor capable, but as yet not fully ported to ARM. In addition their APIs are not only in a disjointed state but also not solidly founded in concurrency – legacy apps, originally their strong advantage, become a disadvantage, feeling old and sluggish and consuming more power. Nor does Microsoft have the strong developer following and loyalty they once had due to their ongoing product, platform and API disarray and consumer market share woes. Microsoft’s response to ARM in cloud and backend enterprise apps is pretty straightforward; it’s hard to picture how they could react, or how quickly, to ARM in the consumer space.

Will Mac+ARM happen? I really don’t know, this is just my thoughts about advantages to Mac+ARM that I hadn’t seen anybody else notice. It’s worth thinking and talking about.