Andy Matuschak re: Vision Pro

Created: Jun 14, 2023 1:38 AM

The hardware seems faintly unbelievable—a computer as powerful as Apple’s current mid-tier laptops (M2), plus a dizzying sensor/camera array with dedicated co-processor, plus displays with 23M 6µm pixels (my phone: 3M 55µm pixels; the PSVR2 is 32µm) and associated optics, all in roughly a mobile phone envelope.

But that kind of vertical integration is classic Apple. I’m mainly interested in the user interface and the computing paradigm. What does Apple imagine we’ll be doing with these devices, and how will we do it?

Paradigm

Given how ambitious the hardware package is, the software paradigm is surprisingly conservative. visionOS is organized around “apps”, which are conceptually defined just like apps on iOS:

  • to perform an action, you launch an app which affords that activity; no attempt is made to move towards finer-grained “activity-oriented computing”
  • apps present interface content, which is defined on a per-app basis; app interfaces cannot meaningfully interact, with narrow carve-outs for channels like drag-and-drop
  • (inferred) apps act as containers for files and documents; movement between those containers is constrained

I was surprised to see that the interface paradigm is classic WIMP. At a high level, the pitch is not that this is a new kind of dynamic medium, but rather that Vision Pro gives you a way to use (roughly) 2D iPad app UIs on a very large, spatial display. Those apps are organized around familiar UIKit controls and layouts. We see navigation controllers, split views, buttons, text fields, scroll views, etc., all arranged on a 2D surface (modulo some 3D lighting and eye tracking effects). Windows, icons, menus, and even a pointer (more on that later).

These 2D surfaces are in turn arranged in a “Shared Space”, which is roughly the new window manager. My impression is that the shared space is arranged cylindrically around the user (moving with them?), with per-window depth controls, but I’m not yet sure of that. An app can also transition into “Full Space”, which is roughly like “full screening” an app on today’s OSes.

In either mode, an app can create a “volume” instead of a “window”. We don’t see much of this yet: the Breathe app spreads into the room; panoramas and 3D photography are displayed spatially; a CAD app displays a model in space; an educational app displays a 3D heart. visionOS’s native interface primitives don’t make use of a volumetric paradigm, so anything we see here will be app/domain-specific (for now).
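
To make the window/volume/space distinction concrete, here’s roughly how an app declares these surfaces in the visionOS SwiftUI scene API as presented at WWDC. This is only a sketch of mine: the placeholder views and sizes are invented, and the details may shift across the betas.

```swift
import SwiftUI

@main
struct SpatialSketchApp: App {
    var body: some Scene {
        // A conventional 2D window: the same SwiftUI content an iPad app would
        // present, placed by the system in the Shared Space.
        WindowGroup(id: "notes") {
            Text("An ordinary 2D window")
        }

        // A bounded "volume": 3D content that coexists with other apps' windows
        // in the Shared Space. A real app would put RealityKit content here.
        WindowGroup(id: "model") {
            Text("Placeholder for 3D content")
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.5, height: 0.5, depth: 0.5, in: .meters)

        // A Full Space: opening this hides other apps' content, roughly like
        // "full screening" an app on today's OSes.
        ImmersiveSpace(id: "immersive") {
            Text("Placeholder for an immersive scene")
        }
    }
}
```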

Input

For me, the most interesting part of visionOS is the input side of the interaction model. The core operation is still pointing. On NLS and its descendants, you point by indirect manipulation: moving a cursor by translating a mouse or swiping a trackpad, and clicking. On the iPhone and its descendants, you point by pointing. Direct manipulation became much more direct, though less precise; and we lost “hover” interactions. On Vision Pro and its descendants, you point by looking, then “clicking” your bare fingers, held in your lap.

Sure, I’ve seen this in plenty of academic papers, but it’s quite wild to see it so central to a production device. There are other VR/AR devices which feature eye tracking, but (AFAIK) all still ship handheld controllers or support gestural pointing. Apple’s all in on foveation as the core of their input paradigm, and it allows them to produce a controller-free default experience. It reminds me of Steve’s jab at styluses at the announcement of the iPhone.

My experiences with hand-tracking-based VR interfaces have been uniformly unpleasant. Without tactile feedback, the experience feels mushy and unreliable. And it’s uncomfortable after tens of seconds (see also Bret’s comments). The visionOS interaction model dramatically shifts the role of the hands. They’re for basically-discrete gestures now: actuate, flick. Hands no longer position the pointer; eyes do. Hands are the buttons and scroll wheel on the mouse. Based on my experiences with hand-tracking systems, this is a much more plausible vision for the use of hands, at least until we get great haptic gloves or similar.

But it does put an enormous amount of pressure on the eye tracking. As far as I can tell so far, the role of precise 2D control has been shifted to the eyes. The thing which really sold the iPhone as an interface concept was Bas and Imran’s ultra-direct, ultra-precise 2D scrolling with inertia. How will scrolling feel with such indirect interaction? More importantly, how will fine control feel—sliders, scrubbers, cursor positioning? One answer is that such designs may rely on “direct touch”, akin to existing VR systems’ hand tracking interactions. Apple suggests that “up close inspection or object manipulation” should be done with this paradigm. Maybe the experience will be better than on other VR headsets I’ve tried because sensor fusion with the eye tracker can produce better accuracy?

By relegating hands to a discrete role in the common case, Apple reinforces the 2D conception of the visionOS interface paradigm. You point with your eyes and “click” with your hands. One nice benefit of this change is that we recover a natural “hover” interaction. But moving incrementally from here to a more ambitious “native 3D” interface paradigm seems like it would be quite difficult.
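
In SwiftUI terms, this model is mostly invisible to app code: standard controls respond to the gaze highlight and the indirect pinch automatically, and custom views opt in with the existing hover and tap APIs. A minimal sketch of my understanding of the defaults, with an invented counter:

```swift
import SwiftUI

struct GazeAndPinchSketch: View {
    @State private var pinchCount = 0

    var body: some View {
        VStack(spacing: 24) {
            // A standard control: gaze highlights it; a pinch "clicks" it.
            Button("Actuate") { pinchCount += 1 }

            // A custom view: the tap gesture fires on look-and-pinch (or on
            // direct touch up close), and the hover effect restores the
            // "hover" state that eye tracking provides.
            Text("Pinches so far: \(pinchCount)")
                .padding()
                .contentShape(Rectangle())
                .hoverEffect(.highlight)
                .onTapGesture { pinchCount += 1 }
        }
        .padding()
    }
}
```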

For text, Apple imagines that people will use speech for quick input and a Bluetooth keyboard for long input sessions. They’ll also offer a virtual keyboard you can type on with your fingertips. My experience with this kind of virtual keyboard has been uniformly bad—because you don’t have feedback, you have to look at the keyboard while you type; accuracy feels effortful; it’s quickly tiring. I’d be surprised (but very interested) if Apple has solved these problems.

Strategy

Note how different Apple’s strategy is from the vision in Meta’s and Magic Leap’s pitches. These companies point towards radically different visions of computing, in which interfaces are primarily three-dimensional and intrinsically spatial. Operations have places; the desired paradigm is more object-oriented (“things” in the “meta-verse”) than app-oriented. Likewise, there are decades of UIST/etc papers/demos showing more radical “spatial-native” UI paradigms. All this is very interesting, and there’s lots of reason to find it compelling, but of course it doesn’t exist, and a present-day Quest / HoloLens buyer can’t cash in that vision in any particularly meaningful way. Those buyers will mostly run single-app, “full-screen” experiences; mostly games.

But, per Apple’s marketing, this isn’t a virtual reality device, or an augmented reality device, or a mixed reality device. It’s a “spatial computing” device. What is spatial computing for? Apple’s answer, right now, seems to be that it’s primarily for giving you lots of space. This is a practical device you can use today to do all the things you already do on your iPad, but better in some ways, because you won’t be confined to “a tiny black rectangle”. You’ll use all the apps you already use. You don’t have to wait for developers to adapt them. This is not a someday-maybe tech demo of a future paradigm; it’s (mostly) today’s paradigm, transliterated to new display and input technology. Apple is not (yet) trying to lead the way by demonstrating visionary “killer apps” native to the spatial interface paradigm. But, unlike Meta, they’ll build their device with ultra high-resolution displays and suffer the premium costs, so that you can do mundane-but-central tasks like reading your email and browsing the web comfortably.

On its surface, the iPhone didn’t have totally new killer apps when it launched. It had a mail client, a music player, a web browser, YouTube, etc. The multitouch paradigm didn’t substantively transform what you could do with those apps; it was important because it made those apps possible on the tiny display. The first iPhone was important not because the functionality was novel but because it allowed those familiar tools to be used anywhere. My instinct is that the same story doesn’t quite apply to the Vision Pro, but being generous for a moment, I might suggest its analogous contribution is to allow desktop-class computing in any workspace: on the couch, at the dining table, etc. “The office” as an important, specially-configured space, with “computer desk” and multiple displays, is (ideally) obviated in the same way that the iPhone obviated quick, transactional PC use.

Relatively quickly, the iPhone did acquire many functions which were “native” to that paradigm. A canonical example is the 2008 GPS-powered map, complete with local business data, directions, and live transit information. You could build such a thing on a laptop, but the amazing power of the iPhone map is that I can fly to Tokyo with no plans and have a great time, no stress. Rich chat apps existed on the PC, but the phenomenon of the “group chat” really depended on the ubiquity of the mobile OS paradigm, particularly in conjunction with its integrated camera. Mobile payments. And so on. The story is weaker for the iPad, but Procreate and its analogues are compelling and unique to that form factor. I expect Vision Pro will evolve singular apps, too; I’ll discuss a few of interest to me later in this note. Will its story be more like the iPhone, or more like the iPad and Watch?

It’s worth noting that this developer platform strategy is basically an elaboration of the Catalyst strategy they began a few years ago: develop one app; run it on iOS and macOS. With the Apple Silicon computers, the developer’s participation is not even required: iPad apps can be run directly on macOS. Or, with SwiftUI, you can at least use the same primitives and perhaps much of the same code to make something specialized to each platform. visionOS is running with the same idea, and it seems like a powerful strategy to bootstrap a new platform. The trouble here has been that Catalyst apps (and SwiftUI apps, though somewhat less so) are unpleasant to use on the Mac. This is partially because those frameworks are still glitchy and unfinished, but partially because an application architecture designed for a touch paradigm can’t be trivially transplanted to the information/action-dense Mac interface. Apple makes lots of noises in their documentation about rethinking interfaces for the Mac, but in practice, the result is usually an uncanny iOS app on a Mac display. Will visionOS have the same problem with this strategy? It benefits, at least, from not having decades of “native” apps to compare against.
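
A minimal sketch of what that shared-code story might look like in SwiftUI, with compile-time conditions doing the per-platform specialization; the list contents and the particular modifiers I’ve chosen are just illustrative:

```swift
import SwiftUI

// One view, compiled for iOS, macOS, and visionOS alike, with small
// platform-specific adjustments made at compile time.
struct DocumentList: View {
    let documents = ["Draft", "Notes", "Outline"]

    var body: some View {
        List(documents, id: \.self) { name in
            Text(name)
        }
        #if os(macOS)
        // A denser presentation better suited to a pointer and a large display.
        .listStyle(.inset(alternatesRowBackgrounds: true))
        #elseif os(visionOS)
        // The system glass treatment, so the content sits comfortably in the
        // Shared Space and picks up the room's lighting.
        .glassBackgroundEffect()
        #endif
    }
}
```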

Dreams

If I find the Vision Pro’s launch software suite conceptually conservative, what might I like to see? What sorts of interactions seem native to this paradigm, or could more ambitiously fulfill its unique promise?

[image: Stewart Brand at work, surrounded by pinned-up photos and index cards, from How Buildings Learn]

Huge, persistent infospaces: I love this photo of Stewart Brand in How Buildings Learn. He’s in a focused workspace, surrounded by hundreds of photos and 3”x5” cards on both horizontal and vertical surfaces. It’s a common trope among writers: both to “pickle” yourself in the base material and to spread printed manuscript drafts across every available surface. I’d love to work like this every day, but my “office” is a tiny corner of my bedroom. I don’t have room for this kind of infospace, and even if I did, I wouldn’t want to leave it up overnight in my bedroom. There’s tremendous potential for the Vision Pro here. And unlike the physical version, a virtual infospace could contend with much more material than could actually fit in my field of view, because the computational medium affords dynamic filtering, searching, and navigation interactions (see Softspace for one attempt). And you could swap between persistent room-scale infospaces for different projects. I suspect that visionOS’s windowing system is not at all up to this task. One could prototype the concept with a huge “volume”, but it would mean one’s writing windows couldn’t sit in the middle of all those notes. (Update: maybe a custom Shared Space would work?)

Ubiquitous computing, spatial computational objects: The Vision Pro is “spatial computing”, insofar as windows are arranged in space around you. But it diverges from the classic visions along these lines (e.g. Mark Weiser’s ubiquitous computing, Dynamicland) in that the computation lives in windows. What if programs live in places, live in physical objects in your space? For instance, you might place all kinds of computational objects in your kitchen: timers above your stove; knife work reference overlays above your cutting board; a representation of your fridge’s contents; a catalog of recipes organized by season; etc. Books and notes live not in a virtual 2D window but “out in space”, on my coffee table (solving problems of Peripheral vision). When physical, they’re augmented—with cross-references, commentary from friends, practice activities, etc. Some are purely digital. But both signal their presence clearly from the table while I’m wearing the headset. My memory system is no longer stuck inside an abstract practice session; practice activities appear in context-relevant places, ideally integrating with “real” activities in my environment, as I perform them.

Shared spatial computing: Part of these earlier visions of spatial computing, and particularly of Dynamicland, is that everything I’m describing can be shared. When I’m interacting with the recipe catalog that lives in the kitchen, my wife can walk by, see the “book” open and say “Oh, yeah, artichokes sound great! And what about pairing them with the leftover pork chops?” I’ll reserve judgment about the inherent qualities of the front-facing “eye display” until I see it in person, but no matter how well-executed that is, it doesn’t afford the natural “togetherness” of shared dynamic objects. Particularly exciting will be to create this kind of “togetherness” over distance. I think a “minimum viable killer app” for this platform will be: I can stand at my whiteboard, and draw (with a physical marker!), and I see you next to me, writing on the “same surface”—even though you’re a thousand miles away, drawing on your own whiteboard. FaceTime and Freeform windows floating in my field of view don’t excite me very much as an approximation, particularly since the latter requires “drawing in the air.”

Deja vu

A few elements of visionOS’s design really tickled me because they finally productized some visual interface ideas we tried in 2012 and 2013. It’s been long enough now that I feel comfortable sharing in broad strokes.

The context was that Scott Forstall had just been fired, Jony Ive had taken over, and he wanted to decisively remake iOS’s interface in his image. This meant aggressively removing ornamentation from the interface, to emphasize user content and to give it as much screen real estate as possible. Without borders, drop shadows, and skeuomorphic textures, though, the interface loses cues which communicate depth, hierarchy, and interactivity. How should we make those things clear to users in our new minimal interfaces? With a few other Apple designers and engineers1, I spent much of that year working on possible solutions that never shipped.

You might remember the “parallax effect” from iOS 7’s home screen, the Safari tabs view, alerts, and a few other places. We artificially created a depth effect using the device's motion sensors. Internally, even two months before we revealed the new interface, this effect was system-wide, on every window and control. Knobs on switches and scrubbers floated slightly above the surface. Application windows floated slightly above the wallpaper. Every app had depth-y design specialization: the numbers in the Calculator app floated way above the plane, as if they were a hologram; in Maps, pins, points of interest, and labels floated at different heights by hierarchy; etc. It was eventually deemed too much (“a bit… carnival, don't you think?”) and too battery-intensive. So it's charming to see this concept finally get shipped in visionOS, where UIKit elements seem to get the same depth-y treatments we'd tried in 2012/2013. It's much more natural in the context of a full 3D environment, and the Vision Pro can do a much better job of simulating depth than we'd ever manage with motion sensors.
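
Incidentally, the piece of this that did ship publicly is UIInterpolatingMotionEffect, which is roughly what the home-screen parallax amounts to from an app developer’s perspective. A sketch, with an arbitrary magnitude of my choosing:

```swift
import UIKit

// Offsets a view's center as the device tilts, so the view appears to float
// slightly above whatever is drawn behind it.
func addParallax(to view: UIView, magnitude: CGFloat = 12) {
    let horizontal = UIInterpolatingMotionEffect(
        keyPath: "center.x", type: .tiltAlongHorizontalAxis)
    horizontal.minimumRelativeValue = -magnitude
    horizontal.maximumRelativeValue = magnitude

    let vertical = UIInterpolatingMotionEffect(
        keyPath: "center.y", type: .tiltAlongVerticalAxis)
    vertical.minimumRelativeValue = -magnitude
    vertical.maximumRelativeValue = magnitude

    let group = UIMotionEffectGroup()
    group.motionEffects = [horizontal, vertical]
    view.addMotionEffect(group)
}
```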

A second concept rested on the observation that the new interface might be very white, but there are lots of different kinds of white: acrylic, paper, enamel, treated glass, etc. Some of these are “flat”, while others are extremely reactive to the room. If you put certain kinds of acrylic or etched glass in the middle of a table, it picks up color and lighting quality from everything around it. It’s no longer just “white”. So, what if interactive elements were not white but “digital white”—i.e. the material would be somehow dynamic, perhaps interacting visually with their surroundings? For a couple months, in internal builds, we trialled a “shimmer” effect, almost as if the controls were made of a slightly shiny foil with a subtly shifting gloss as you moved the device (again using the motion sensors). We never could really make it live up to the concept: ideally, we wanted the light to interact with your surroundings. visionOS actually does it! They dynamically adapt the control materials to the lighting in your environment and to your relative pose. And interactive elements are conceptually made of a different material which reacts to your gaze with a subtle gloss effect! Timing is everything, I suppose…

Only some of the WWDC videos about the Vision Pro have been released so far. I imagine my views will evolve as more information becomes available.

Surprising design features

So far, from watching 2023-06-06’s sessions:

  • ARKit persists anchors and mapping data based on your location.
    • Practically speaking, this means that if you use an app to anchor some paintings to your wall at home, then go to your office, you won’t see the paintings there. You can persist new paintings on the walls in your office. Then when you return home, the device will automatically reload the anchors and map associated with that location—i.e. you’ll see the objects you anchored at home.
    • This seems like a critical component of the more ambitious spatial computing paradigms I allude to above, and particularly of many Ambient computing/ubiquitous computing ideas. (See the rough sketch after this list.)
  • It’s very clever, and very important to the system’s overall aesthetic, that they’ve made most interface materials “glass”. This decision leans into the device’s emphasis on AR—computing within your space—rather than on VR—computing within an artificial space.
    • Not only are these materials transparent; their shading also reflects the lighting of your space. And it’s not just backgrounds—secondary and tertiary text elements are tinted (“vibrancy”) to make them coordinate with the scene around you.
    • The net effect is lightness: you can put more “interface” in your visual field without making it feel claustrophobic.
  • This is a “virtual AR” device, so it’s easy to be fooled into thinking that it’s “just” displaying the camera feed, as Quest does. But it also gathers mesh and lighting information about the room, and it uses that to transform the internal projection of the environment in certain cases.
    • For instance, the film player bounces simulated emissive light from the “screen” onto the ceiling and floor, while “dimming” the rest of the room.
    • Controls and interface elements cast shadows onto the 3D environment.
  • “Spatial Personas” allow FaceTime sessions to “break out of” the rectangle and be displayed as simulated 3D representations in space. Pretty wild, if uncanny.
  • The window/volume-centric abstraction allows SharePlay to fully manage synchronization of shared content and personas in 3D space; the app developer just specifies whether this is a “gather-round” type interaction, a “side-by-side” type interaction, etc.
    • These are interesting abstractions, and I can see how they really lower the floor to making interactions “sharable by default” with very little developer intervention, even while they constrain away some more ambitious modalities.
  • In the absence of haptics, spatial audio is emphasized as a continuous feedback mechanism. Unlike most personal computers, this device has built-in speakers which (I’m guessing) can be assumed to be on by default, because there’s probably little leakage to the environment. That’s pretty interesting.
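
Returning to the anchor persistence noted at the top of this list: as far as I can tell from the sessions, the visionOS flavor of this flows through ARKitSession, WorldTrackingProvider, and WorldAnchor. A rough sketch of my understanding; the function and logging are mine, and the exact signatures may not survive the betas.

```swift
import ARKit
import simd

// Pins content to a location in the room; visionOS persists the anchor along
// with the world map for that physical place, and re-delivers it when the
// device recognizes the same location on a later run.
func anchorPainting(at transform: simd_float4x4) async throws {
    let session = ARKitSession()
    let worldTracking = WorldTrackingProvider()

    // World tracking runs inside a Full Space (ImmersiveSpace).
    try await session.run([worldTracking])

    let anchor = WorldAnchor(originFromAnchorTransform: transform)
    try await worldTracking.addAnchor(anchor)

    // On later launches, previously persisted anchors for this location arrive
    // as .added updates, so the app can re-place its content. (This loop
    // listens indefinitely; a real app would manage its lifetime.)
    for await update in worldTracking.anchorUpdates where update.event == .added {
        print("Anchor available:", update.anchor.id)
    }
}
```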

1 Something in the Apple omertà makes me uncomfortable naming my collaborators as I normally would, even as I discuss the project itself. I guess it feels like I’d be implicating them in this “behind-the-scenes” discussion without their consent? Anyway, I want to make clear that I was part of a small team here; these ideas should not be attributed to me.