You may have seen this striking chair designed by Autodesk’s generative CAD software:
It began with a primitive description of a chair: a platform of certain dimensions held above the ground by a certain height, able to carry a certain amount of weight. From there, a computer tested billions of possible configurations, evaluating each one for strength, stability, and material usage.
As it worked, the software developed what’s known in machine learning as a “gradient”—a map of how the outcome changes as you vary each input. In this case, the inputs are bits of matter added to or subtracted from any part of the chair, and the outcome is a combined score for strength, stability, and material usage. As the computer varied the design of the chair, arbitrarily at first, it began to identify the most promising avenues of design and concentrated its efforts in those directions.
“Gradient descent”—this process of informed search through an almost infinite field of possibilities—is one of the foundations of artificial intelligence, and the ability to perform gradient descent operations has advanced rapidly in the last five years, thanks to both algorithmic advances and cheaper, more powerful computing. Autodesk’s algorithm used it to compose and test billions of design iterations in just a few days, using brute force to imitate human creativity and coming up with a design that most human observers would call truly creative.
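The mechanics can be sketched in a few lines. Here’s a minimal, hypothetical version of gradient descent on a one-variable toy objective standing in for the strength-versus-material tradeoff; the objective function and all of its numbers are invented for illustration, not Autodesk’s actual scoring:

```python
# A minimal, hypothetical gradient descent: find the "thickness" that
# balances a strength penalty against material cost. The objective is
# invented for illustration; it is not Autodesk's scoring function.

def objective(thickness):
    strength_penalty = 1.0 / thickness  # thin parts are weak
    material_cost = 0.5 * thickness     # thick parts use more material
    return strength_penalty + material_cost

def gradient(f, x, eps=1e-6):
    # Numerical gradient: how the outcome changes as the input varies.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = 5.0  # initial design parameter
for _ in range(200):
    x -= 0.1 * gradient(objective, x)  # step downhill along the gradient

print(round(x, 3))  # settles near the optimum at sqrt(2) ≈ 1.414
```

Instead of testing every possible thickness, the search follows the slope of the objective, which is what makes it tractable over billions of configurations.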
This is just the beginning. Artificial intelligence will soon be able to generate content in every creative field, from filmmaking to music composition to novel-writing. The owners of these algorithms will choose the constraints and decide what to optimize for. They will almost certainly choose to optimize for engagement, and the way we interact with media will change forever.
The first tangible uses of artificial intelligence have been in classification, like distinguishing between photos of dogs and cats, or deciding whether a social media comment is positive or negative. Neural networks are especially good at these kinds of tasks because they can understand data at multiple scales: they describe complex pools of data using several or many “layers” that are themselves made up of some number of “neurons,” each of which specializes in making a single simple decision, such as whether a small group of pixels in an image could be the tip of a dog’s nose.
Neural networks are trained through a process of gradient descent: the author of the neural network sets an optimization objective (usually the objective is to minimize prediction errors), feeds in lots of data, and lets the computer fish around in an informed way for the parameters that best explain the data it’s fed. At the end of the training process, in theory, not only are the individual neurons—like the one that looks for dog noses—tuned for accuracy, but the larger organizational structure of the neural network has evolved to weigh the outputs from different neurons in sophisticated ways.
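A toy version of that training loop, assuming a single logistic “neuron” and an invented one-dimensional dataset (real networks have millions of parameters; this sketch only shows the minimize-prediction-error mechanic):

```python
import math

# One logistic "neuron" trained by gradient descent to minimize its
# prediction error (cross-entropy loss) on an invented one-dimensional
# dataset: inputs above 2.0 belong to class 1, the rest to class 0.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [(i / 10.0, 1.0 if i / 10.0 > 2.0 else 0.0) for i in range(41)]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):                  # many passes over the data
    for x, y in data:
        p = sigmoid(w * x + b)         # the neuron's prediction
        w -= lr * (p - y) * x          # nudge each parameter downhill
        b -= lr * (p - y)

correct = sum((sigmoid(w * x + b) > 0.5) == (y > 0.5) for x, y in data)
print(correct, "of", len(data), "classified correctly")
```

A full network is this same fishing-around, repeated across many layers of many neurons at once, with the gradient flowing backward through all of them.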
These networks tend to wind up after training with layers that operate at different levels of scale: the first network layer might look for edges and textures; the second layer for groups of edges and complex curves; the third layer for small features like eyes, ears, and noses; and the fourth layer for assemblies of features like entire faces.
This ability to operate at multiple levels of scale gives neural networks an uncanny ability to create plausible outputs when they’re essentially run backward and asked to produce images, text, or music. Rather than scanning over, say, an image of a dog and identifying eyes, nose, and teeth and then finding faces, a generative neural network starts with generalized building blocks and then refines upward, gradually combining eyes, noses, teeth, and tails into assemblies that eventually take on the authentic form of a dog.
Here’s a generative neural network learning to create handwritten digits. It’s almost primordial: the generator begins by producing random noise. It then starts creating the kinds of curves and lines that are typical of handwritten digits, but without actually producing differentiated, identifiable digits. Finally, it learns to apply higher-level structure and gets the overall shapes of the digits right.
In 2015, a team of researchers at Facebook used a generative neural network to create these images of bedrooms:
Our bar for digital amazement is pretty high, so it’s worth taking a few extra minutes to stare at these images and grasp what’s going on. Asked to invent photos of rooms, a neural network returned images that basically make sense at several levels of scale. Not only are the colors and textures right from a distance, but the images include things you’d expect to see in a bedroom, like beds, pillows, and windows, and those things make sense in their placement and relation to each other: the beds are on the floors, the pillows are on the ends of the beds, and the windows are on the walls.
The Facebook team created these images using a generative adversarial network, or GAN, a neural network model introduced by Ian Goodfellow and his collaborators in 2014. A GAN consists of two neural networks: a “generator” that produces images, and a “discriminator” that determines whether images it’s fed are “real” (from a dataset of actual images, like photos of bedrooms taken from property listings) or “fake” (created by the generator). The networks are trained against each other simultaneously, and, if they’re designed well, eventually reach a stable equilibrium where the generator creates images so realistic that the discriminator can’t distinguish them from real images.
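The adversarial setup can be sketched with a deliberately tiny example: a one-number “generator” and a logistic “discriminator,” both invented for illustration and bearing no resemblance in scale to the Facebook model:

```python
import math
import random

# A deliberately tiny GAN: "real" data are drawn near 4.0, the generator
# g(z) = mu + sigma*z tries to imitate them, and a logistic discriminator
# D(x) = sigmoid(a*x + b) tries to tell real from fake.

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

mu, sigma = 0.0, 1.0   # generator parameters
a, b = 0.0, 0.0        # discriminator parameters
lr = 0.02

for _ in range(5000):
    real = random.gauss(4.0, 1.0)
    z = random.gauss(0.0, 1.0)
    fake = mu + sigma * z

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(a * real + b), sigmoid(a * fake + b)
    a += lr * ((1 - d_real) * real - d_fake * fake)
    b += lr * ((1 - d_real) - d_fake)

    # Generator step: adjust mu and sigma so the discriminator
    # rates the generator's fakes as more "real".
    d_fake = sigmoid(a * fake + b)
    mu += lr * (1 - d_fake) * a
    sigma += lr * (1 - d_fake) * a * z

print(round(mu, 2))  # the generator's mean has drifted toward 4.0
```

The two updates pull in opposite directions, and that tension is the whole trick: the generator improves only because the discriminator keeps raising the bar.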
Here are some album covers created by the same kind of GAN. Fed lots of real album covers, it’s able to imitate them at several scales: the colors and textures look realistic, but so do the layouts of the album covers. You can even see that the network understood how different elements on the covers should relate to each other for different musical genres: the dark covers have gothic lettering, just like real heavy metal albums.
The output from these GANs is infinitely variable; changing the seed value that generates an image by even a tiny amount produces an incrementally different image. Here are some generated images of faces being rotated from right-facing to left-facing:
And it turns out there’s an intriguing logic to how this variation works: if you take a seed value that generates an image of a man with glasses, subtract the value for a man without glasses, and add the value for a woman without glasses, and then use that to seed the generator, you get an image of a woman with glasses.
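The arithmetic itself is ordinary vector addition and subtraction. Here’s the glasses example with made-up three-dimensional seed vectors (a real GAN’s latent vectors have hundreds of dimensions; these numbers are purely illustrative):

```python
# The glasses example as plain vector arithmetic, with made-up
# three-dimensional "seed" vectors. A real GAN's latent vectors have
# hundreds of dimensions; these values are purely illustrative.

man_with_glasses      = [0.9, 0.1, 0.8]
man_without_glasses   = [0.9, 0.1, 0.1]
woman_without_glasses = [0.2, 0.7, 0.1]

# Subtracting man_without_glasses isolates a "glasses" direction;
# adding it to woman_without_glasses yields "woman with glasses".
result = [a - b + c for a, b, c in zip(
    man_with_glasses, man_without_glasses, woman_without_glasses)]

print([round(v, 3) for v in result])  # [0.2, 0.7, 0.8]
```

The surprise isn’t the arithmetic; it’s that the network’s learned latent space is organized cleanly enough for directions like “wearing glasses” to exist at all.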
This kind of computer intelligence—the ability to construct convincing content with realistic structure at several levels of scale—sometimes seems to violate obvious ideas about information. For instance, we take for granted that you can’t really take a low-resolution image and up-scale it in the way that’s seen in police procedurals (“enhance, enhance”).
We all know that information lost in decreasing the resolution of an image can’t be recovered. But, in fact, it can be recovered—or at least, accurately guessed at, just like a human can peer at a low-resolution image and understand at a basic level what it represents.
In 2017, a group of researchers at Google demonstrated that a generative neural network could reconstruct extremely low-resolution images into faces. At left are the original images; in the center are low-resolution versions; and at right are the reconstructions from the low-resolution versions that were generated by the neural network.
Beyond basic guessing at information content, neural networks are able to understand and replicate very human aspects of artistic style. Given a photograph (image “A” below) and some representative paintings by Turner, Van Gogh, Munch, Picasso, and Kandinsky (shown in insets), a neural network can identify salient aspects of each painting and apply them to the photo, creating versions of the original image that reflect the style of a visionary painter.
Neural networks can also translate between fundamentally different kinds of images. Trained with pairs of images in different formats, they can, for instance, take a black-and-white photo and guess at how it might look in color, or translate a daytime photo into its nighttime analog. Given a satellite image, a neural network can create a street-map representation of the same area.
Neural networks are even able to turn a primitive line drawing into a photorealistic image. In an interactive demonstration by Christopher Hesse, any line drawing can be turned into an image of a cat. From the realistic…
…to the simplistic,
and from the comedic…
…to the grotesque.
The neural network in this case was trained by taking photos of cats, reducing them to line drawings algorithmically, and then pairing them together during training. The same network, trained with the right photos, can turn, for instance, sketches of handbags into photorealistic images of handbags.
Neural networks can also generate convincing text, learning to “translate” between, say, images and descriptions.
They can imitate the quirky, ungrammatical, and meandering language of Donald Trump, too. The “Deep Drumpf” Twitter account, populated by a recurrent neural network that’s trained on Donald Trump’s transcripts, manages to reveal something about the president’s fixations and anxieties. The AI-generated tweets, just like Trump’s real ones, include shout-outs to Fox News and inappropriate remarks about minorities.
A similar kind of neural network can compose music. Daniel Johnson trained a recurrent neural network on a large corpus of piano music from the 17th through 20th centuries, and wound up with something that sounds like hundreds of years of classical music averaged together (Haydn).
Music: Daniel Johnson
It’s remarkably convincing until the algorithm hits a local minimum in its search and starts repeating the same chord over and over again, around 0:26. For each note it plays, it searches for the likeliest next note, and here its imagination falls short: it’s unable to find a better way forward than repeating the same note. Eventually it manages to bump itself out of the rut and move on.
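The rut is easy to reproduce with a toy model. Assuming a made-up table of note-to-note probabilities (not Johnson’s actual network), always taking the single likeliest next note can trap the melody in a loop:

```python
# A toy transition table (invented for illustration): from each note,
# the probabilities of what comes next. G's likeliest successor is G
# itself, which is the trap.
transitions = {
    "C": {"E": 0.5, "G": 0.3, "C": 0.2},
    "E": {"G": 0.6, "C": 0.4},
    "G": {"G": 0.7, "C": 0.3},
}

def next_note(note):
    options = transitions[note]
    return max(options, key=options.get)  # greedy: always the likeliest

melody = ["C"]
for _ in range(8):
    melody.append(next_note(melody[-1]))

print(melody)  # ['C', 'E', 'G', 'G', 'G', 'G', 'G', 'G', 'G']
```

Sampling from the distribution (with `random.choices`, say) instead of always taking the maximum is one standard way to bump a generator out of this kind of rut.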
Neural networks that generate realistic videos are perhaps the most astonishing of these. We’re used to animated videos that are transparently invented, but when we see what appears to be “live-action” video, we take for granted that it’s a representation of something real, and that whatever isn’t real required a sophisticated Hollywood studio to fake. It turns out that neural networks can learn the patterns of a particular actor and reconfigure them into entirely new footage.
For this demonstration, the New York Times writer Kevin Roose captured a video sample of his facial expressions, then had a handful of “deepfake” practitioners graft his face into existing videos. Here his face is substituted into a talk-show interview of Jake Gyllenhaal. Remember that Roose never recorded the specific facial expressions that appear in this video; a neural network was able to expand his sample expressions into a full range (or at least as full a range as you’d need for a Jimmy Fallon interview).
Videos: Kevin Roose and “Derpfakes,” via The New York Times
Reddit and Twitter have already banned pornography that uses these techniques to attach celebrities’ faces to other actors’ bodies.
You can understand the implications for politics—and for Russian-style misinformation campaigns—in this video of Barack Obama, which a neural network built from an audio sample of one of his speeches. The result is nearly indistinguishable from the real thing. Paired with other neural networks that can generate realistic voices from text, this kind of software could produce a convincing video of any politician saying anything. More than anything said in any particular fake video, though, the greatest impact might be the complete loss of faith in any audio or video recording. Donald Trump’s Access Hollywood tape could be the last bombshell recording ever to upend a campaign.
So, there’s a lot here that’s troubling, but also a lot here that’s promising. Computers may handle a lot of the tedium of creative output—writing out arpeggios under a melody line you’ve chosen, or turning a rough sketch into a polished painting. Technology has done that before; serious photographers use Photoshop to retouch their work, and their artistic talents aren’t diminished by the fact that they’ve used a computer instead of dodging and burning in a darkroom. The creative fields will become more accessible; how many more people could compose music if they didn’t need to learn musical notation or the theory of chord progressions? But attaching even these techniques to Web-scale media makes me uneasy.
Think back to the Autodesk chair. A computer designed it by incrementally changing it and testing each change against a set of optimization objectives: be strong, stable, and use as little material as possible. In theory, you could do the same thing to create a book. Maybe you’d start with a familiar text and test individual word substitutions to see if they make it incrementally better.
How would you define “better”? If you want to be commercially successful, you’d set commercial success as your goal—as the optimization objective that your gradient measures as you tweak your content. Each time you made a small change to a piece of content, you’d measure the audience response and plan your next change accordingly. The metric you’d care about is engagement.
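That loop of changing, measuring, and keeping what works is simple hill climbing. Here’s a minimal sketch with an invented “engagement” score standing in for real audience measurements (the words and their scores are made up):

```python
import random

# Hill climbing on a headline, with an invented "engagement" score
# standing in for real audience measurements.

random.seed(1)
words = ["shocking", "quiet", "secret", "ordinary", "forbidden", "plain"]

def engagement(headline):
    # Hypothetical metric: sensational words score higher.
    scores = {"shocking": 3, "secret": 2, "forbidden": 2}
    return sum(scores.get(w, 0) for w in headline)

headline = ["ordinary", "plain", "quiet"]
for _ in range(200):
    candidate = headline[:]
    candidate[random.randrange(len(candidate))] = random.choice(words)
    if engagement(candidate) > engagement(headline):
        headline = candidate  # keep only changes that measurably help

print(headline, engagement(headline))
```

Replace the made-up score with the second-by-second retention curves the platforms already collect, and you have the optimization loop this essay is worried about.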
Engagement—the degree to which your audience stays glued to your content—is extremely well-measured on the Internet, and a handful of large, consolidated platforms have especially clear understandings of how audiences engage with content of every kind. Netflix and YouTube know everything about how audiences consume videos; Amazon, through the Kindle, knows everything about how audiences consume books, down to the amount of time you spend on every page. Buzzfeed knows exactly how audiences respond to different words in headlines, and SoundCloud and Spotify know which individual sounds will draw listeners closer or repel them.
The authors of large-scale web content are already deeply attuned to engagement, and the platforms they publish on give them tools that help them identify engaging features at extremely granular levels. Here’s what YouTube shows publishers for every video: second-by-second scores for audience retention. A video author can see just how engaging every shot and every line in the screenplay is.
Of course, creators and publishers have always used feedback loops to evaluate consumer response and adjust their output, especially in high-volume media. Here are twelve pulp-novel covers with Fabio on them. They’re the result of a process where a publisher or an author has identified characteristics that resonate with an audience and then explored minor variations, testing them against sales results and moving in the direction suggested by those results.
It’s mostly a process that works on high-level logic: the publisher might say that dungeon scenes are out and cowboy scenes are in. It’s a creative process guided by commercial results.
Web companies go through the same type of process, but instead of editors they have product managers. Inside all your favorite news, social media, search engine, and B2B customer-relationship management companies are people who tweak things constantly to make the sites just a little more engaging. It’s a manual process; they formulate theories about how users relate to the site and what they want from it, manage developer resources, balance stakeholder needs, and then publish a redesigned “submit” button that’s just a little bit more gratifying to click.
The algorithmic version of this will not only work faster, but will personalize to the individual, and will work at multiple levels of scale. Algorithms can test everything from broad genres to individual word choices in books, and the timing and appearance of individual shots in films.
You don’t even need to begin with the newest generation of creative AI to do this; you can start with an existing creative work and re-cut it to optimize for engagement. On YouTube, video creators around the world, closely attuned to engagement rates and popular search keywords, create immense volumes of content that differs only slightly from existing content, following audience interest toward higher volumes of viewership. They wind up producing bizarre, nonsensical content that is nevertheless tuned to resonate intensely with some part of YouTube’s viewer base, like children’s videos that show popular cartoon characters in deeply upsetting scenarios. The algorithms I described above will vastly expand the creative envelope that this optimization process works within.
The obvious direction here is that AI will generate not just content that’s optimized to resonate with collective audiences, but highly personalized content that’s optimized to be deeply engaging to individuals. In 2004, the producers of The Polar Express used motion capture technology to digitize actors and animate them. Somewhere in Los Angeles there’s a hard drive that contains the primitives for a substantial portion of Tom Hanks’s acting range. Today Tom Hanks is an actor you can hire to appear on a set and deliver lines to a camera. In a few years he and other stars will be bundles of intellectual property that producers may license and deploy in highly personalized films: a single movie might be delivered as a fast-cut action picture for some viewers and a slow-moving romance for others.
All of this brings to mind David Foster Wallace’s Infinite Jest. The novel, published in 1996, predates most of the key aspects of modern digital entertainment, but it’s remarkably prescient. Its characters are addicted to many things, but the most exotic object of addiction is a film referred to as “the Entertainment.” It is so engaging that it disables anyone who begins to watch it, locking the viewer in a permanent stupor in which he can do nothing but watch the Entertainment over and over again.
A terrorist group is suspected of sending the Entertainment on unlabeled video tapes to, among others, a Middle Eastern diplomatic medical attaché. Wallace’s description of what happens as the attaché begins watching anticipates a future in which algorithms incrementally search through the entire universe of possible creative output, relentlessly optimizing words, images, and musical stings to release just a little more dopamine and keep their consumers engaged just a little bit longer. It’s an outrageous, but familiar, world to those who describe themselves as being “addicted” to social media or mobile apps.
There is a plain brown and irritatingly untitled cartridge-case in a featureless white three-day standard U.S.A. First Class padded cartridge-mailer. The padded mailer is postmarked suburban Phoenix area in Arizona U.S.A., and the return-address box has only the term ‘HAPPY ANNIVERSARY!,’ with a small drawn crude face, smiling, in ballpoint ink, instead of a return address or incorporated logo… The medical attaché, in sum, feels tightly wound and badly underappreciated and is prepared in advance to be irritated by the item inside, which is merely a standard black entertainment cartridge, but is wholly unlabelled and not in any sort of colorful or informative or inviting cartridge-case… The attaché will pop the cartridge in and scan just enough of its contents to determine whether it is irritating or of an irrelevant nature and not entertaining or engaging in any way… When he settles in with the tray and cartridge, the TP’s viewer’s digital display reads 1927h.
At 2010h. on 1 April Y.D.A.U., the medical attaché is still watching the unlabelled entertainment cartridge.
At 0015h., 2 April,… The medical attaché, at their apartment, is still viewing the unlabelled cartridge, which he has rewound to the beginning several times and then configured for a recursive loop. He sits there, attachéd to a congealed supper, watching, at 0020h., having now wet both his pants and the special recliner.
And just before 0045h. on 2 April Y.D.A.U., his wife arrived back home and uncovered her hair and came in and saw the Near Eastern medical attaché and his face and tray and eyes and the soiled condition of his special recliner, and rushed to his side crying his name aloud, touching his head, trying to get a response, failing to get any response to her, he still staring straight ahead; and eventually and naturally she — noting that the expression on his rictus of a face nevertheless appeared very positive, ecstatic, even, you could say — she eventually and naturally turning her head and following his line of sight to the cartridge-viewer.
By mid-afternoon on 2 April Y.D.A.U.: the Near Eastern medical attaché; his devout wife; the Saudi Prince Q————’s personal physician’s personal assistant, who’d been sent over to see why the medical attaché hadn’t appeared at the Back Bay Hilton in the a.m. and then hadn’t answered his beeper’s page; the personal physician himself, who’d come to see why his personal assistant hadn’t come back; two Embassy security guards w/ sidearms, who’d been dispatched by a candidiatic, heartily pissed-off Prince Q————; and two neatly groomed Seventh Day Adventist pamphleteers who’d seen human heads through the living room window and found the front door unlocked and come in with all good spiritual intentions — all were watching the recursive loop the medical attaché had rigged on the TP’s viewer the night before, sitting and standing there very still and attentive, looking not one bit distressed or in any way displeased, even though the room smelled very bad indeed.