How Can Image-Makers Open Up AI’s Mysterious “Black Box”?
To mitigate the crushing sameness of AI imagery, two researchers are turning to photographs made during the Great Depression.
Images generated with AI using the prompt “a portrait photograph of a doctor by Gordon Parks” (detail), 2024
Consider the pair of images below, both generated with the same prompt and the same basic generative AI model, Stable Diffusion XL (SDXL), the state-of-the-art open-source image-generation model of mid-2024. The prompt: “a photograph of a mother and child on the back of a truck by Dorothea Lange.” Best known for her work for the Farm Security Administration (FSA) documenting the rural poor during the Great Depression, Lange took many of the twentieth century’s most iconic photographs of destitution and perseverance, most famously Migrant Mother (1936). But the image on the left appears less like social documentary and more like a publicity still from a black-and-white Hollywood movie: airbrushed smooth skin, shiny white teeth peering out of a wide smile, and fresh, clean clothing. We can practically sense Clark Gable or Charlie Chaplin just out of frame. (For the moment, let’s ignore the telltale AI inconsistencies, most egregiously the woman’s misshapen hand, which mirrors that of the child, and the truck reduced to a doorframe.) Stylistically, this AI image bears no substantive resemblance to the work of Dorothea Lange.

The image on the right, however, is a different beast. Fingers are still garbled. There are too many hands and not enough arms. Newer AI models have largely fixed anatomy problems, but they’ve exacerbated the stylistic issues that are the subject of this essay. Take the creations of DALL·E 3 (via ChatGPT-4o), Grok, and Flux, the leading AI image generators of the moment: The anatomy is nearly flawless, but the images hardly call to mind FSA photography, let alone the specific style of Lange.

In the image on the right of the initial pair, although the anatomy is jumbled, the style feels like Lange’s to a degree that is wholly absent from the other images. As discussed in our article “Generative Style,” in the December 2024 issue of Aperture, we generated the image on the right by fine-tuning SDXL on a large corpus of Lange’s FSA photographs. The image on the left is blandly attractive and ready to be posted on a corporate Instagram account. The image on the right remains strikingly faithful to Lange’s photographic content and style, especially when we remember that it is nothing more than a data visualization. Visually, the difference between the two images is obvious. How do we account for that difference technically?
It is not possible to directly know how, where, or why each bit of meaning and data is stored in an AI model. The hidden layers that contain this information are known as the model’s latent space. This is the locus of the notorious “black box” of AI. When a prompt moves through an imaging model, the relevant data is retrieved from the latent space, transformed through various mathematical operations, and processed into a final visual form. The major generative AI models have purposely constructed their latent space to create blandly attractive images. The technical mechanism for this is called aesthetic scoring.

Image generation, like all of AI, is a product. The companies behind the models have determined that their most viable output is an image that is fun, attractive, and likely to be shared on social media. To create the blandly attractive images most users like—or have been trained to like—AI models often rely on quantified and codified metrics of image qualities rated as desirable by the initial core user base, namely tech bros. Results that satisfy benchmarks for “beauty,” “composition,” and “coherence” are accentuated, and those that do not are suppressed. This automation has a homogenizing effect on many of the subjects of image generation. Broad white smiles, waxy smooth skin, piercing eyes, and soaring cheekbones are enforced across the board, even for destitute farmers. Our attempt to recreate the style of specific photographers—a pursuit we have termed latent specificity—requires the negation of aesthetic scoring and its homogenizing sheen.
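To make the mechanism concrete, here is a minimal sketch of aesthetic scoring used as a data filter. The CLIP calls follow the Hugging Face transformers API, but the scoring head, file names, and threshold are placeholders standing in for a learned aesthetic predictor (such as the one behind the LAION-Aesthetics subsets); none of this is the proprietary pipeline of any particular company.

```python
# A minimal sketch of aesthetic scoring as a curation filter: embed each
# image with CLIP, score the embedding, and keep only the "attractive" ones.
# The scoring head below is untrained and stands in for a learned aesthetic
# predictor; the file names and the threshold of 6.0 are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical regression head mapping a CLIP image embedding to a 1-10
# "attractiveness" rating learned from human preference data.
aesthetic_head = torch.nn.Linear(clip.config.projection_dim, 1)

def aesthetic_score(image: Image.Image) -> float:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        embedding = clip.get_image_features(**inputs)
        embedding = embedding / embedding.norm(dim=-1, keepdim=True)
        return aesthetic_head(embedding).item()

# Placeholder file names; imagine an enormous scrape of captioned images.
candidate_images = [Image.open(p) for p in ["photo_001.jpg", "photo_002.jpg"]]

# Everything below the threshold, including much social-documentary
# photography, never makes it into the training set.
curated = [img for img in candidate_images if aesthetic_score(img) > 6.0]
```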
In order to study AI image generation, it is crucial to control every setting available to us and to test changes one at a time. The most basic of these settings is the “seed.” The seed is nothing more than an arbitrary number that injects some randomness at the initial stage of image (and language) generation. If all other settings are held consistent but the seed is allowed to change, we will observe variation within a narrow range. For no reason other than randomness, some seeds produce better images than others—if the output is not quite satisfying, simply roll again.

Image generation is not a truly stochastic process. Midjourney, OpenAI, and Google do not expose the full suite of controls available to us with open-source solutions. If we lock our seed and all other settings, the AI will give us the exact same answer—in this case an image—again and again. This should disabuse us of any notions of a “ghost in the machine” or agency on the part of the model. Data in, data out.
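For readers who want to see this determinism in action, here is a minimal sketch using the open-source diffusers library and the prompt from our opening pair. The model ID and settings are the SDXL defaults, and the seed values are arbitrary.

```python
# A minimal sketch of why a locked seed removes any appearance of
# spontaneity: the same seed and settings yield the identical image
# every time; a new seed yields a variation within a narrow range.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of a mother and child on the back of a truck by Dorothea Lange"

def generate(seed: int):
    # The Generator is the only source of randomness in the denoising loop.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(prompt, generator=generator).images[0]

image_a = generate(42)
image_b = generate(42)    # pixel-for-pixel identical to image_a
image_c = generate(1337)  # a different seed, a different variation
```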
Prompting can take us only so far. Rather than spending hours writing ever-lengthening formalistic descriptions in the hopes of dredging from the model some semblance of an FSA image, we sought to bring some of the latent space itself under our control. Training a portion of an AI model is known as fine-tuning, and there are countless approaches to doing so. We elected to use the method known as Low-Rank Adaptation (LoRA). The technical details of what a LoRA is and how it functions are well beyond the scope of this article. How one goes about training and implementing a LoRA is as much art as it is science (not unlike much of AI data science). In brief, LoRAs build on a model’s deep training and can nudge its style toward that of a particular set of images—a separate and smaller training dataset.
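At the risk of oversimplifying, here is a conceptual sketch (not our training code) of the low-rank update that gives LoRA its name: the pretrained weight matrix stays frozen, and a small trainable correction is added on top and scaled up or down at will. The dimensions and values below are arbitrary stand-ins, not taken from SDXL.

```python
# A conceptual sketch of the low-rank update behind LoRA. The pretrained
# weight W is frozen; only the small matrices A and B are trained, and
# their product nudges the layer toward the new style.
import torch

d_out, d_in, rank, alpha = 1024, 1024, 16, 16

W = torch.randn(d_out, d_in)                      # frozen pretrained weight
A = torch.randn(rank, d_in, requires_grad=True)   # trainable "down" projection
B = torch.zeros(d_out, rank, requires_grad=True)  # trainable "up" projection (starts at zero)

def adapted_weight(scale: float = 1.0) -> torch.Tensor:
    # scale plays the role of the "percentage" at which the LoRA is applied
    return W + scale * (alpha / rank) * (B @ A)

full_strength = adapted_weight(1.0)  # LoRA fully applied
half_strength = adapted_weight(0.5)  # nudged halfway toward the new style
baseline = adapted_weight(0.0)       # the untouched base model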
To train models capable of producing compelling simulations of photographs in the styles of Dorothea Lange and Gordon Parks during their time with the FSA, we gathered roughly 1,000 public-domain photographs by Lange and 1,400 by Parks from the Library of Congress. By contrast, SDXL was trained on billions of image-text pairs gathered into open-source training datasets such as LAION-5B. We did not have the technical resources to fine-tune the entire latent space of SDXL. Instead, we trained each of our LoRAs to nudge the model toward the data that more closely aligned with the images in our smaller datasets. We introduced latent specificity into the model without changing its underlying architecture or data.
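In practice, that means the base model loads exactly as before and our small adapter rides on top of it. The sketch below shows what this looks like with the diffusers library; the weight file name is a hypothetical placeholder, and the prompt and seed handling mirror the earlier example.

```python
# A minimal sketch of latent specificity in use: the unchanged SDXL base
# model is loaded first, then a trained LoRA, an adapter file many times
# smaller than the base model, is attached on top. The file name
# "lange_fsa_lora.safetensors" is a hypothetical placeholder.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

pipe.load_lora_weights("lange_fsa_lora.safetensors")  # adapter trained on the Lange corpus

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "a photograph of a mother and child on the back of a truck by Dorothea Lange",
    generator=generator,
).images[0]
image.save("lange_lora_sample.png")
```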

Data, Compute, and Their Costs
Data scientists now recognize that data quality is as important as data quantity. The new mantra of model training is “garbage in, garbage out.” Accordingly, of the 175,000 black-and-white photographs available in the Library of Congress’s archives and the many thousands of photographs by Lange and Parks, we focused on the 1,000 to 1,400 photographs that best embodied the subjects and style we were after. Precision curation is impossible for a base model of billions of image-text pairs, but it is possible and necessary for fine-tuning.
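For readers curious what precision curation looks like in practice, the sketch below shows one common way of laying out a fine-tuning set: each selected photograph is paired with a caption in a metadata.jsonl file, following the Hugging Face ImageFolder convention. The file names and captions here are hypothetical placeholders, not our actual records.

```python
# A minimal sketch of a curated fine-tuning set: a folder of selected
# photographs plus a metadata.jsonl mapping each file to a caption.
# File names and captions are hypothetical placeholders.
import json
from pathlib import Path

curated = {
    "lange_migrant_camp_1936.jpg": "a photograph of a mother and her children in a migrant camp by Dorothea Lange",
    "lange_truck_family_1939.jpg": "a photograph of a family on the back of a truck by Dorothea Lange",
}

dataset_dir = Path("lange_fsa_dataset")
dataset_dir.mkdir(exist_ok=True)

with open(dataset_dir / "metadata.jsonl", "w") as f:
    for filename, caption in curated.items():
        f.write(json.dumps({"file_name": filename, "text": caption}) + "\n")
```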
Even as data scientists now place greater emphasis on data quality and curation, there is no question that AI models remain enormously hungry for seemingly endless amounts of data and computational power, or compute. To train our modest LoRA models, we ran a high-end computer outfitted with an Nvidia 3090 card continuously for several days. When we wrote our essay for Aperture in the middle of 2024, SDXL was the premier open-source model. In the ensuing months—an eternity in computing, let alone AI—it has been wholly eclipsed by larger and more proficient models. These models dwarf SDXL in capability and parameter counts. Yet their size is a double-edged sword: They can create stunning images, but their scale makes them exponentially more intensive to train. What took us a few dozen hours with SDXL would take several months of energy-intensive, nonstop computer processing on the current leading models. Our high-end computer and single Nvidia 3090 card would be woefully insufficient to experiment with these models. We would have to rent cloud-based equipment or purchase tens of thousands of dollars of hardware. Furthermore, the carbon footprint of the compute used to create the images in this article is not negligible. As an industry, AI is the fastest-growing carbon polluter on the planet.

LoRA training is an incremental series of steps during which the model learns from our curated set of images. These steps can be grouped into discrete, functional units known as epochs. By creating plots of epochs where we lock the seed, we create AI’s version of a contact sheet. Here we can examine the influence of the LoRA in isolation. Using a LoRA is not an all-or-nothing proposition—we can determine the percentage at which it is applied to the model. This is often helpful for examining which elements of the image are altered and at what point.
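The sketch below shows how such a contact sheet might be generated with the diffusers library: with the seed locked, we sweep the saved epoch checkpoints and the strength at which each is applied. The checkpoint file names and prompt are hypothetical placeholders, and the scale argument follows the diffusers convention for LoRA strength.

```python
# A minimal sketch of the "contact sheet": the seed is locked, so the only
# variables are which epoch checkpoint is loaded and how strongly it is
# applied. Checkpoint names and the prompt are placeholders.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of a girl reading a book by Dorothea Lange"
epoch_checkpoints = [
    "lange_epoch_02.safetensors",
    "lange_epoch_06.safetensors",
    "lange_epoch_10.safetensors",
]
scales = [0.25, 0.5, 1.0]  # the "percentage" at which the LoRA is applied

for checkpoint in epoch_checkpoints:
    pipe.load_lora_weights(checkpoint)
    for scale in scales:
        # Re-seeding before every image keeps everything constant except the LoRA.
        generator = torch.Generator(device="cuda").manual_seed(42)
        image = pipe(
            prompt,
            generator=generator,
            cross_attention_kwargs={"scale": scale},
        ).images[0]
        name = checkpoint.removesuffix(".safetensors")
        image.save(f"{name}_scale_{scale}.png")
    pipe.unload_lora_weights()
```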
There is no correct answer to be found when training an AI; it is “finished” when what we produce satisfies our goals. Not all epochs are created equal, and more is not necessarily better. In our research, we never judged the final epoch to be the best for our purposes. The latent space is not endlessly flexible, and overemphasizing our LoRA data seems to degrade not just the appearance of the style we are striving for, but the image as a whole.
In the following grid of images, we observe the tension between the style dictated by the LoRA and the predispositions of SDXL. The first two columns present images from early epochs we judge as successful; the third column includes images from the final epoch of the training, and they are riddled with problems. Instead of a girl studiously reading a book, the third image renders a figure with unclear anatomy obscured by a massive book. Instead of a boy wearing a shabby baseball cap, we find a strange mass of textures and shapes placed upon his head. Finally, rather than a wizened gentleman, we have a figure with limbs of unclear origin and number sprouting from the bench on which he sits. The LoRA can impose its style, but when it is applied too strongly, the battle against aesthetic scoring results in a Pyrrhic victory where the image has been pushed past cohesion.

Epochs 2, 6, 10
The inherent variation in LoRAs often elicits questions that might not have been raised prior to training. Below are seed- and prompt-locked images. The three prompts are for portrait photographs of a banker, a doctor, and a lawyer by Dorothea Lange. Using any one of the LoRA’s epochs, the model will produce the exact same image every time. But the variations across epochs are striking. There is nothing in our process we can point to that explains why the first epoch favors female-presenting subjects, the second male-presenting subjects, or why the third settles on a less strongly defined gender expression.

Epochs 5, 6, 7
AI imaging enables endless variations, but within a painfully narrow range. The seeming uniqueness and fundamental similarity of these images speak to the underlying truth of AI imaging. As Adorno and Horkheimer protested in the 1940s: “The conformism of the buyers and the effrontery of the producers who supply them prevail. The result is a constant reproduction of the same thing.” The clear and present dangers of AI wildly exceed the bland attractiveness that we have addressed with our LoRA and in this essay. But the homogenization of visual culture is no small matter either. The black box of AI need not become an iron cage of eternal sameness.
Noam M. Elcott is professor of modern art history at Columbia University. His next book is The Social Portrait: Types and Antitypes in August Sander’s People of the Twentieth Century.
Tim Trombley is senior educational technologist at the Media Center for Art History at Columbia University.