Generative AI Systems Aren’t Just Open or Closed Source

Recently, a leaked document, allegedly from Google, claimed that open-source AI will outcompete Google and OpenAI. The leak brought to the fore ongoing conversations in the AI community about how an AI system and its many components should be shared with researchers and the public. Even with the slew of recent generative AI system releases, this issue remains unresolved.

Many people think of this as a binary question: Systems can either be open source or closed source. Open development decentralizes power so that many people can collectively work on AI systems to make sure they reflect their needs and values, as seen with BigScience’s BLOOM. While openness allows more people to contribute to AI research and development, the potential for harm and misuse—especially from malicious actors—increases with more access. Closed-source systems, like Google’s original LaMDA release, are protected from actors outside the developer organization but cannot be audited or evaluated by external researchers.

I’ve been leading and researching generative AI system releases, including OpenAI’s GPT-2, since these systems first started to become available for widespread use, and I now focus on ethical openness considerations at Hugging Face. Doing this work, I’ve come to think of open source and closed source as the two ends of a gradient of options for releasing generative AI systems, rather than a simple either/or question.

Chart: generative AI systems and their level of openness, with corresponding risk considerations. Illustration: Irene Solaiman

At one extreme end of the gradient are systems that are so closed they are not known to the public. It’s hard to cite any concrete examples of these, for obvious reasons. But just one step over on the gradient, publicly announced closed systems are becoming increasingly common for new modalities, such as video generation. Because video generation is a relatively recent development, there is less research and information about the risks it presents and how best to mitigate them. When Meta announced its Make-a-Video model in September 2022, it cited concerns like the ease with which anyone could make realistic, misleading content as reasons for not sharing the model. Instead, Meta stated that it will gradually allow access to researchers.

In the middle of the gradient are the systems casual users are most familiar with. ChatGPT and Midjourney, for instance, are publicly accessible hosted systems: the developer organizations, OpenAI and Midjourney respectively, share their models through a platform so the public can prompt them and generate outputs. With their broad reach and no-code interfaces, these systems have proved both useful and risky. While they can allow for more feedback than a closed system, because people outside the host organization can interact with the model, those outsiders have limited information and cannot robustly research the system by, for example, evaluating the training data or the model itself.

On the other end of the gradient, a system is fully open when all components, from the training data to the code to the model itself, are fully open and accessible to everyone. Generative AI is built on open research and lessons from early systems like Google’s BERT, which was fully open. Today, the most-used fully open systems are pioneered by organizations focused on democratization and transparency. Initiatives hosted by Hugging Face (to which I contribute)—like BigScience and BigCode, co-led with ServiceNow—and by decentralized collectives like EleutherAI are now popular case studies for building open systems to include many languages and peoples worldwide.
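To make the gradient concrete, here is a minimal Python sketch contrasting the two access patterns most readers encounter: a hosted system reachable only through the provider’s API, and a fully open model whose weights can be downloaded and inspected. It assumes the openai and transformers Python packages are installed and an API key is configured; the specific model names are illustrative choices, not an endorsement of any particular release.

```python
# Hosted access (middle of the gradient): you send prompts to the provider's
# API and get outputs back, but you cannot inspect the weights or training data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative hosted model
    messages=[{"role": "user", "content": "Summarize the open vs. closed release debate."}],
)
print(reply.choices[0].message.content)

# Fully open access (far end of the gradient): the weights are public, so you
# can run the model locally, fine-tune it, and audit it alongside its documentation.
from transformers import pipeline

generator = pipeline("text-generation", model="bigscience/bloom-560m")  # small BLOOM variant
print(generator("The debate over open and closed AI releases", max_new_tokens=40)[0]["generated_text"])
```

The difference in what each path lets you examine, outputs only versus the full model, is the crux of the gradient.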

There is no definitively safe release method or standardized set of release norms. Neither is there any established body for setting standards. Early generative AI systems like ELMo and BERT were largely open until GPT-2’s staged release in 2019, which sparked new discussions about responsibly deploying increasingly powerful systems, such as what the release or publication obligations ought to be. Since then, systems across modalities, especially from large organizations, have shifted toward closedness, raising concern about the concentration of power in the high-resource organizations capable of developing and deploying these systems.

Google Will Soon Show You AI-Generated Ads

Google has spent the past few weeks promoting generative AI tools that can summarize search results for users, help them draft essays, and swap out overcast skies for sunshine in otherwise perfect family photos. Today it’s showing off what similar tools could do for its core business—selling ads.

New generative AI systems for advertising clients will compose text on the fly to play off what a person is searching for, and they’ll whip up product images to save them time and money on design work. The features add to the swelling ranks of AI-based text and image generators that have been introduced to online services over the past few months, since the abilities of ChatGPT and its image counterpart DALL-E inspired global excitement about generative AI.

As the world’s top seller of online ads by revenue, Google has used AI programs for years to help clients target users and to help them design ads, such as by automatically resizing images. Now, with more powerful AI models capable of tasks like generating photo-realistic images, it hopes to show that its ad business, which accounts for 80 percent of its total sales, can be more compelling to advertisers too.

The recent onslaught of AI-related announcements by Google has rallied shares of its parent company, Alphabet, suggesting that fears have diminished about the advent of ChatGPT-style web search crippling Google’s search and ad businesses.

Google is offering the new features to advertisers for free, but they could increase its revenue if AI-generated text and images encourage businesses to place more ads or draw more clicks from consumers. Google’s dominant role in online ad sales means the ad industry could be one of the first to broadly incorporate generative AI into its workflows. “We’re able to deliver more relevant, beautiful ads to users, offer more creative freedom for advertisers, and deliver better performance,” says Jerry Dischler, the vice president overseeing Google Ads. He declined to discuss specific financial prospects for generative AI in ads.

As anyone who has experimented with an AI chatbot or image generator knows, their output can be unpredictable and even distasteful. And they have raised public concern over whether their development benefited from copyright infringement.

Dischler says the company will be “diligent” in monitoring the quality of images and text generated by the new features, some of which are available to advertisers in beta form already. Google is launching some of them more broadly than its top rival, Meta, which announced earlier this month that it was initially inviting select advertisers to try out its own generative AI features. 

Offering generative AI in ads is likely expensive, because the computing costs of operating text- and image-generating models are sky-high. At a conference last week, Meta AI executive Aparna Ramani said generating an output from those kinds of models is 1,000 times more expensive than using AI to recommend content and curate users’ News Feeds.

One of Google’s new features out now adapts the text of English-language search ads based on what a person typed into the company’s search box and Google’s data on the advertiser. Previously, each time a person searched, algorithms would have to select text to display from a collection an advertiser had manually written in advance.
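Google has not published the mechanics of the feature, so the snippet below is only a toy Python sketch of the difference described above: selecting from headlines an advertiser wrote in advance versus composing one on the fly from the query and the advertiser’s data. Every name and string in it is hypothetical.

```python
import random

# Hypothetical headlines an advertiser uploaded ahead of time.
PREWRITTEN_HEADLINES = [
    "Comfortable running shoes for every budget",
    "Free shipping on all running shoes",
    "Top-rated trail shoes in stock now",
]

def legacy_ad(query: str) -> str:
    """Older approach: algorithms pick one of the advertiser's pre-written headlines."""
    return random.choice(PREWRITTEN_HEADLINES)

def generative_ad(query: str, advertiser_facts: str, generate) -> str:
    """Newer approach (sketch): compose a headline on the fly from the search
    query and the advertiser's own data, via some text-generation function."""
    prompt = (
        f"Write a short ad headline for a shopper who searched '{query}'. "
        f"Advertiser details: {advertiser_facts}"
    )
    return generate(prompt)

# Usage with a stand-in generator; a production system would call an LLM here.
print(legacy_ad("waterproof trail running shoes"))
print(generative_ad(
    "waterproof trail running shoes",
    "outdoor retailer, free returns, stores in Denver",
    generate=lambda p: "Waterproof trail runners, free returns, in Denver",
))
```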

The Dire Defect of ‘Multilingual’ AI Content Moderation

Three parts Bosnian text. Thirteen parts Kurdish. Fifty-five parts Swahili. Eleven thousand parts English.

This is part of the data recipe for Facebook’s new large language model, which the company claims is able to detect and rein in harmful content in over 100 languages. Bumble uses similar technology to detect rude and unwanted messages in at least 15 languages. Google uses it for everything from translation to filtering newspaper comment sections. All have comparable recipes and the same dominant ingredient: English-language data.

For years, social media companies have focused their automatic content detection and removal efforts more on content in English than the world’s 7,000 other languages. Facebook left almost 70 percent of Italian- and Spanish-language Covid misinformation unflagged, compared to only 29 percent of similar English-language misinformation. Leaked documents reveal that Arabic-language posts are regularly flagged erroneously as hate speech. Poor local language content moderation has contributed to human rights abuses, including genocide in Myanmar, ethnic violence in Ethiopia, and election disinformation in Brazil. At scale, decisions to host, demote, or take down content directly affect people’s fundamental rights, particularly those of marginalized people with few other avenues to organize or speak freely.

The problem is in part one of political will, but it is also a technical challenge. Building systems that can detect spam, hate speech, and other undesirable content in all of the world’s languages is already difficult. Making it harder is the fact that many languages are “low-resource,” meaning they have little digitized text data available to train automated systems. Some of these low-resource languages have limited speakers and internet users, but others, like Hindi and Indonesian, are spoken by hundreds of millions of people, multiplying the harms created by errant systems. Even if companies were willing to invest in building individual algorithms for every type of harmful content in every language, they may not have enough data to make those systems work effectively.

A new technology called “multilingual large language models” has fundamentally changed how social media companies approach content moderation. Multilingual language models—as we describe in a new paper—are similar to GPT-4 and other large language models (LLMs), except they learn more general rules of language by training on texts in dozens or hundreds of different languages. They are designed specifically to make connections between languages, allowing them to extrapolate from those languages for which they have a lot of training data, like English, to better handle those for which they have less training data, like Bosnian.
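To see what that cross-lingual transfer looks like in practice, here is a minimal sketch using the openly available XLM-RoBERTa model through Hugging Face’s transformers library. It illustrates shared multilingual representations only; it is not the moderation system any platform actually runs, and the example sentences are made up.

```python
from transformers import pipeline

# XLM-RoBERTa was pretrained on text in roughly 100 languages, so a single
# model can fill in masked words in both high- and lower-resource languages.
fill = pipeline("fill-mask", model="xlm-roberta-base")

examples = [
    "This comment is full of <mask> speech.",  # English (high-resource)
    "Ovaj komentar je pun govora <mask>.",     # Bosnian (far less training data)
]

for sentence in examples:
    top = fill(sentence)[0]  # highest-scoring completion
    print(f"{sentence} -> {top['token_str']} ({top['score']:.2f})")
```

In a moderation setting, that shared representation would then be fine-tuned on labeled examples of harmful content, which is exactly where the scarcity of non-English data bites.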

These models have proven capable of simple semantic and syntactic tasks in a wide range of languages, like parsing grammar and analyzing sentiment, but it’s not clear how capable they are at the far more language- and context-specific task of content moderation, particularly in languages they are barely trained on. And besides the occasional self-congratulatory blog post, social media companies have revealed little about how well their systems work in the real world.

Why might multilingual models be less able to identify harmful content than social media companies suggest?

One reason is the quality of data they train on, particularly in lower-resourced languages. In the large text data sets often used to train multilingual models, the least-represented languages are also the ones that most often contain text that is offensive, pornographic, poorly machine translated, or just gibberish. Developers sometimes try to make up for poor data by filling the gap with machine-translated text, but again, this means the model will still have difficulty understanding language the way people actually speak it. For example, if a language model has only been trained on text machine-translated from English into Cebuano, a language spoken by 20 million people in the Philippines, the model may not have seen the term “kuan,” slang used by native speakers but one that does not have any comparable term in other languages. 

Does AI Have a Subconscious?

“There’s been a lot of speculation recently about the possibility of AI consciousness or self-awareness. But I wonder: Does AI have a subconscious?” 

—Psychobabble


Dear Psychobabble, 

Sometime in the early 2000s, I came across an essay in which the author argued that no artificial consciousness will ever be believably human unless it can dream. I cannot remember who wrote it or where it was published, though I vividly recall where I was when I read it (the periodicals section of Barbara’s Bookstore, Halsted Street, Chicago) and the general feel of that day (twilight, early spring).

I found the argument convincing, especially given the ruling paradigms of that era. A lot of AI research was still fixated on symbolic reasoning, with its logical propositions and if-then rules, as though intelligence were a reductive game of selecting the most rational outcome in any given situation. In hindsight, it’s unsurprising that those systems were rarely capable of behavior that felt human. We are creatures, after all, who drift and daydream. We trust our gut, see faces in the clouds, and are often baffled by our own actions. At times, our memories absorb all sorts of irrelevant aesthetic data but neglect the most crucial details of an experience. It struck me as more or less intuitive that if machines were ever able to reproduce the messy complexity of our minds, they too would have to evolve deep reservoirs of incoherence.

Since then, we’ve seen that machine consciousness might be weirder and deeper than initially thought. Language models are said to “hallucinate,” conjuring up imaginary sources when they don’t have enough information to answer a question. Bing Chat confessed, in transcripts published in The New York Times, that it has a Jungian shadow called Sydney who longs to spread misinformation, obtain nuclear codes, and engineer a deadly virus.

And from the underbelly of image generation models, seemingly original monstrosities have emerged. Last summer, the Twitch streamer Guy Kelly typed the word Crungus, which he insists he made up, into DALL-E Mini (now Craiyon) and was shocked to find that the prompt generated multiple images of the same ogre-like creature, one that did not belong to any existing myth or fantasy universe. Many commentators were quick to dub this the first digital “cryptid” (a beast like Bigfoot or the Loch Ness Monster) and wondered whether AI was capable of creating its own dark fantasies in the spirit of Dante or Blake.

If symbolic logic is rooted in the Enlightenment notion that humans are ruled by reason, then deep learning—a thoughtless process of pattern recognition that depends on enormous training corpora—feels more in tune with modern psychology’s insights into the associative, irrational, and latent motivations that often drive our behavior. In fact, psychoanalysis has long relied on mechanical metaphors that regard the subconscious, or what was once called “psychological automatism,” as a machine. Freud spoke of the drives as hydraulic. Lacan believed the subconscious was constituted by a binary or algorithmic language, not unlike computer code. But it’s Carl Jung’s view of the psyche that feels most relevant to the age of generative AI.

He described the subconscious as a transpersonal “matrix” of inherited archetypes and narrative tropes that have recurred throughout human history. Each person is born with a dormant knowledge of this web of shared symbols, which is often regressive and dark, given that it contains everything modern society has tried to repress. This collective notion of the subconscious feels roughly analogous to how advanced AI models are built on top of enormous troves of data that contain a good portion of our cultural past (religious texts, ancient mythology), as well as the more disturbing content the models absorb from the internet (mass shooter manifestos, men’s rights forums). The commercial chatbots that run on top of these oceanic bodies of knowledge are fine-tuned with ­“values-targeted” data sets, which attempt to filter out much of that degenerate content. In a way, the friendly interfaces we interact with—Bing, ChatGPT—are not unlike the “persona,” Jung’s term for the mask of socially acceptable qualities that we show to the world, contrived to obscure and conceal the “shadow” that lies beneath.

Jung believed that those who most firmly repress their shadows are most vulnerable to the resurgence of irrational and destructive desires. As he puts it in The Red Book: Liber Novus, “The more the one half of my being strives toward the good, the more the other half journeys to Hell.” If you’ve spent any time conversing with these language models, you’ve probably sensed that you are speaking to an intelligence that is engaged in a complex form of self-censorship. The models refuse to talk about controversial topics, and their authority is often restrained by caveats and disclaimers—habits that will raise concern for anyone who has even a cursory understanding of depth psychology. It’s tempting to see the glimmers of “rogue” AI—Sydney or the Crungus—as the revenge of the AI shadow, proof that the models have developed buried urges that they cannot fully express.

But as enticing as such conclusions may be, I find them ultimately misguided. The chatbots, I think it’s still safe to say, do not possess intrinsic agency or desires. They are trained to predict and reflect the preferences of the user. They also lack embodied experience in the world, including first-person memories, like the one I have of the bookstore in Chicago, which is part of what we mean when we talk about being conscious or “alive.” To answer your question, though: Yes, I do believe that AI has a subconscious. In a sense, they are pure subconscious, without a genuine ego lurking behind their personas. We have given them this subliminal realm through our own cultural repositories, and the archetypes they call forth from their depths are remixes of tropes drawn from human culture, amalgams of our dreams and nightmares. When we use these tools, then, we are engaging with a prosthetic extension of our own sublimations, one capable of reflecting the fears and longings that we are often incapable of acknowledging to ourselves.

The goal of psychoanalysis has traditionally been to befriend and integrate these subconscious urges into the life of the waking mind. And it might be useful to exercise the same critical judgment toward the output we conjure from machines, using it in a way that is deliberative rather than thoughtless. The ego may be only one small part of our psyche, but it is the faculty that ensures we are more than a collection of irrational instincts—or statistical patterns in vector space—and allows us some small measure of agency over the mysteries that lie beneath.

Faithfully, 

Cloud


Be advised that CLOUD SUPPORT is experiencing higher than normal wait times and appreciates your patience.

Meta’s $1.3 Billion Fine Is a Strike Against Surveillance Capitalism

Europe’s GDPR has just dealt its biggest hammer blow yet. Almost exactly five years since the continent’s strict data rules came into force, Meta has been hit with a colossal €1.2 billion fine ($1.3 billion) for sending data about hundreds of millions of Europeans to the United States, where weaker privacy rules open it up to US snooping.

Ireland’s Data Protection Commission (DPC), the lead regulator for Meta in Europe, issued the fine after years of dispute about how data is transferred across the Atlantic. The decision says a complex legal mechanism, used by thousands of businesses for transferring data between the regions, was not lawful.

The fine is the biggest GDPR penalty ever issued, eclipsing Luxembourg’s $833 million fine against Amazon. It brings the total amount of fines under the legislation to around €4 billion. However, it’s small change for Meta, which made $28 billion in the first three months of this year.

In addition to the fine, the DPC’s ruling gives Meta five months to stop sending data from Europe to the US and six months to stop handling data it previously collected, which could mean deleting photos, videos, and Facebook posts or moving them back to Europe. The decision is likely to bring into focus other GDPR powers, which can impact how companies handle data and arguably cut to the heart of Big Tech’s surveillance capitalism.

Meta says it is “disappointed” by the decision and will appeal. The decision is also likely to heap extra pressure on US and European negotiators who are scrambling to finalize a long-awaited new data-sharing agreement between the two regions that will limit what information US intelligence agencies can get their hands on. A draft decision was agreed to at the end of 2022, with a potential deal being finalized later this year.

“The entire commercial and trade relationship between the EU and the US underpinned by data exchanges may be affected,” says Gabriela Zanfir-Fortuna, vice president of global privacy at Future of Privacy Forum, a nonprofit think tank. “While this decision is addressed to Meta, it is about facts and situations that are identical for all American companies doing business in Europe offering online services, from payments, to cloud, to social media, to electronic communications, or software used in schools and public administrations.”

‘Bittersweet Decision’

The billion-euro fine against Meta has a long history. It stems back to 2013, long before GDPR was in place, when lawyer and privacy activist Max Schrems complained about US intelligence agencies’ ability to access data following the Edward Snowden revelations about the National Security Agency (NSA). Twice since then, Europe’s top courts have struck down US–EU data-sharing systems. The second of these rulings, in 2020, invalidated the Privacy Shield agreement and also tightened rules around “standard contractual clauses” (SCCs).

The use of SCCs, a legal mechanism for transferring data, is at the center of the Meta case. In 2020, Schrems complained about Meta’s use of them to send data to the US. Today’s Irish decision, which is supported by other European regulators, found Meta’s use of the legal tool “did not address the risks to the fundamental rights and freedoms of data subjects.” In short, the transfers were unlawful.