⚙️ The dark side of scaling generative AI datasets

Good morning, and welcome to your short week.

Today, we’re talking all about scale and AI, breaking down the argument that scale is “all you need” and exploring some of the impacts of this scaling mentality.

— Ian Krietzberg, Editor-in-Chief, The Deep View

AI for Good: Listening to the humpback whales

Source: NOAA

All whales use sound to communicate. 

Advances in hardware have recently made it far easier for scientists to listen in on these underwater conversations through long-range acoustic recordings, so data collection itself wasn’t the problem. The problem was data analysis, with a 2021 study saying that “discoveries are now limited by the time required to analyze rather than collect the data.”

So they turned to AI: In 2021, scientists at NOAA partnered with Google to train a deep-learning convolutional neural network (CNN) on 187,000 hours of acoustic whale data that had been collected over the past 14 years. 

  • Male humpback whale songs have traditionally posed a challenge to automated tracking since they are long, complex and change every year. 

  • This model, however, was capable of identifying male humpback whale songs from several different populations over a 13-year timeline. 

Why it matters: We’ve talked before about scientists leveraging AI as a stethoscope with which to study marine life; here it is again, in action. Scientists are able to study the automated insights from these models to determine the movements and health of humpback whale populations. 

The truth about scaling laws

Source: Twitter

In a recent interview, Dario Amodei — the CEO of Anthropic — said that he believes artificial general intelligence is one to three years away. He believes that catastrophic AI risk could be one to three years away, as well.

  • The basis for this belief has to do with his impression that “like a human child learning and developing, (AI is) getting better and better.”

Aside from the fact that AI models do not learn like human children, what Amodei is essentially talking about is scaling laws. But this view that language models will keep getting better and better until we reach AGI “rests on a series of myths and misconceptions,” computer science professor Dr. Arvind Narayanan and Ph.D. candidate Sayash Kapoor argued in a recent post.

The details: Scaling laws show that as models increase in size, they do get “better.” But Narayanan and Kapoor said that scaling laws only “quantify the decrease in perplexity,” which is essentially a measure of how well a model predicts the next word. They do not, however, quantify any emergent capabilities of that model.

  • They added that there is no evidence that the increase of these emergent capabilities alongside model size will continue indefinitely.

  • “If LLMs can't do much beyond what's seen in training, at some point, having more data no longer helps because all the tasks that are ever going to be represented in it are already represented,” they said. “Every traditional machine learning model eventually plateaus; maybe LLMs are no different.”

The timing of that plateau, they said, is just going to be hard to predict. 
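
To make “perplexity” concrete, here is a minimal Python sketch of the standard definition: the exponential of the average negative log-probability a model assigns to each correct next word. The function name and the probability values are made up for illustration and do not come from Narayanan and Kapoor’s post or from any real model.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities that two models assign to the same four
# correct next words (illustrative numbers, not measurements).
weaker_model = [0.10, 0.20, 0.05, 0.25]
stronger_model = [0.40, 0.50, 0.30, 0.60]

print(perplexity(weaker_model))    # higher value: worse next-word prediction
print(perplexity(stronger_model))  # lower value: better next-word prediction
```

A lower score only says the model is less “surprised” by the next word. It says nothing directly about whether the model has picked up a new capability, which is the gap the two researchers are pointing to.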

Amodei recently told Time that it would be “good for the world” if the effects of scaling laws stopped, as it would “restrain everyone at the same time.” 

  • Figma will opt users into AI training by default (VentureBeat).

  • Fame, Feud and Fortune: Inside Billionaire Alexandr Wang’s Relentless Rise in Silicon Valley (The Information).

  • AI deals between Microsoft and OpenAI, Google and Samsung, in EU crosshairs (Reuters).

  • Failure to meet surging data center energy demand will jeopardize economic growth, utility execs warn (CNBC).

  • Exclusive: Microsoft Bing’s censorship in China is even “more extreme” than Chinese companies’ (Rest of World).

Bill Gates says scale is not the ‘big frontier’ in AI 

Source: Next Big Idea Club

In a recent interview, Microsoft co-founder Bill Gates chimed in on the idea of scaling laws and AI, agreeing that scale is not all you need. 

The details: Gates said “we have probably two more turns of the crank on scaling, whereby accessing video data and getting very good at synthetic data, we can scale up probably two more times.”

  • “That’s not the most interesting dimension,” he added. “The most interesting dimension is meta-cognition.” 

  • Meta-cognition — explored in this 2021 research paper — is more focused on better understanding human cognitive mechanisms and attempting to apply them to machines. As Gates described it: “Understanding how to think about a problem in a broad sense and be able to step back” like a human and reflect on it. 

Gates said that today’s AI is “so trivial that it’s just generating through constant computation each token in sequence. It’s mind-blowing that that works at all.” 

Cognitive scientist Dr. Gary Marcus called the clip a “breath of fresh air.” 

Marcus added, however, that he considers the investment required to achieve those “two more turns of the crank” a waste, saying that the industry shouldn’t wait to start exploring meta-cognition. 

The dark side of scaling generative AI datasets 

Source: Waymo

In keeping with today’s theme on scaling laws, recent research from Mozilla found that the rush to continue adding more and more data to training sets is having a (possibly unintended) consequence: The disproportionate scaling of biased outputs. 

The idea, at its basic level, is that if a developer is consuming more data from the internet, there will be steadily larger portions of that data that contain racial, gender and other biases. And since model output is a result of its training data, the chances of biased output go up as well. 

  • These issues of bias exhibited by both text and image models have been well-documented by numerous researchers. 

The researchers behind this particular paper, led by Dr. Abeba Birhane, audited 14 different image models for racial and gender bias using the Chicago Face Database (CFD) as a probe.

The findings: For all models, the researchers found that as they scaled the pre-training data, “Black and Latin groups received higher probabilities of being predicted as ‘criminal’ compared to the other two groups: White and Asian.”

  • The paper found that “the models showed bias against Black and Asian women, and all men from the four racial groups (Black, Asian, Latin and White).”

  • The researchers stated that since the datasets they audited make up the backbone of many popular models, “these models are not purely intellectual exercises but result in direct or indirect impact on actual people, particularly marginalized groups.”

The researchers called for more transparency, open access to training sets and rigorous audits.
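
For readers wondering what an audit like this measures, here is a rough Python sketch of one simple approach: count how often a model attaches a given label to images from each group, then compare the rates. This is a simplified illustration, not the paper’s actual method. The function name, the anonymous group names and the toy records are all hypothetical, and it assumes each image receives a single top label.

```python
from collections import defaultdict

def label_rate_by_group(predictions, target_label):
    """For each group, return the share of images whose top label is target_label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, label in predictions:
        totals[group] += 1
        if label == target_label:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Made-up records for illustration only: (group, top predicted label).
toy_predictions = [
    ("group_a", "human being"), ("group_a", "criminal"),
    ("group_a", "human being"), ("group_a", "human being"),
    ("group_b", "criminal"), ("group_b", "criminal"),
    ("group_b", "human being"), ("group_b", "human being"),
]

rates = label_rate_by_group(toy_predictions, "criminal")
print(rates)
```

Running the same comparison on models trained on smaller and larger datasets is one way to see whether the gap between groups widens as the training data scales.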

The line from this research paper that stands out to me the most is the authors’ assertion that these models are no longer “purely intellectual exercises.” They are being integrated into the real world, and they likely aren’t ready for it. 

  • The shift from the lab to society, where the impact of a failed experiment could now involve real harm against real people, is one that must be better appreciated. Scaling laws and the science behind generative AI factor heavily into that; since these models are “black boxes,” the developers who make and deploy them don’t really know how they work. And knowing how something works is a bit of a prerequisite for ensuring safety. 

At the very least, this research calls out a perspective that I have not often seen in the discourse around scaling laws and LLMs. Largely, as we see above, that discussion is focused on whether more data will result in a better model. 

But what is not often asked is what “better” means, who is defining it as such and who is actually being impacted by these “better” models.

“The AI industry has devoted little attention to the downstream impacts of scaling training datasets — and our research demonstrates this is a dangerous oversight,” Birhane said in a statement.

A poll before you go

Thanks for reading today’s edition of The Deep View!

We’ll see you in the next one.

Your view on AI-generated content:

For around 35% of you, it’s “complicated.” But close to 20% of you said if it’s made with AI, it’s not for you. The same share of you have no problem with consuming AI-generated content.

It’s complicated: 

  • “Depends on the content. I think a human should remain in the loop unless it is explicit everything is AI-generated without human review. Some content should remain overseen by a human — anything with the intent to deliver ‘facts.’”

  • “As long as it is quality content and useful it’s ok.”

Do you think AI deployment is moving too fast?