Perplexity Responds To Reddit Lawsuit Over Data Access

Reddit has sued Perplexity AI for allegedly scraping Reddit content through Google results. Perplexity denies training on Reddit data, claiming it only summarizes posts with citations.

In October 2025, Reddit, the community-driven discussion platform, filed a federal lawsuit against AI startup Perplexity AI and three data-scraping firms. The core of the dispute: whether Reddit’s publicly-available posts can be harvested indirectly via Google search results and used for commercial AI purposes, without a licensing agreement or Reddit’s permission. This article provides a detailed, accessible explanation of the background, claims, defenses, stakes, and broader implications of the litigation.


Background: Why Reddit’s Data Matters

Reddit’s Position

Reddit hosts hundreds of millions of user-generated posts across thousands of subreddits. Its content spans deep discussions, advice, personal stories, niche communities — a rich, dynamic corpus of human language and interaction. In recent years, Reddit has negotiated licensing deals with major AI and tech firms (for example OpenAI and Google LLC) that give those firms access to Reddit’s data under terms agreed by Reddit. (Adweek)
From Reddit’s perspective, its data is an asset: the company bears the cost of maintaining the platform, hosts user content that contributors expect to be used respectfully, and has an interest in monetising that content on its own terms.

The AI Industry’s Need for Data

At the same time, AI-driven “answer engines” and other large-language-model (LLM)-powered systems rely on vast amounts of human-generated text to train or refine their responses. Reddit’s data is attractive because it reflects real conversation, varied topics, and nuance. One recent article described Reddit’s content as one of “the largest and most dynamic collections of human conversation ever created.” (Business Insider)
Thus, there is both commercial incentive and competitive pressure in the AI industry to obtain large, high-quality data sets — sometimes via licensed deals, sometimes via more opaque methods.

Past Precedent: Reddit’s Earlier Case

This is not Reddit’s first foray into litigation over AI/data use. In June 2025, Reddit sued Anthropic, the AI firm behind the Claude chatbot, alleging unauthorized access to Reddit’s content. (TechCrunch)
That prior case is relevant because it shows Reddit’s intent to protect its data and set precedent for how public-forum content may be used by AI firms.


What the Lawsuit Alleges

Defendants Named

In its complaint filed in U.S. District Court for the Southern District of New York, Reddit named:

  • Perplexity AI (the AI “answer engine”/search startup) (Adweek)
  • Oxylabs UAB (Lithuania-based data-scraping/proxy firm) (Built In)
  • AWMProxy (identified by Reddit as a “former Russian botnet”) (Built In)
  • SerpApi (Texas-based startup offering Google Search results scraping) (Search Engine Land)

Key Allegations

The complaint makes several central claims:

  1. “Industrial-scale” scraping: Reddit asserts the defendants circumvented its access controls, disguised their crawling activity (via proxies, masked identities, rotating IPs) and scraped Reddit content at very large scale. (AI Business)
  2. Via Google search results: Rather than directly scraping Reddit’s site (which may have anti-scraping protections), the complaint alleges these firms scraped Reddit-originated content indirectly from Google Search Engine Results Pages (SERPs). Reddit claims it inserted a test post visible only to Google’s crawler and shortly thereafter saw the content appear in Perplexity’s “answer engine,” indicating SERP scraping. (Search Engine Land)
  3. Violation of cease-and-desist & licensing norms: Reddit says it sent a cease-and-desist letter to Perplexity in May 2024 demanding that it stop scraping Reddit’s data unless licensed. According to the suit, the volume of Reddit citations in Perplexity’s results increased forty-fold after the warning. (Business Insider)
  4. Unjust enrichment and unfair competition: Reddit claims Perplexity’s business model — using Reddit’s content without authorization — has resulted in commercial benefit for Perplexity while Reddit (and its users) are uncompensated. (The Verge)
  5. Copyright law & access control: The suit contends Reddit’s content is copyrighted (or at least subject to policy protections) and the defendants’ conduct amounts to circumventing technological and contractual protections. (PBS)
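
The test post described in allegation 2 resembles what is often called a canary or honeytoken: a unique marker is planted where only one party should see it, and any later appearance elsewhere suggests which path the content took. The following is a minimal sketch of that general idea, not Reddit’s actual method. The function names and the marker format are hypothetical, chosen only for illustration.

```python
import secrets


def make_canary(prefix: str = "canary") -> str:
    """Return a unique, hard-to-guess marker to embed in a test post.

    The prefix and format are illustrative assumptions.
    """
    return f"{prefix}-{secrets.token_hex(8)}"


def canary_leaked(canary: str, response_text: str) -> bool:
    """Report whether the marker appears in text returned by a third-party tool."""
    return canary.lower() in response_text.lower()


marker = make_canary()
```

A match on its own shows only that the marker surfaced in a tool’s output. What that implies about how the content was obtained is a factual question the court would still have to resolve.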

Reddit’s View of Harm

In its filings Reddit points to several harms:

  • Loss of licensing revenue and undermined value of its data assets. (AI Business)
  • Increased costs for anti-scraping and protective measures. (Business Insider)
  • Undermining of Reddit’s business model and user trust (users may expect their posts not to be commercialised without permission).
  • Competitive disadvantage relative to licensed partners (like Google/OpenAI) if unlicensed actors freely use Reddit’s content.

Remedies Sought

Reddit is asking for:

  • Monetary damages (though unspecified amount). (Reuters)
  • A permanent injunction preventing the defendants from using, distributing or selling previously scraped Reddit content. (Search Engine Land)
  • Recognition of liability and deterrence of unauthorised data scraping in this domain.

Perplexity’s Response & Position

Summary of Perplexity’s Claim

Perplexity has publicly responded (via spokesperson Jesse Dwyer) that:

  • They “do not train foundational AI models using Reddit content.” (Business Insider)
  • Their “answer engine” summarizes discussions, provides citations (including Reddit threads), but is not built by ingesting Reddit posts wholesale for training. (Adweek)
  • They defend the principle of open access to knowledge, stating they will “fight vigorously for users’ rights to freely and fairly access public knowledge.” (Business Insider)
  • They view Reddit’s lawsuit as a threat to openness of the internet or as an attempt to “extort” licensing fees from the open-web. (Ars Technica)

Points of Contention

  • Perplexity denies training on Reddit posts, but Reddit alleges a test post (visible only via Google) appeared in Perplexity’s results, suggesting ingestion. (The Verge)
  • Perplexity says it respects robots.txt and limits its use to summaries and citations; Reddit claims Perplexity circumvented its protections. (Business Insider)
  • Perplexity claims its citations to Reddit threads are akin to anyone linking or referencing Reddit content — a fair use of publicly-available posts — while Reddit contends the volume and method exceed ordinary referencing.

Legal and Technical Issues at Play

Scraping vs. Licensing

One of the key legal issues is when accessing publicly available web content requires a license and when it becomes unauthorised use. Reddit argues that although posts are publicly viewable, it imposes access controls and licenses its data, so automated large-scale scraping without permission is actionable. Perplexity argues that publicly available content (e.g., accessible via Google) may be summarised and cited.

Circumvention of Protections

Reddit’s complaint emphasises that the defendants used deceptive techniques (IP rotation, disguising crawlers) to bypass access controls and evade robots.txt rules. Technically, a crawler that mimics human browsing or routes traffic through proxies may escape detection, which raises the question of whether such access is unauthorised. (AI Business)
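
For readers unfamiliar with robots.txt, the sketch below shows how a well-behaved crawler might consult it before fetching a page, using Python’s standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, and the example.com URLs are invented for illustration and are not Reddit’s actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named bot is blocked entirely,
# every other agent is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/r/test")
open_page = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/r/test")
private_page = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/x")
```

robots.txt is a voluntary convention: nothing in the protocol stops a crawler from skipping the check or presenting a different user agent. That is why the dispute turns on whether evading such signals amounts to unauthorised access.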

Copyright, Contract & Unfair Competition Law

  • Reddit’s argument includes copyright law (unauthorised copying of its posts). (Engadget)
  • Also, Reddit cites unfair competition/unjust enrichment — Perplexity allegedly profiting from Reddit content without compensation. (The Verge)
  • There may also be contract law issues if Reddit’s Terms of Service prohibit automated scraping or reuse of content, though that depends on Reddit’s user agreements and enforceability.

The Role of Search Engines & “Zero-Click” Results

A technical nuance: the data appears to have been accessed not directly from Reddit’s site in the usual way, but via Google’s search results pages (SERPs) that include Reddit content links or snippets. Reddit alleges the defendants scraped those SERPs rather than Reddit itself. That raises novel issues about whether scraping search results (which may lead to Reddit content) is the same as scraping Reddit directly. (Search Engine Land)

Fair Use or Transformative Use

Perplexity might argue its use is “transformative” because it summarises and cites Reddit threads rather than reproducing them, which under U.S. copyright law may strengthen a fair-use defense. Reddit may counter that the scale, method, and lack of permission take the use outside fair use. The legal boundary here is yet to be fully defined.


Why This Case Matters

For Reddit and Platforms

  • Protecting data-monetisation: Reddit is signaling that its user-content is a monetisable asset, not simply freely reusable by any AI company.
  • Shifting business models: Platforms hosting large user-generated corpora may demand licensing or control over how their content is used.
  • Precedent: This case may set important precedent for how large-scale scrapers and AI systems treat public-forum content.

For AI Companies & Data Access

  • Access constraints: AI firms cannot assume publicly visible = freely usable at scale for commercial benefit.
  • Increased compliance risk: The litigation risk of scraping and re-using large volumes of third-party content is rising.
  • Business models challenged: If large datasets cannot be accessed cheaply/unilaterally, firms may have to pay licensing fees or rely more heavily on proprietary data.

For Internet Ecosystem, Search & Knowledge Tools

  • Google’s role: When search results themselves become data sources for AI, questions about ownership of those derived data sets arise.
  • Zero-click search: If AI answer engines rely on summarising publicly-available forum posts behind search results, websites may lose traffic and advertising revenue.
  • Openness vs proprietary: The case raises tension between open-internet ideals (free access to public knowledge) and protecting creator/host rights.

For Users

  • Privacy and consent: Even though Reddit posts are public, users might not expect industrial-scale harvesting for commercial AI without compensation or notice.
  • Ownership of user-generated content: The case highlights broader concerns about how user-content is repurposed by third parties.

Timeline of Key Events

Date | Event
May 2024 | Reddit sends cease-and-desist to Perplexity demanding data scraping stop unless licensed. (Business Insider)
Oct 22, 2025 | Reddit files lawsuit in federal court in New York against Perplexity and three data-scraping firms. (MediaPost)
Oct 22, 2025 | News outlets report that Reddit claims Perplexity’s citations of Reddit content increased ~40-fold after the warning. (Business Insider)
Ongoing | Perplexity plans to defend the case; Reddit pursues damages and injunction.

Potential Outcomes and Scenarios

Outcome A: Reddit Prevails

  • Court finds that Perplexity (and/or its scraping providers) breached Reddit’s rights, awarding damages and issuing injunctions.
  • That would strengthen platforms’ bargaining power over data licensing; AI firms may face more friction and higher costs accessing forum-based content.
  • Possibly enforcement action or industry shift to more formalised licensing for data previously considered freely accessible.

Outcome B: Perplexity (or Defendants) Prevail

  • If Perplexity convinces the court it only summarised and cited Reddit threads (fair use) and didn’t infringe, then scraping via SERPs may be treated as permissible.
  • Could limit platforms’ ability to assert control over publicly-viewable content; might favour open-data access and summarisation business models.
  • Might lead to increased reliance on public forum content by AI firms, with fewer licensing constraints.

Outcome C: Settlement-Driven or Hybrid

  • The parties may settle: Perplexity might agree to some licensing or revenue-share with Reddit, or adjust business practices (e.g., limit citations, pay a fee).
  • A hybrid outcome might emerge where scraping remains possible but with clearer rules, transparency obligations, or revenue sharing.

Broader Legal and Industry Implications

Data Licensing & Monetisation

Reddit’s case highlights an emerging model in which user-generated content platforms negotiate licensing deals (rather than relying solely on ad revenue). Reddit has existing contracts with Google and OpenAI. (Barron's) Platforms will increasingly ask: “Who is using our data? Under what terms? Are we compensated?”

AI Training Practices Under Scrutiny

AI companies are under increasing pressure regarding how they obtain training data — from ethics, fairness, copyright, and business-model perspectives. The notion of “free for the taking” publicly available web content is being questioned.

Search Engines, Zero-Click and Traffic Models

As AI answer engines proliferate, users may not click through to original sites (zero-click), reducing traffic to forums and blogs. Platforms like Reddit may see diminished ad revenue, prompting them to restrict or monetise access in different ways.

Legal Precedent around Web Scraping

Courts may further define the boundary between “scraping” (automated harvesting) and “linking/summarising”. The role of access controls, robots.txt, and technological restrictions may become more central to future disputes. The filter vs crawler debate may intensify.

User-Generated Content: Ownership and Expectations

From the user perspective, issues of consent, monetisation, reuse and transparency are important. If my Reddit post is publicly visible, can someone commercialise it in bulk? Platforms may need clearer disclosures or user-consent mechanisms.


Key Questions to Watch

  • Will the court treat scraping via SERPs (Google results) as the same as direct scraping of a site? Reddit’s “marked post” test is a novel piece of evidence. (Search Engine Land)
  • How will the court interpret “fair use” in this context? Summary and citation vs ingestion and training are central distinctions.
  • Will platforms be considered to have a licensing business model for what was once “free” web content?
  • What obligations (if any) will AI firms have to disclose their data-sources and data-licensing status?
  • Will user-generated content become more restricted or monetised by platforms?

Arguments & Counterarguments (Simplified)

Reddit’s Arguments

  • Reddit’s posts, while publicly viewable, are the product of its investment and community; it has rights and business interest.
  • The defendants bypassed Reddit’s protections via proxies, bots and disguised scraping — not legitimate access.
  • The large-scale nature of scraping differentiates commercial-grade ingestion from regular browsing/reference.
  • The defendants’ conduct has harmed Reddit’s business and undermined its licensing model.

Perplexity’s (and similar firms’) Counterarguments

  • Reddit’s posts are publicly accessible; summarising and citing them falls under fair use and linking norms.
  • Preventing summarisation undermines the open internet and public access to knowledge.
  • AI systems routinely aggregate publicly-available content; singling out Reddit is unfair.
  • If Reddit wants compensation, it should negotiate with every summarisation engine; but that would dramatically restrict knowledge flows.

What This Means for You (User, Developer, Platform)

  • As a user: Be aware that the posts you share publicly may be harvested by AI systems or summarised for third-party use — even if indirectly. It raises questions of consent, credit and compensation.
  • As a developer/AI practitioner: Getting large-scale web text is not necessarily free of legal risk. Review your data-acquisition practices, licensing status, and whether your system respects site policies.
  • As a platform: If you host user-generated content, consider whether your terms of service, licensing strategy and access controls are aligned with your business model. If your data is used commercially by others, you may want to assert more control or compensation.

Limitations and Uncertainties

  • Lawsuits take time: The case will unfold over months or years, and the legal precedent may evolve.
  • Much rests on factual details: exactly how Perplexity obtained Reddit content, how many posts were ingested, what was done with them, etc. Reddit’s evidence (e.g., the “marked post”) is compelling, but still must be proven in court.
  • The area of AI training data is evolving: legal frameworks (copyright law, contract law, data protection) are still catching up with technology.
  • A settlement could mean fewer clarifying judicial opinions, leaving ambiguity.
  • The outcome may differ across jurisdictions; while this case is U.S.-based, global AI/data operations will pay attention.

Final Thoughts

The lawsuit by Reddit against Perplexity and associated data scrapers underscores a shift in the digital economy: the explosion of AI systems has made “publicly-available” web content far more commercially valuable and contested. What once might have been viewed as fair game for crawling, scraping or “just linking to” is now being viewed by platforms as an asset that must be protected, licensed or monetised.

For AI firms, the message is clear: scale matters. It’s one thing to link or summarise a few threads; it’s another to ingest millions of posts for model training without permission. For platforms, the message is also clear: your data may have value beyond advertising or direct user engagement, and you may need to assert that value. For users, this is a reminder that public visibility does not always mean free commercial reuse without recognition or compensation.

How the courts interpret this case — especially with regard to scraping via SERPs, summarisation vs ingestion, and automated large-scale data harvesting — may set important precedents for years to come. Watching this case will give insight into the evolving balance between openness of the internet, rights of content-hosts, and the commercial ambitions of AI.

Frequently Asked Questions (FAQs)

Q1. Why did Reddit sue Perplexity AI?
Reddit sued Perplexity AI for allegedly bypassing protections and scraping Reddit content via Google search results. Reddit claims the company used its data without authorization for commercial purposes.

Q2. What does Perplexity AI say in its defense?
Perplexity AI denies training its models on Reddit content. It says its system only summarizes posts with proper citations, which it regards as fair use that does not violate Reddit’s terms.

Q3. How did Reddit discover the alleged scraping?
Reddit claims it created a test post visible only to Google’s crawler, which later appeared in Perplexity’s responses—suggesting Perplexity’s access came via Google SERPs, not direct API use.

Q4. Who else is involved in the lawsuit?
Along with Perplexity, Reddit’s lawsuit names Oxylabs, AWMProxy, and SerpApi as co-defendants, accusing them of aiding large-scale data scraping operations.

Q5. What could happen if Reddit wins the case?
If Reddit prevails, AI companies might face tighter restrictions on data scraping, higher licensing costs, and clearer legal boundaries for using public content in AI systems.

Q6. Why is this lawsuit significant for the AI industry?
The case could set a major precedent for how AI companies access public data, balancing fair use with content ownership and licensing rights.
