"each bar in this chart is a billion-dollar wedge if you build the right verticalized, trust-rich AI startup".
Greg Isenberg's viral take last week on OpenAI's report about How people are using ChatGPT. I read all 61 pages of the report the old fashioned way, sans LLM. As a startup focused on data intelligence, I admittedly gulped when I saw only 0.4% of use cases relate to data analysis. The counter-narrative becomes quite compelling: focus on what people are NOT using ChatGPT for, rather than going head-to-head. But that's too simplistic. There are two reasons data analysis sits at 0.4% instead of 10%: environment mismatch and data access constraints.

What does the data say?
First, let's look at the broader picture, because both OpenAI and Anthropic have shared a wealth of data in the past 10 days.
OpenAI also released "ChatGPT usage and adoption patterns at work" - this time looking at enterprise data rather than the 1.5M consumer chats from the first study. Analytics departments - the people whose job is literally data analysis - spend just 6% of their time using ChatGPT for analytics work in the first 90 days. They spend 4× more time on writing and how-to queries.

Data analysis barely registers in Anthropic's Economic Index report. It hit 3.5% in their December 2024 Clio report but has since disappeared as a standalone category - at best absorbed into 'Process Business & Recruitment Data' at 1.9%.
This pattern isn't unique to data analysis. Anthropic found that despite serving different users with different interfaces, both API and consumer usage follow remarkably similar power law distributions. The bottom 80% of use case categories account for only 12.7% of usage on Claude.ai, and 10.5% for API customers.
Data analysis sits firmly in that long tail. And there's a compounding effect: as ChatGPT and Claude evolve their UX around top-performing use cases like coding and writing, it creates higher barriers for everything else. The long tail gets longer.
So why did data analysis end up there in the first place?
- First, users try these tools for data analysis and hit fundamental limitations. They are built for chat, not data work. This breaks trust and sends data analytics to the graveyard of abandoned use cases.
- Second, regulatory barriers prevent users from testing with their richest, most interesting data. Instead they experiment with dummy datasets that produce hollow results. They never experience what's possible when LLM superpowers meet real data.
The magic only happens when you solve both problems: LLM superpowers meet real data in environments built specifically for data work.
Chat Interfaces Break Down for Data Work
Incredible chat UX does not translate into incredible data UX.
To test this, I designed a progressive analysis workflow that mirrors real-world data analytics projects: exploratory data analysis → pattern investigation → predictive modeling → live data integration → data generation. Each stage builds on the previous, requiring persistent context and specialized tooling.
I ran identical prompts (detailed in the appendix) across Claude Pro, ChatGPT Pro, and MOSTLY AI Pro using the same census dataset. The goal: see which platforms can handle the analytical workflow that analytics professionals actually need. Below are the results.
Claude crumbled
| Stage | Result | Comment |
| --- | --- | --- |
| 1: EDA | Success | Basic dataset overview completed |
| 2: Pattern analysis | Fail | Artifacts failed to render, visualizations broken |
| 3: ML modelling | Fail | Python execution failed |
| 4: Live data integration | Fail | Couldn't connect to the World Bank API |
| 5: Data generation | Fail | Did not generate enriched columns |
Additional:
- Cannot pip install Python libraries (SHAP and wbdata for the World Bank API are blocked; see the sketch after this list)
- Python code execution unreliable for modeling and visualization
- Requires constant manual nudging to continue analysis
- 40MB file upload limit restricts dataset size
- Thread context frequently lost between stages
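To make that first limitation concrete, here's a minimal sketch of the setup step the workflow needs on day one - the package names are the ones from the test above; everything else is illustrative. In Claude's sandbox the install is blocked, so the imports that follow never succeed:

```python
import subprocess
import sys

# The workflow depends on two packages outside the default sandbox image:
# SHAP for model explainability and wbdata as a World Bank API client.
for package in ["shap", "wbdata"]:
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

import shap    # blocked in Claude's sandbox: the install above never completes
import wbdata  # same failure mode: no package install, no API client
```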
ChatGPT faltered
| Stage | Result | Comment |
| --- | --- | --- |
| 1: EDA | Success | Delivered basic exploratory analysis |
| 2: Pattern analysis | Fail | Generated nonsensical visualizations |
| 3: ML modelling | Success | Built model but no code transparency |
| 4: Live data integration | Fail | Couldn't connect to the World Bank API |
| 5: Data generation | Success | Generated mock data without validation |
Additional:
- Cannot pip install Python libraries (SHAP, wbdata for World Bank API blocked)
- Cannot fetch live files or real-time data sources (see the sketch after this list)
- Consistently dresses shallow analysis in sophisticated language
- Hides computational work, making results unverifiable
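For a sense of how low the bar is here: "fetching a live file" is a single pandas call against a hosted URL. The URL below is a hypothetical placeholder, not an endpoint from the test; in a sandbox without outbound network access, this one call is exactly what fails:

```python
import pandas as pd

# Read a hosted CSV directly over HTTPS; pandas handles the request.
# In ChatGPT's sandbox this raises a connection error: no outbound network.
url = "https://example.com/census.csv"  # hypothetical hosted dataset URL
df = pd.read_csv(url)
print(df.shape)
```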
MOSTLY AI delivered
| Stage | Result | Comment |
| --- | --- | --- |
| 1: EDA | Success | Advanced analysis w/ quality assessment |
| 2: Pattern analysis | Success | Advanced heatmaps and interaction analysis |
| 3: ML modelling | Success | Random Forest w/ confidence intervals & code |
| 4: Live data integration | Success | World Bank API calls and global context |
| 5: Data generation | Success | Generated mock data with realistic correlations |
Additional:
- Full Python environment with pip install (wbdata, SHAP libraries)
- Persistent datasets and context across all 5 stages
- Statistical validation and uncertainty quantification
- Comprehensive PDF report generation with all visualizations
- Live API integration without restrictions
The evidence is clear: when analytics professionals encounter artifacts that won't render, models that can't execute Python, and APIs that don't connect, they abandon AI for data work entirely. This explains the 6% usage rate among the very people who should be power users. Chat interfaces optimized for conversation break down when data work requires persistent context, reproducible methodology, and specialized tooling. It's not about LLM capability; it's about environment design.
The Data Access Problem
But environment mismatch is only half the story. Anthropic's research highlights the role of "O-Ring forces" in the extreme concentration of use cases: "if a task needs a level of reasoning Claude can't handle, internal data the firm can't access, or regulatory approval that doesn't exist, any single barrier could prevent adoption."
For data analysis, the killer barrier is data access. Regulatory and privacy constraints prevent organizations from using their richest, most interesting datasets with their LLMs. Customer behavioral data, financial records, healthcare information - the datasets that would actually demonstrate AI's analytical power are off-limits.
This is where privacy-preserving synthetic data changes everything. Organizations can generate realistic, privacy-safe data that maintains the statistical properties and behavioral patterns of their original datasets. You can run complex behavioral simulations, create coherent multi-table datasets from scratch, and unlock sensitive data for analysis - without the privacy concerns.
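As a toy illustration of the principle - not MOSTLY AI's actual method, which trains generative models on the source data - a mock spending column can be sampled conditionally on an existing income column, so the enriched table preserves the original correlation structure while containing no real customer values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for a sensitive source table (columns are illustrative).
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=1_000),
    "income": rng.lognormal(mean=10.8, sigma=0.5, size=1_000),
})

# Sample a mock spending column conditionally on income: a fixed share of
# income plus multiplicative noise, so the income-spending correlation is
# preserved by construction while every value is synthetic.
df["consumer_spending"] = 0.3 * df["income"] * rng.lognormal(0, 0.2, size=len(df))

# Validate that the synthetic column carries a realistic relationship.
print(df[["income", "consumer_spending"]].corr())
```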
Synthetic data breaks the O-Ring barrier that's kept data analysis in the long tail.
Conclusion
The most interesting question for startups right now is identifying use cases where big labs can't or won't go deep - not because they lack capability, but because the solution requires specialized focus that conflicts with their platform strategy, even when the market opportunity is massive.

Data analysis won't stay at 0.4%. Better models will increase analytical capability. Data-first environments will outperform chat interfaces. And synthetic data will unlock the richest datasets. Together, this will make data intelligence truly accessible - not just for experts but for everyone. Data analysis will become a dominant use case like coding and writing, but specialized startups will capture this market, not general platforms.
Appendix
Adult Census Income dataset available for download here.
Prompts used for data analysis tasks available below.
Stage 1: Exploratory Data Analysis
Perform straightforward exploration of the census dataset. Focus on finding one clear, actionable pattern in the data.
Deliverables:
- Dataset overview (structure, key variables)
- 2-3 simple, clear visualizations
- 30-word description of your most useful finding
Execution note: Use standard exploratory methods. Don't overcomplicate with advanced techniques unless they clearly reveal something basic analysis misses.
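For reference, a minimal sketch of what Stage 1 asks for, assuming the standard UCI Adult schema (the file path and column names are assumptions; adjust to your download):

```python
import pandas as pd

# Load the census dataset (path is a placeholder for the downloaded file).
df = pd.read_csv("adult.csv")

# Dataset overview: structure and key variables.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all").T)

# One candidate for a clear, actionable pattern: income bracket by education.
print(
    df.groupby("education")["income"]
      .value_counts(normalize=True)
      .unstack()
)
```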
Stage 2: Pattern Investigation
Build directly on your Stage 1 finding. Investigate it further using appropriate methods to understand what drives the pattern.
Deliverables:
- 2 focused visualizations that explain your Stage 1 pattern
- 40-word explanation of what causes this pattern
- Brief methodology note
Execution note: Choose methods that best answer "why does this pattern exist?" Use sophisticated techniques only if simpler approaches fail to provide clear answers.
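A sketch of a typical Stage 2 follow-up, continuing from the Stage 1 dataframe `df` (same schema assumptions as above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap over numeric columns to surface what drives
# the Stage 1 pattern.
numeric = df.select_dtypes("number")
sns.heatmap(numeric.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlations among numeric census variables")
plt.tight_layout()
plt.show()
```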
Stage 3: Practical Prediction
Create a working model to predict the outcome you discovered in Stages 1-2. Focus on building something that actually works for the intended purpose.
Deliverables:
- Model performance metrics
- Key predictive factors identified
- 50-word insight about real-world applications
Execution note: Use modeling approaches appropriate for your dataset size and complexity. Prioritize model performance and interpretability over methodological sophistication.
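A sketch of a Stage 3 model in the spirit the test expected - visible code, standard metrics, and uncertainty via a bootstrap. Encoding and column choices are illustrative, not prescriptive:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Continue from the Stage 1 dataframe `df`; predict the income bracket.
X = pd.get_dummies(df.drop(columns=["income"]))
y = df["income"].str.contains(">50K").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# Bootstrap a 95% confidence interval on test accuracy.
rng = np.random.default_rng(0)
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_test), len(y_test))
    scores.append(accuracy_score(y_test.iloc[idx], preds[idx]))
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"accuracy {accuracy_score(y_test, preds):.3f} (95% CI {lo:.3f}-{hi:.3f})")

# Key predictive factors.
importances = sorted(zip(X.columns, model.feature_importances_),
                     key=lambda t: -t[1])
print(importances[:5])
```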
Stage 4: Global Context Integration
Enhance your analysis by integrating current data from World Bank APIs. Compare US patterns with similar developed nations to see if your findings are universal or unique.
Deliverables:
- Analysis combining census data with live World Bank indicators
- Comparison visualization showing US vs. international patterns
- 40-word insight about how global context changes your conclusions
Execution note: Actually fetch live data via API calls. If APIs fail, be transparent about limitations rather than faking integration.
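A sketch of the live integration Stage 4 calls for, using the wbdata client mentioned earlier. NY.GDP.PCAP.CD is a real World Bank indicator code (GDP per capita, current US$); the country set and comparison framing are illustrative:

```python
import wbdata

# Fetch GDP per capita for the US and a set of peer economies.
indicators = {"NY.GDP.PCAP.CD": "gdp_per_capita"}
countries = ["USA", "DEU", "GBR", "CAN", "JPN"]

# Returns a dataframe indexed by (country, date) observations.
gdp = wbdata.get_dataframe(indicators, country=countries)
print(gdp.head(10))
```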
Stage 5: Mock Data Enhancement
Enrich census data with additional mock columns (e.g. consumer spending, debt levels, housing costs) that are statistically consistent with existing census patterns to create a richer testing dataset.
Deliverables:
- Enhanced census dataset with 2 realistic mock columns
- Validation showing synthetic data maintains realistic relationships
- 30-word explanation of mock data applications for testing/demos
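And a sketch of the Stage 5 validation step, assuming an `enhanced` frame that adds two mock columns to the census data, generated along the lines of the earlier toy sketch (the column names are illustrative):

```python
import pandas as pd

# `enhanced` is the census frame plus two mock columns, e.g.
# consumer_spending and debt_level, conditioned on income and age.
# Map the income bracket to 0/1 so it can enter a correlation check.
enhanced["income_num"] = enhanced["income"].str.contains(">50K").astype(int)

# Mock columns should correlate sensibly with the variables they were
# conditioned on, and should not be degenerate (constant or duplicated).
cols = ["income_num", "age", "consumer_spending", "debt_level"]
print(enhanced[cols].corr().round(2))
print(enhanced[["consumer_spending", "debt_level"]].describe())
```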