How is data production different from data engineering?

Data engineering builds and maintains the infrastructure — pipelines, storage, serving layers, APIs. Data production decides what data flows through that infrastructure, for which customers, at what quality, and why. A data engineer ensures the pipeline runs. A data production leader ensures the pipeline produces something that solves a customer's problem.

How is data production different from data science?

Data science analyzes existing data to extract patterns, build models, and generate predictions. Data production decides what data should exist in the first place. Data science works with what's there. Data production determines what should be there — and builds the operational capability to produce it at scale.

Why does data production matter in the age of AI?

AI makes generating data easier, but it makes producing the right data more important, not less. As AI embeds data into more workflows and more decisions, the impact of data quality and relevance increases. Getting any data is becoming a commodity. Getting reliable, trusted, use-case-specific data at scale is what data production delivers, and it is the capability that determines whether data actually drives better decisions.

What are the Five Languages of Data Production?

Data production requires fluency in five distinct domains: business and customers (use cases, jobs to be done), industry expertise (sector-specific data needs), data (connecting decisions to data types), technology (production methods, AI, infrastructure), and leadership (building teams, driving performance). No other function in a data organization bridges all five.

What Is Data Production? The Discipline Every Data Provider Needs

What is data production?

Data production is the operational discipline of producing business-relevant data at scale for business customers. It encompasses deciding what data to produce, how to produce it, and how to organize for it — starting from customer use cases and working forward to the data. Data production sits between data strategy and data engineering and is the core capability of any data provider or Data-as-a-Service business.

Ask a data company what they produce and you’ll get a list of data types. Company data. Market data. Consumer data. Macro data. The answer sounds precise. It is not.

The real question is not what data do you produce. The real question is what customer problem does your data solve. And the distance between those two questions is where most data organizations lose their way.

Here is a pattern I have seen play out repeatedly. An organization builds a data product — say, detailed company profiles. They invest in coverage, depth, accuracy. They build clean pipelines. They launch it. Customers use it. So far, so good.

But the customer’s actual workflow requires company data and market sizing data and competitive landscape data and industry trend data — all connected, all consistent, all serving the same decision. The company profiles are excellent. The use case is half-solved. The customer churns, not because the data was bad, but because no one owned the question of what the customer actually needed to get the job done.

This is a data production problem. And the reason it keeps happening is that most organizations have never built the discipline that starts from the customer’s use case and works forward to the data — rather than starting from the data and hoping it finds a use case.

Exhibit 1

Data production starts from the customer, not from the dataset — the chain that drives every decision

The core logic: understand customers and their use cases first, then derive the data needed

Step 1: Customers. Who are they? What industries? What roles?

Step 2: Use Cases. What decisions do they make? What problems are they trying to solve?

Step 3: Workflows. What processes do they run? What steps?

Step 4 (your output): Data. What data makes these workflows intelligent?

What data production is — and what it isn’t

Data production, as I practice it, is the discipline of producing business-relevant data at scale for business customers. It is the operational capability that sits between understanding what customers need and delivering the data that makes their work better.

This is, in fact, the core capability of any data provider and any Data-as-a-Service business. Yet most data companies do not name it, do not staff it as its own function, and do not build explicit processes around it. They have data engineers. They have data scientists. They have product managers. They have salespeople. But the discipline that connects all of these — the one that decides what data to produce, for whom, at what quality, and why — is either distributed across other roles or simply absent.

The core logic is a chain: you start with customers — who are they, what industries, what roles. You map their use cases — what decisions do they make, what problems are they trying to solve, what impact do they need. You understand their workflows — the actual processes they run, the steps, the handoffs. And from that, you derive the data — what information would make those workflows intelligent, meaningful, and impactful.
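The chain can be sketched as a small data model. This is a minimal illustration under stated assumptions, not a prescribed implementation: the class names, fields, and the example private equity workflow are all hypothetical, chosen only to show that the data list is derived from workflows rather than picked up front.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Workflow:
    name: str
    # Data types that would make this workflow better (hypothetical labels).
    data_needs: list[str]


@dataclass
class UseCase:
    decision: str
    workflows: list[Workflow] = field(default_factory=list)


@dataclass
class Customer:
    segment: str
    industry: str
    role: str
    use_cases: list[UseCase] = field(default_factory=list)


def derive_data_needs(customer: Customer) -> list[str]:
    """Walk customer -> use cases -> workflows and collect data needs.

    Order of first appearance is kept and duplicates are dropped.
    """
    needs: list[str] = []
    for use_case in customer.use_cases:
        for workflow in use_case.workflows:
            for need in workflow.data_needs:
                if need not in needs:
                    needs.append(need)
    return needs


# Illustrative example only; the segment and data types are assumptions.
analyst = Customer(
    segment="private equity",
    industry="financial services",
    role="deal analyst",
    use_cases=[
        UseCase(
            decision="screen acquisition targets",
            workflows=[
                Workflow("build long list", ["company data", "industry classification"]),
                Workflow("assess market", ["market sizing", "company data"]),
            ],
        )
    ],
)

print(derive_data_needs(analyst))
```

The point of the structure is the direction of travel: nothing in it starts from a dataset. The data list only appears as the output of walking from the customer down to the workflow.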

This is not data engineering. Data engineering builds the infrastructure — the pipelines, the storage, the serving layer. Essential work. But a data engineer does not decide what data should flow through the pipeline or why a customer needs it.

This is not data science. Data science analyzes data to extract patterns and predictions. Also essential. But a data scientist works with data that already exists. Data production decides what data should exist.

And this is not producing training data for LLMs. That is a specific and important form of data work, but it is a different discipline with different customers, different quality requirements, and different production economics.

Data production is the layer that connects all of these. It answers the question that none of them own: given what our customers need, what data should we produce, at what quality, at what scale, and for whom?

The Five Languages of Data Production

What makes data production genuinely hard — and genuinely its own discipline — is that it requires fluency in five distinct domains, each with its own vocabulary, its own logic, and its own practitioners. No other function in a data organization bridges all five.

1. Business & Customers. You need to understand customers first: their target groups, their use cases, their problems, their decisions, their jobs to be done, the impact they’re trying to have. You need to speak the language of product managers and business leaders — value propositions, customer segments, retention drivers, competitive positioning. If you can’t articulate why a customer would pay for a specific data type, you have no business producing it.

2. Industry Expertise. Customer use cases are not abstract. A consumer goods company evaluating market entry into Southeast Asia has a fundamentally different data need than a private equity firm screening acquisition targets. You need enough domain knowledge to understand which data points are relevant in which contexts — and enough humility to know when to bring in a specialist. You need to speak the language of industry experts.

3. Data. You need to understand what data drives decisions, what constitutes good data, what data exists in the world, and — this is the creative part — what data could exist if you built the right production methodology. This is not data analytics. It is a creative process of connecting business decisions to data. You look at a customer workflow and ask: what data point, if it existed at this quality and this granularity, would make this decision ten times faster or ten times better? You need to speak the language of data analysts — but with a producer’s mindset, not an analyst’s.

4. Technology. You need to understand how data can be produced — the methods, the tools, the tradeoffs. Modeling, surveying, collecting, scraping, synthesizing. You need to understand what AI makes possible and where it falls short, where human-in-the-loop is necessary, how taxonomy and classification systems work, how production pipelines scale or break. You don’t need to build the infrastructure yourself, but you need to know what’s feasible, what’s expensive, and what’s fragile. You need to speak the language of data engineers and data scientists.

5. Leadership. All of the above means nothing if you can’t build a functioning organization to execute it. You need to hire the right talent — and the talent for data production is rare, because it requires some fluency across all four other languages. You need to provide vision and purpose, because data production work can feel thankless when you’re on iteration fourteen of a taxonomy harmonization project. You need to drive performance across teams that think in very different terms. You need to speak the language of leaders.

I have never met anyone who is equally strong in all five. The best data production leaders I know (and I include myself in this honest assessment) are strong in two or three and compensate in the others through good hiring, good processes, and knowing which questions to ask even when they don’t have the expert answer themselves.

The reason this matters is that organizations routinely assign data production responsibilities to people who speak only one or two of these languages. A product manager who doesn’t understand production economics. A data engineer who doesn’t understand customer use cases. An industry expert who doesn’t understand scalable production. Each will make decisions that optimize for their language and create problems in the others.

Exhibit 2

Data production requires fluency in five distinct languages — no other function bridges all five

Each language has its own vocabulary, its own logic, and its own practitioners. Data production connects them.

The supreme discipline: use case orientation

It is easy to produce data. Pick a data type — company data, for example — and produce it in depth and detail. Cover millions of companies. Add financials, headcounts, industry classifications. Build comprehensive, accurate datasets. This is valuable work, and many companies do it well.

The hard thing — the thing that truly differentiates a powerful data production operation — is understanding multi-data-type customer use cases and providing a relevant share of the data necessary to solve them.

A customer running a competitive analysis doesn’t just need company profiles. They need company data combined with market data combined with consumer trend data combined with industry benchmarks — all consistent in taxonomy, all covering the same geographies, all at the right level of granularity. The customer doesn’t buy data types. They buy the ability to solve a problem.
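One way to make "a relevant share of the data necessary" concrete is a coverage check per use case. The sketch below is an assumption-laden illustration: the `Dataset` fields, the taxonomy labels, and the rule that a data type only counts when geography and taxonomy both match are hypothetical simplifications, not a standard method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    data_type: str
    geographies: frozenset[str]
    taxonomy: str  # name of the classification scheme used (hypothetical)


def use_case_coverage(
    required_types: list[str],
    geography: str,
    taxonomy: str,
    catalog: list[Dataset],
) -> tuple[float, list[str]]:
    """Return (share of required data types covered, missing types).

    A required data type counts as covered only if some dataset of that
    type covers the geography AND uses the shared taxonomy.
    """
    required = set(required_types)
    covered = {
        d.data_type
        for d in catalog
        if d.data_type in required
        and geography in d.geographies
        and d.taxonomy == taxonomy
    }
    missing = sorted(required - covered)
    share = len(covered) / len(required) if required else 0.0
    return share, missing


# Illustrative catalog; all entries are made up for the example.
catalog = [
    Dataset("company data", frozenset({"DE", "FR"}), "taxonomy_a"),
    Dataset("market data", frozenset({"DE"}), "taxonomy_a"),
    Dataset("consumer trend data", frozenset({"DE"}), "taxonomy_b"),
]
required = ["company data", "market data", "consumer trend data", "industry benchmarks"]

share, missing = use_case_coverage(required, "DE", "taxonomy_a", catalog)
print(share, missing)
```

In this example the consumer trend dataset exists but does not count, because it uses a different taxonomy. That is the half-solved use case in miniature: the data type is in the catalog, yet the customer cannot combine it with the rest.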

This is why use case orientation is the supreme discipline of data production. It’s what drives customer impact and customer value. It’s what justifies the data’s price point. And it’s what differentiates a data production operation from any pure AI-generated or commodity data solution.

This matters more now than ever. In the age of AI, getting any data is becoming easy. Large language models can generate company descriptions, synthesize market overviews, and produce plausible-sounding statistics on demand. But getting the right data — reliable, trusted, current, relevant to a specific decision in a specific context — stays hard. It stays hard because it requires understanding which decisions the data serves, what quality threshold the customer’s workflow demands, and how different data types connect to form a coherent picture. An AI model can generate data. It cannot understand that a specific customer segment in a specific industry needs these five data types connected in this way to make this decision.

And here is the paradox: as AI makes data easier to consume and embed in workflows, data is influencing more decisions, in more organizations, more often. The relevance and impact of data are increasing in the age of AI, not decreasing. Which means the stakes for producing the right data, not just any data, are higher than they have ever been. Everyone who relies on data to make decisions must now actively work on ensuring that data is reliable, trusted, and relevant. That is the job of data production.

The understanding of customer use cases — and the operational capability to act on it at scale — is the moat. AI makes data production more powerful and more necessary, not less.

This also forces a strategic question that many data organizations defer for too long: do you sell data, insights, software, or use cases? The answer shapes everything downstream — your product, your pricing, your production priorities, your team structure. I will come back to this question in a later article, but it needs to be named here because it sits at the heart of what data production serves.

Why this matters more in the age of AI, not less

There is a common assumption that AI will commoditize data — that when any model can generate data, the value of producing it disappears. The opposite is true.

AI has made getting any data easy. You can prompt a model and receive company summaries, market overviews, industry analyses. The barrier to generating data has collapsed to nearly zero. But getting the right data — data that is reliable, trusted, current, relevant to a specific decision in a specific industry — remains as hard as ever. Harder, in fact, because the flood of generated data makes it more difficult to distinguish what is trustworthy from what merely looks plausible.

At the same time, data is influencing more decisions than ever before. AI tools are embedding data into workflows that were previously manual — from competitive analysis to investment screening to supply chain planning. As data enters more decisions, the impact of data quality and relevance increases. A wrong data point in a manual report is an embarrassment. A wrong data point flowing through an automated AI workflow at scale is a systemic risk.

This is why data production matters more now, not less. The organizations that actively work on getting the right data — that invest in the production discipline to ensure their output is reliable, relevant, and use-case-specific — will be the ones whose data is trusted. And trust, in a world drowning in generated content, is the scarcest asset of all.

How this unfolds: the decision cascade

If use case orientation is the objective, the question becomes: how do you build the capability to deliver on it?

Data production unfolds as a cascade of interdependent decisions. Each one constrains the next. Skip a step and the downstream decisions are ungrounded.

It starts with who: which customer segments do you serve? Then why: which of their use cases do you prioritize? This gives you a high-level answer to what: which data types — market, consumer, company, macro — you should produce, assessed as a portfolio against strategic, financial, and production attractiveness.
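A portfolio assessment along these three dimensions could be sketched as a simple weighted score. The weighted-sum approach, the 0 to 5 scale, the weights, and the example scores below are all assumptions made for illustration; the actual evaluation framework is the subject of a later article in this series and may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DataTypeCandidate:
    name: str
    strategic: float   # 0-5 judgment score (assumed scale)
    financial: float   # 0-5 judgment score (assumed scale)
    production: float  # 0-5 judgment score (assumed scale)


# Hypothetical weights; chosen here so that they sum to 1.
WEIGHTS = {"strategic": 0.4, "financial": 0.3, "production": 0.3}


def attractiveness(c: DataTypeCandidate, weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of the three attractiveness dimensions."""
    return (
        weights["strategic"] * c.strategic
        + weights["financial"] * c.financial
        + weights["production"] * c.production
    )


def rank_portfolio(
    candidates: list[DataTypeCandidate],
    weights: dict[str, float] = WEIGHTS,
) -> list[DataTypeCandidate]:
    """Return candidates sorted from most to least attractive."""
    return sorted(candidates, key=lambda c: attractiveness(c, weights), reverse=True)


# Made-up scores, for illustration only.
candidates = [
    DataTypeCandidate("market", strategic=5, financial=4, production=2),
    DataTypeCandidate("company", strategic=3, financial=4, production=5),
    DataTypeCandidate("macro", strategic=2, financial=2, production=4),
]

ranked_names = [c.name for c in rank_portfolio(candidates)]
print(ranked_names)
```

A single score hides trade-offs, so a ranking like this is better treated as a starting point for the portfolio discussion than as its conclusion.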

From there, you decide how to organize: what should be the leading dimension of your organization? Industries? Use cases? Functional excellence? Data types? Then how to produce: what production means do you employ? Do you model data, survey it, collect it, synthesize it? For each means, you build production processes — your factory, your machines, your production lines.

With the factory set up, you face a second layer of what: what do you produce with it? Working within your portfolio and production means, you go to the content level — industries, countries, topics. This is not a one-time decision. It is an ongoing prioritization, steered toward serving the most impactful use cases and creating customer value.

Then comes quality: how do you measure and improve the output? And the commercial connection: how do you stay close to customers and their evolving needs? How do you package and price what you produce? How do you work with Product and Sales to get it to market?

Each of these decisions gets its own article in this series. But the cascade matters here, in the first article, because it shows why data production is not a single skill or a single function — it is a chain of decisions, each requiring a different one of the five languages, each affecting every other.
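The idea that skipping a step leaves later decisions ungrounded can be expressed as a small ordering check. The step identifiers below paraphrase the cascade and are illustrative names only; treating the cascade as a strict linear order is a simplifying assumption.

```python
from __future__ import annotations

# Step names paraphrase the cascade; the identifiers are hypothetical.
CASCADE = [
    "who",         # which customer segments
    "why",         # which use cases
    "what",        # which data types (portfolio)
    "organize",    # leading dimension of the organization
    "how",         # production means and processes
    "content",     # industries, countries, topics
    "quality",     # measuring and improving output
    "commercial",  # packaging, pricing, go-to-market
]


def ungrounded_steps(decided: set[str]) -> list[str]:
    """Return decided steps that come after at least one undecided step."""
    ungrounded = []
    gap_seen = False
    for step in CASCADE:
        if step not in decided:
            gap_seen = True
        elif gap_seen:
            ungrounded.append(step)
    return ungrounded


# "why" was skipped, so everything decided after it is flagged.
flagged = ungrounded_steps({"who", "what", "how"})
print(flagged)
```

Because the steps are revisited iteratively in practice, a check like this works as a prompt for discussion rather than as a gate.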

Exhibit 3

Data production unfolds as a cascade of interdependent decisions — each one constraining the next

The sequence matters. Skip a step and the downstream decisions are ungrounded.

Who? (Part 2): Define target groups. Which customer segments, which industries, which roles?

Why? (Part 2): Prioritize use cases. Which customer problems, decisions, and workflows to serve?

What? (Part 2): Shape the data portfolio. Which data types to produce? Evaluate strategic, financial, and production attractiveness.

Who builds? (Part 3): Organize for production. Team structure, leading dimensions, talent, capability gaps.

How? (Part 3): Build the production engine. Production means, methods, AI-enabled tooling, production lines.

What, specifically? (Part 3): Steer content production. Industries, countries, topics: ongoing prioritization within the portfolio.

How good? (Part 4): Measure & improve quality. Quality frameworks, feedback loops, customer signal collection.

What to sell? (Part 4): Connect to commercial outcomes. Data vs. insights vs. software. Products, pricing, packaging, Sales.

Note: This cascade is not a waterfall. In practice, you iterate between steps continuously. But the logic flows top-down: you can't decide what to produce without knowing who you produce for.

What’s next

This article defines data production and introduces two frameworks: the Five Languages and the Decision Cascade. The rest of this series goes deeper into each step.

Next up: The supreme discipline in practice — why customer and use case orientation is the single most powerful differentiator in data production, and what it looks like to organize an entire operation around it.

The Full Series

Data Production: From Customer to Market

Part 1: What & Why

1. What 'Data Production' Actually Means (you are here)

The discipline, the five languages, why it's its own function.

2. The Supreme Discipline: Customer & Use Case Orientation (next)

Why the differentiator isn't data depth — it's use case coverage.

Part 2: The Strategic Decisions

3. Who Do You Produce For?

Defining target groups, mapping their use cases, and making the hard call on which ones to serve.

4. The Data Portfolio: Deciding What to Produce

An evaluation framework across strategic, financial, and production attractiveness.

Part 3: Building the Engine

5. Organizing for Production

Leading dimensions: industries, use cases, functional excellence, or data types?

6. The Production Engine: Means, Methods, and AI

How to actually produce data. AI-enabled tooling. Where humans stay in the loop.

7. Content Steering: What to Produce, and When

The ongoing prioritization within your portfolio.

Part 4: Running & Selling

8. Quality at Scale

How to measure and improve production quality.

9. Staying Close to Customers

How to collect the right signals to continuously inform production steering.

10. Data, Insights, or Software? Deciding What You Sell

The strategic choice that shapes everything downstream.

11. From Production to Market

Customer-facing products, pricing models, packaging, and Sales collaboration.

Common Objections

“Isn’t this just product management for data?”

Product management defines what to build based on user needs. Data production decides what to produce — under capacity constraints, across a multi-dimensional portfolio, balancing strategic fit, financial attractiveness, and production feasibility simultaneously. A product manager defines the what. A data production leader defines the what, the how, the quality tier, and the cost trade-off, all against a fixed capacity ceiling. In organizations that have both, they work in tandem. But they are not the same function, and conflating them is how the data portfolio ends up driven by the loudest customer request rather than a disciplined allocation.

“We have data engineers who make these decisions.”

Some do — and they’re the ones who end up becoming data production leaders. But in most organizations, engineers are measured on pipeline uptime and infrastructure cost, not on customer use case coverage or portfolio balance. Asking a data engineer to decide which data types to produce for which customer segments is like asking a factory floor manager to set product strategy. Some are brilliant at it. But it’s not their job, and the organization hasn’t built the process, the data, or the decision-making cadence to support it.

“AI is going to replace this. You can generate any data now.”

AI can generate data. That part is true, and it is getting easier every month. But generating data and producing the right data are fundamentally different things. A large language model can produce a plausible market overview on demand. It cannot understand that a private equity firm evaluating a mid-market industrial target in Germany needs company financials cross-referenced with industry growth data, management profiles, and regional labor market trends, all at the same level of granularity, all consistent in taxonomy, all current, all trustworthy enough to base a multi-million-euro decision on. Here is the paradox: as AI makes data easier to consume and embed in workflows, data is influencing more decisions, in more organizations, more often. The stakes for data quality and relevance are rising, not falling. AI makes data production more powerful and more necessary, not less.