OpenLineage & AI: Why Lineage Matters for Data Agents

This conversation has been edited for length and clarity.

The 30-Second Splash

Ido: Most underrated skill in data?

Harel: Getting people aligned on definitions.

Ido: Dream vacation destination?

Harel: Cook Islands.

Ido: What's going to break most of the data pipelines in your environment tomorrow?

Harel: Anything you can think of.

Ido: One tool you cannot live without?

Harel: Last few months — Claude Code, for sure.

Ido: So you're committing a lot of code to OpenLineage. You're doing nothing — Claude Code commits all the things.

Harel: Exactly. I do zero work.

What is OpenLineage, and why does it exist?

Ido: For the small part of our audience that doesn't know OpenLineage yet, can you give us a quick overview?

Harel: OpenLineage is an open standard for emitting lineage metadata from data pipelines, where the work is actually happening. It's a few things bundled together. The core piece is a JSON specification that describes how you reason about a job, a job run, and the inputs and outputs of that job.

It also includes a bunch of clients in various languages (Python, Go, Java) for emitting OpenLineage events, plus out-of-the-box integrations for platforms like Spark and dbt. For Apache Airflow, the implementation is actually built directly into the project itself. A few other open-source projects have decided to build in OpenLineage support the same way.

We're incorporated under the Linux Foundation's Data and AI division. And the reason it matters is simple: no one wants to write boilerplate integration code, even if it's "cheap" to do today with Claude. It's not really cheap — you still have to maintain it, evolve it, write point-to-point solutions for every single thing. Isn't it nicer to just have a spec that does this and everyone contributes to? That's really what OpenLineage is about.
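
To make the job / run / inputs / outputs model Harel describes a little more concrete, here is a minimal sketch of what a run event could look like, built with only the Python standard library. The field names follow the OpenLineage run event shape, but the namespaces, job name, dataset names, producer URL, and the spec version in `schemaURL` are illustrative assumptions and do not come from the conversation. In practice you would more likely use one of the client libraries than build events by hand.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, run_id, job_name, inputs, outputs):
    """Build an OpenLineage-style run event as a plain dict (illustrative only)."""

    def dataset(name):
        # "example-warehouse" is a hypothetical dataset namespace.
        return {"namespace": "example-warehouse", "name": name}

    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URI identifying the code that emitted the event.
        "producer": "https://example.com/hypothetical-producer",
        # The version in this URL is an assumption; check the published spec.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": run_id},
        "job": {"namespace": "example-pipelines", "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    }


# One run of one job: the START and COMPLETE events share the same runId.
run_id = str(uuid.uuid4())
start = build_run_event(
    "START", run_id, "daily_revenue", ["raw.orders"], ["analytics.daily_revenue"]
)
complete = build_run_event(
    "COMPLETE", run_id, "daily_revenue", ["raw.orders"], ["analytics.daily_revenue"]
)

print(json.dumps(complete, indent=2))
```

The point of the shared shape is that any consumer can read events from any producer without point-to-point integration code.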

How is AI affecting open-source maintainership?

Ido: Before we go deeper into OpenLineage, you're someone who's deep in open source. How is AI affecting how open source gets managed?

Harel: Some open-source projects have it harder than others. I know maintainers who are dealing with a huge stream of vibe-coded PRs — people pushing pull requests directly into projects without really understanding the code or what they're trying to do, just trying to get an open-source contribution on their checklist.

That creates real pressure on the maintainer. Anyone can drive by and post a PR, and now it's on you to figure out what they were trying to do, what they care about, and how much time you should spend unpacking it. Open-source maintainers aren't omnipotent. We're regular humans, and we don't necessarily get paid for it.

We've seen some of this in OpenLineage too. A couple of months ago we shipped an agents.md file in the repo to guide the coding agents that contributors are using. The first rule: do not ever make a change to the spec unless you know what you're doing. If you're evolving something, make sure there are no other consumers of it. Maybe the thing you're trying to change is already represented in the spec. We're trying to guide the agents in how they reason about the project and how they draft their pull requests.

That said, AI has been useful on the maintainer side too. We've been pushing a change for more explicit table-to-table lineage in the spec, and getting test cases drafted has been much easier: you can point an agent at the full spec and have it generate cases. I had a conversation with friends at OpenAI last week, and one thing they pointed out is that coding agents are really good at the data models of open-source projects. They're trained on this stuff. So agents understand open source way better than they understand closed source, and that works to our benefit.

What are the main real-world use cases for OpenLineage today?

Ido: Lineage is one of those terms people just drop into conversation: "I need lineage." Where do you see it actually being used?

Harel: Sometimes the answer to "why do you need lineage?" is just that people like looking at graphs of their pipelines — and that's not nothing. It instills trust. It tells people that you understand and capture the semantics of their pipelines, that their mental model exists in your product. I saw the same thing when I worked on Datadog's APM product: customers want to know you've captured their services and understood the microservice architecture they have. That alone gives them confidence that you understand what's going on.

The use cases I really care about are operational. When something goes wrong, lineage helps you understand quickly: how did it break, why did it break, what's the downstream impact, how severe is it? Lineage is also incredibly useful for propagating data quality issues — telling you whether something is wrong upstream of a critical dashboard or reporting table.

The other top category is compliance and governance: PII tracking across organizations, cost attribution, and dev-loop workflows where someone is about to push a change that would break downstream consumers. Operations and governance are the two big buckets.
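
The operational questions above (what is the downstream impact, is something wrong upstream of a critical dashboard) come down to walking a graph. The sketch below assumes lineage has already been collected into a simple list of jobs with inputs and outputs. The job and table names are hypothetical, and this is one possible way to do the traversal, not how any particular lineage backend implements it.

```python
from collections import defaultdict, deque

# Hypothetical lineage facts, e.g. distilled from collected run events.
JOBS = [
    {"job": "ingest_orders", "inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    {
        "job": "build_revenue",
        "inputs": ["staging.orders", "staging.customers"],
        "outputs": ["analytics.daily_revenue"],
    },
    {
        "job": "refresh_dashboard",
        "inputs": ["analytics.daily_revenue"],
        "outputs": ["bi.revenue_dashboard"],
    },
]


def build_edges(jobs):
    """Turn job records into dataset-to-dataset edges in both directions."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for job in jobs:
        for src in job["inputs"]:
            for dst in job["outputs"]:
                downstream[src].add(dst)
                upstream[dst].add(src)
    return downstream, upstream


def reachable(start, edges):
    """Breadth-first search: every dataset reachable from `start` via `edges`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


downstream, upstream = build_edges(JOBS)

# Impact analysis: what is affected if staging.orders breaks?
print(sorted(reachable("staging.orders", downstream)))

# Root-cause candidates: what feeds the dashboard?
print(sorted(reachable("bi.revenue_dashboard", upstream)))
```

The same traversal serves both directions: follow downstream edges for blast radius, upstream edges to find where a data quality issue could have originated.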

How does AI change these use cases?

Ido: How do you see AI changing the lineage picture?

Harel: For me, it's an intensity thing. AI amplifies people — for better or worse. If you knew what you were doing and you're using AI well, you can do a lot more. If you didn't know what you were doing and you're not using AI well, you're going to do a lot more bad things faster. AI just really increases the signal on every interaction a human has with the data.

If someone is stumbling around a data pipeline or a data warehouse, asking a bunch of questions without good guidance, they're going to get a bunch of answers, and those probably aren't the right answers. That makes strong observability and strong context on those interactions critical. From there, you can do interesting things. If a lot of people in your organization are asking the same questions and pulling the same data, maybe you should tune how they interact with agents, create a materialized view, or build a centralized data model and inject that context into how people query the warehouse.

Lineage information helps you understand what's happening in your environment. So for me, it's really about intensity. But I'm curious how you're seeing it.

Ido: From the usage perspective, I agree. But the part I keep coming back to: when agents want to use the data in your platform, lineage is a key part of their ability to do that — because lineage represents the flow of information.

Agents need to know what the data means. They can use a metadata description someone wrote for them, or they can look at the code and the lineage to figure out how the data was produced. When humans use data, they pretty much have the mental model in their heads — they know how the data was created. But agents don't have a head. In order to really understand what the data means, they need to look back at how it was produced and which calculations it passed through.

So lineage becomes fundamentally important to give agents the ability to use data for analytics or any automation you want to do. Lineage will be required in essentially every agentic interaction where an agent wants to understand what it has in the data.

Is this already showing up in the open-source community?

Ido: Are you seeing this shift already affect open-source activity?

Harel: For sure. One thing I'm hearing from companies and people I talk to is that it's much easier now for people to dip their toes in: use an MCP server, interact with the warehouse, self-serve on the data. These are things they wouldn't have done before, but they have more capabilities now, or at least the perception of more power to do this on their own.

The impact is mixed. Sometimes people get it right. In many cases, they don't. You see a VP show up to a meeting with a slick analysis they did themselves — they didn't want to bother anyone, so they just generated it. They walk into the room with their numbers, and the VP of Data is sitting there going, "where did you get these numbers? This doesn't look right."

Now imagine if that analysis came with an appendix: "this data is from this table, which is calculated from this table, calculated from this and that." At least the audience would have a sense of security — okay, this is probably fine, they used qualified tables.

There's a product to be built around this. But the broader point is that the need for that kind of tooling has gone way up, just because it's easier to do — quote unquote easier. Easier to do, not necessarily easier to do right.
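
As a rough idea of what the "appendix" Harel describes could look like, the sketch below renders the upstream chain of a table as an indented list and marks which tables are certified. The upstream map, the set of certified tables, and the function name are all hypothetical; a real product would pull both from a lineage store and a catalog.

```python
# Hypothetical upstream map: table -> tables it is calculated from.
UPSTREAM = {
    "analytics.daily_revenue": ["staging.orders", "staging.customers"],
    "staging.orders": ["raw.orders"],
    "staging.customers": ["raw.customers"],
}

# Hypothetical set of tables the data team has signed off on.
CERTIFIED = {"analytics.daily_revenue", "staging.orders", "staging.customers"}


def provenance_appendix(table, upstream, certified, depth=0, seen=None):
    """Return indented lines describing where `table` comes from."""
    seen = set() if seen is None else seen
    status = "certified" if table in certified else "uncertified"
    lines = [f"{'  ' * depth}- {table} ({status})"]
    if table in seen:
        # Already expanded elsewhere in the appendix; do not repeat its parents.
        return lines
    seen.add(table)
    for parent in upstream.get(table, []):
        lines.extend(
            provenance_appendix(parent, upstream, certified, depth + 1, seen)
        )
    return lines


print("\n".join(provenance_appendix("analytics.daily_revenue", UPSTREAM, CERTIFIED)))
```

Attached to an analysis, a listing like this lets the audience see at a glance whether the numbers came from vetted tables.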

Should lineage include unstructured data?

Ido: A big shift with AI is the ability to access unstructured data. We tend to think of lineage as tables, columns, numbers — but organizations have tons of documents and folders we don't treat as part of the data flow. Should lineage include that?

Harel: Of course it should. But there's a real challenge.

The stochastic nature of large language models breaks an intrinsic part of the OpenLineage contract — or really, of how we reason about data lineage in general. Lineage is for compliance in some cases, and if it's for compliance, it has to be deterministic. You can't get two different results from scanning the same document, or from analyzing the same codebase to infer what it does. If sometimes you're getting different results, that's a real issue.

We need ways to reason about this for unstructured data that go beyond LLM-based heuristics. Maybe there are other ways. I haven't put enough effort into figuring it out myself, but there's a lot of work to be done here.
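
One way to make the determinism requirement testable is to fingerprint the lineage an extractor produces and check that repeated runs over the same document agree. The sketch below is an assumption about how such a check might be written, not part of OpenLineage. The function names and the toy "source -> target" document format are hypothetical, and the rule-based parser stands in for whatever extractor is under test.

```python
import hashlib
import json


def lineage_fingerprint(edges):
    """Order-insensitive SHA-256 fingerprint of a set of (source, target) edges."""
    canonical = sorted({(src, dst) for src, dst in edges})
    payload = json.dumps(canonical, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_reproducible(extract, document, runs=3):
    """True if `extract` yields the same lineage on every run over `document`."""
    fingerprints = {lineage_fingerprint(extract(document)) for _ in range(runs)}
    return len(fingerprints) == 1


def rule_based_extract(document):
    """Toy deterministic extractor: parses lines of the form 'source -> target'."""
    edges = []
    for line in document.splitlines():
        if "->" in line:
            src, dst = (part.strip() for part in line.split("->", 1))
            edges.append((src, dst))
    return edges


doc = "contracts/2024.pdf -> staging.contracts\nstaging.contracts -> analytics.renewals"
print(is_reproducible(rule_based_extract, doc))
```

An LLM-based extractor could be dropped in as `extract`. If the check fails, the output is not yet suitable for the compliance use cases discussed above.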

Ido: That's why we have a community. Anyone here who wants a super interesting and important problem to work on — unstructured data in OpenLineage. It's a real challenge.

Harel: I will definitely review that PR.

If you had an army of engineers, what would you build next?

Ido: If you had an army of engineers — good code, not AI code — what would you build next? Setting unstructured data aside.

Harel: This might sound boring, but I wouldn't deploy them on the OpenLineage project itself. I'd deploy them across each individual data processing technology out there to embed OpenLineage natively. I'd want committers from Spark to add the OpenLineage integration directly into Spark. Same for the long tail of data tools.

The ultimate goal is for the OpenLineage repository to contain just the spec and the client libraries — no integration code at all. That would also help with the maintainer burden we have today.

The second thing — this is something we haven't thought about enough in OpenLineage — is going deeper on machine learning, or model training, for AI. How does the spec work in that space? How do applied scientists and ML engineers use OpenLineage in their work? It's actually pretty similar to data engineering. The Ray community has done some work on capturing OpenLineage from Ray jobs. There's a real need there, and I'd love an army — or at least a battalion — of engineers to work on it.

Where is data engineering going in the next few years?

Ido: Where do you see data engineering going?

Harel: I don't even know where engineering is going in the next few years, let alone data engineering. But I'd say ownership of the business itself becomes more and more important. Understanding what you're actually trying to do.

Focus more on contracts and meaning. Delegate a lot of the execution and the nitty-gritty work to AI and to agents. Focus on what really matters — your company's business, the problems you're trying to solve. Step in when you need to make an executive decision. Because, as we've established, the AI doesn't think. It can provide a bunch of options, but you have to make the call.

I'd still very heavily review a spec proposal from Claude before merging it into OpenLineage, by the way. Especially in open source — everybody can see the code.

Closing takeaways

Ido: A few takeaways I'm leaving with:

Lineage is critical when you try to use agents. Agents have to use lineage to understand what the data means and how to use it. Humans carry the mental model in their heads — agents don't.

We need every data tool to adopt OpenLineage. It's critical that we have one way to look at the entire data flow, in one format, so we can build a real map of how data moves.

If you're looking for an open and important problem to work on, OpenLineage is a great place to start — especially around unstructured data and ML lineage.

Harel: That second point was a bit Lord of the Rings. One spec to rule them all, and in the darkness bind them.

Ido: That needs to be in the description of the project, for sure.

If you enjoyed this conversation, follow the Data Splash podcast. We have a lot more coming. Until next time — keep your data flowing.

Watch the full episode:

YouTube: https://youtu.be/TpbOOvGmbG0?si=h92heHU9axUVJqur
Spotify: https://open.spotify.com/episode/5hV6UgMXeirMtAy948VMet?si=62db22926bf94496

To follow the Data Splash:

YouTube: https://www.youtube.com/playlist?list=PLJboMtNQp-MtIK1v7fn6eA35Ylh2T85IN

Spotify: https://open.spotify.com/show/4MD8LSNbpzW46qeq3qptpr

How OpenLineage Is Becoming Infrastructure for the AI Era — A Conversation with Harel Shein
