The Workflow Authoring Problem

AI can write entire applications overnight and trick restaurant hosts into thinking they're talking to a real person. I've had AI search for jobs on my behalf and browse Craigslist for used cars.

But most of what AI can do isn't accessible to regular people, like my mom and dad. I built a library called WorkflowSkill to see if natural language could change that. Frankly, I'm not sure WorkflowSkill is the answer. But building it taught me things I wasn't expecting, and I wanted to publish what I've learned so far.

Feedback appreciated!

If you're reading this, I would LOVE to hear your feedback! You can highlight text and comment on this post, or reach out elsewhere!

Mom and Dad

Like so many people, I've been playing around to see how I can use AI to automate my work, both personal and professional. You may have seen my article on how I used n8n to process vast numbers of job postings and deliver a highly tailored report to my email every morning. If you haven't tried it: n8n is a visual editor for workflows. With it, you can build and connect nodes that will do work for you in a structured way.

On the other side of the autonomy spectrum, we have people chaining LLM inference together, resulting in agents that do work for you. These are your Cursor, Claude Code, and OpenClaw agents. We've all heard the stories about them writing entire apps overnight, tricking restaurants into thinking they're speaking with a real person, participating in AI social media, and starting their own religions.

Regardless of where your tools land on this spectrum, you can see everyone working towards a clear vision: to automate away the mundane, so we can focus on the most meaningful and impactful parts of our lives.

Still, it's got me thinking about my parents a lot. Sure, a full-stack dev can spin up OpenClaw, edit JSON configs, and tune skills to get it to work for them. But how could a paint store owner and a nurse with no technical background use these tools to better their lives?

I call this the "mom and dad" problem. How can mom and dad use it?

Right now, they can't. The reality is that the vast majority of the population has yet to experience any benefit from AI.

Workflows

When I think about what my parents would actually automate, it's all the same shape. Find and book cheap flights to Arizona. Check Craigslist once an hour for a used car I want to buy and send a message to the seller. Find a nice restaurant with a table for two at 8pm and book it. Pull together a grocery list from this week's meal plan and order it.

These tasks share traits:

  • Structured: the work is predictable and can be defined ahead of time.
  • Multi-step: useful automation chains together multiple tasks.
  • Repetitive: they need to happen on a schedule or in response to a trigger, not just once.
  • Action-oriented: the value comes from doing something (fetching a page, comparing prices, sending an email), not from open-ended reasoning.

I'd call these workflows. They represent the vast majority of what regular people want from AI. Not philosophical conversations. Not creative writing. Just: do this thing for me, reliably, over and over. This is aligned with Anthropic's own definition of workflows in their "Building Effective Agents" guide.

The data backs this up. A 2025 Stanford HAI study surveyed 1,500 U.S. workers across 104 occupations and found that people overwhelmingly want AI to automate repetitive, low-value tasks like scheduling, file maintenance, and routine processing. Of the positive automation responses, 69% cited freeing up time for higher-value work as the primary motivation.

Anthropic's own research on labor market impacts makes the gap even starker. Their "observed exposure" metric compares what AI could theoretically automate against what it actually automates in practice. The gap is enormous. Office and administrative tasks, for example, have over 90% theoretical coverage but a fraction of that in real usage. Computer and math occupations sit at 94% theoretical feasibility but only 33% actual coverage. The capability exists. The accessibility doesn't. That delta is the "mom and dad" problem expressed in data.

The important thing about workflows is that they don't inherently require intelligence. Fetching a web page, parsing HTML, filtering results, sending a notification. These are deterministic operations. You might want an LLM somewhere in the pipeline to evaluate whether a car looks good from its photos. But the orchestration itself? That's just code.
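
To make that concrete, here's a minimal sketch of the deterministic parts of such a pipeline: filtering parsed listings and formatting a notification, with zero inference. The `Listing` shape, the sample data, and the budget threshold are all illustrative assumptions of mine, not WorkflowSkill code.

```typescript
// Sketch of deterministic orchestration: filter listings, format a report.
// The Listing shape, sample data, and threshold are illustrative assumptions.

interface Listing {
  title: string;
  price: number; // USD
  url: string;
}

// In a real pipeline these would come from fetching and parsing HTML.
const listings: Listing[] = [
  { title: "2014 Honda Civic", price: 7500, url: "https://example.com/a" },
  { title: "2009 Toyota Camry", price: 4200, url: "https://example.com/b" },
  { title: "2021 BMW 330i", price: 24000, url: "https://example.com/c" },
];

// Pure, deterministic filter step: no model involved.
function filterByBudget(items: Listing[], maxPrice: number): Listing[] {
  return items.filter((l) => l.price <= maxPrice);
}

// Deterministic formatting step for the notification body.
function formatReport(items: Listing[]): string {
  return items.map((l) => `${l.title} ($${l.price}): ${l.url}`).join("\n");
}

console.log(formatReport(filterByBudget(listings, 10000)));
```

In a real workflow the `listings` array would come from a fetch-and-parse step, and the report would feed an email tool. An LLM only needs to enter the picture for the genuinely fuzzy step, like judging a car from its photos.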

The problem is that while workflows excel at structured, repeatable tasks, authoring them has poor ergonomics. That's what I tried to solve with WorkflowSkill.

The Rise (and Probably Fall) of WorkflowSkill

I tested OpenClaw on what should be an ideal use case: browsing Craigslist for a used car on my behalf, scheduled hourly. It went horribly. Different filters every time. Sometimes forgot to apply them. Got lost navigating pages. Reports formatted differently each run. Occasionally it worked great. In a few hours, I'd burned through $30 in Anthropic credits.

This is exactly the kind of task that would add real value to my parents' lives, which is why I was testing it.

We could solve it with an n8n flow. I've done that before. All orchestration (page fetching, HTML parsing, filtering) handled deterministically. Zero inference. Cheap and reliable. And n8n isn't the only option. Visual workflow builders like Rivet and Vellum offer similar power with better collaboration features, and agent SDKs from Anthropic, OpenAI, and AWS (Strands) provide code-first orchestration for developers who want maximum control.

But building workflows in this way is technical and time-consuming. My parents can't build one.

My hypothesis: there's a better paradigm, where workflows are authored using natural language.

I built WorkflowSkill to test this. It's a YAML-based workflow language with a TypeScript runtime. The agent defines steps — tool calls, LLM prompts, data transforms, conditionals — and the runtime handles execution and observability.

Key components of the architecture are as follows:

  • Workflow language: Five step types (tool, llm, transform, conditional, exit) designed to be easy for an agent to generate as syntactically valid YAML. The step-chaining model maps directly to several of the composable patterns described in Anthropic's "Building Effective Agents" guide: prompt chaining (steps connected by output references), routing (conditional branching), and evaluator-optimizer (the validate/run/iterate authoring loop). The language is deliberately minimal. Complexity comes from composition, not vocabulary.
  • Skill extension: Workflows are embedded as code-fenced YAML inside a SKILL.md file, the same format used by OpenClaw, Claude Code, and the emerging Agent Skills specification. This means a system that hasn't adopted the WorkflowSkill runtime can still read and interpret the workflow from the Markdown alone. The YAML is self-describing. The design intent is interoperability over lock-in: when workflow definitions are portable across agent ecosystems, authors can build once and consumers can run anywhere. This lifts the entire industry.
  • Tool abstraction: WorkflowSkill's ToolAdapter interface decouples workflow logic from tool implementation. A tool step can invoke built-in tools, custom functions, or any MCP server endpoint. The runtime expects the parent agent system to inject available tool context, so the agent can author workflows using whatever tools the host environment provides, without WorkflowSkill needing to know about them in advance.
  • Evaluator-optimizer workflow authoring: The runtime exposes a validate tool that the agent calls after generating a workflow. Validation catches structural errors (malformed references, type mismatches, missing outputs) before the workflow ever executes. The agent reviews the validation output, fixes the issues, and re-validates until the workflow is sound. The evaluator here is deterministic, not model-based, making it free and objective. In practice, this means most structural problems are resolved without human involvement. The ceiling the agent hits isn't syntax or structure; it's domain knowledge, like knowing an API needs rate limiting.
  • Eval-driven skill iteration: The runtime includes a test suite that measures authoring skill performance: given a specific task description, does the agent generate a workflow that matches the known-good structure? Each test targets a specific part of the language spec (loop handling, error recovery, output threading) so regressions are caught per-feature. This follows Anthropic's guidance on building tools for agents, which emphasizes that systematic evaluation and reviewing raw tool call transcripts is essential for improving agent performance, and that tool descriptions should be refined based on eval results rather than intuition.
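
To give a flavor of the language, here is a hypothetical sketch of a workflow touching the five step types. The field names and `${...}` reference syntax are my guesses for illustration, not the actual WorkflowSkill schema:

```yaml
# Hypothetical sketch: field names and reference syntax are illustrative
# guesses, not the real WorkflowSkill spec.
steps:
  - id: fetch_listings
    type: tool
    tool: http_get
    params:
      url: https://example.org/listings
  - id: parse
    type: transform
    input: ${fetch_listings.output}
  - id: evaluate
    type: llm
    prompt: "Which of these listings look like good deals? ${parse.output}"
  - id: check
    type: conditional
    if: ${evaluate.output}
    then: notify
    else: done
  - id: notify
    type: tool
    tool: send_email
    params:
      body: ${evaluate.output}
  - id: done
    type: exit
```

Prompt chaining shows up as the step-output references, and routing as the conditional branch, which is how the language maps onto Anthropic's composable patterns.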

I also built an OpenClaw plugin to try it within the most hyped agent ecosystem of the moment.

It totally works. You can converse naturally with an agent and get effective, robust workflows. But the hard reality: it still takes significant technical instruction and iteration to get complex tasks performing well. The agent generates syntactically valid YAML every turn, but knowing, for example, that you should rate-limit your API calls isn't obvious to an agent authoring the flow. With WorkflowSkill, as it stands today, I've found that production-grade results still require someone technical enough to say, "you're calling the API 100 times in a row. Add a 500ms delay."

The Author/Consumer Split

This finding forces a reframing. I had imagined a single user who describes a task and gets a working workflow. But if production-quality authoring requires technical skill, you get two personas:

Authors build workflows. They understand the tools, the data flow, the failure modes. They iterate until it's robust. This is technical work.

Consumers use finished workflows. Push a button, get a daily report. Don't care how it works. Shouldn't have to. This part can be non-technical and conversational.

This split shows up everywhere. WordPress powers 40% of the web. Almost none of those site owners built their own theme. Shopify is the same. Someone designs the storefront template; the shop owner picks one and adds products. Creation and consumption are fundamentally different activities performed by different people.

It's exactly what I did with the n8n job search flow. I spent hours building and debugging it, then open-sourced it. Others adopted it with minimal effort. I was the author; they were the consumers.

The AI automation space is formalizing this pattern. ClawHub is a marketplace of community-built OpenClaw skills. n8n has a similar template marketplace. Even Apple Shortcuts does the same: power users build complex automations, everyone else downloads and taps "run."

The platforms are discovering what WordPress already knew: most people don't want to build. They want to use what someone else built.

Is the split inevitable?

There's a strong case it's inherent to the problem. Complex tasks have complex failure modes. Anticipating them requires experience with the systems involved. You often can't avoid watching the workflow fail and troubleshooting why.

If true, the right move is to embrace it: make authoring as powerful as possible for technical users, consumption as effortless as possible for everyone else, and build the bridge between them.

Or is it solvable?

But some things nag at me. I'm not sure I'm ready to give up.

WorkflowSkill is a custom YAML language that exists in zero training data. I made it up days ago. Of course the agent struggled with edge cases. Considering it's writing runnable workflows in a language it just learned, that's actually impressive. And it's improving rapidly via eval-driven skill iteration.

I've been thinking, what if the agent wrote workflows in a well-established framework massively represented in its training data? Something with thousands of open-source examples, production implementations, and Stack Overflow answers? The competence gap between a custom DSL and an established framework could be the difference between "needs a human to debug" and "handles it alone."

Now layer on: what if the agent had access to a library of proven, production-tested workflows to reference before writing new ones? Not installing templates. Reading them. Learning patterns. If you want a flight price monitor and the agent can study three existing monitors that already handle rate limiting, retries, and notification formatting, it's not writing from scratch. It's adapting known-good patterns. That's how experienced developers actually work.

And what if the runtime had durability and error recovery built in, so the author could focus on the workflow logic rather than infrastructure concerns?
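
As one illustration of what "built in" could mean, a runtime might wrap every step in a retry policy so authors never write this themselves. A minimal sketch, where the function name and policy shape are invented for illustration and are not part of any real WorkflowSkill API:

```typescript
// Sketch of a retry-with-backoff primitive a durable runtime might provide.
// The names and policy shape are invented for illustration.

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number; // doubles after each failed attempt
}

async function withRetry<T>(
  step: () => Promise<T>,
  policy: RetryPolicy
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
      // Exponential backoff before the next attempt.
      const delay = policy.baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A flaky fetch step wrapped as `withRetry(() => fetchPage(url), { maxAttempts: 3, baseDelayMs: 500 })` would then recover from transient failures without the author ever thinking about it.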

Combine strong priors from pre-training with high-quality reference implementations and native durability and error recovery, and maybe the skill gap shrinks enough that a non-technical person can iterate their way to a production-quality workflow through conversation alone.

And that's before even getting to better iteration in the authoring process. The runtime already has validation tools built into workflow iteration. But if the agent could truly sandbox and pressure-test a workflow before running it live, it would be far more likely to reach production quality without intervention.

So there's still hope. Mom and dad may author workflows with natural language someday.

What I think I know

  • Most of what regular people want from AI is workflows. Structured, multi-step, repetitive tasks. Not chat. Automation. But we face two opposing challenges:
    • Pure agent-based approaches are too expensive and unreliable for these tasks today. Letting an LLM improvise every step of a repeatable process is the wrong tool.
    • Pure workflow approaches are too technical for non-technical people. n8n works brilliantly, but requires a mental model of data flow that most people don't have and shouldn't need.
  • Natural language authoring is promising but not yet sufficient. An agent can generate valid workflows from conversation, but production-grade results still need technical judgment to refine.
  • The author/consumer split is a natural pattern. It shows up everywhere in technology, and fighting it entirely may be less productive than designing for it. But the split may not need to be as wide as it is today. Better frameworks, richer reference libraries, built-in durability, and smarter iteration tooling could shrink the gap between "needs a developer" and "a conversation gets you there."
  • The winning platform will nail the bridge between authors and consumers. Making workflows easy to share, discover, customize, and trust is at least as important as making them easy to build.

What I'd Love to Hear

I'm publishing this half-formed on purpose. If you're thinking about these problems too:

  • Has anyone found a framework or environment where agents can reliably author and run workflows? I tested a custom DSL and hit a ceiling. I suspect something with deep training-data representation performs better, but I haven't proven it.
  • Is the author/consumer split permanent, or do you think it's an artifact of current ergonomics?
  • What's the right abstraction level for consumers? Should they see steps and tweak them, or should the whole workflow be a black box behind a single sentence of intent?
  • Are there platforms I'm missing? I've looked at n8n, OpenClaw, and a handful of others, but this space moves fast.

And if you're a non-technical person who's tried to automate something with AI, I especially want to hear from you. The "mom and dad" problem doesn't get solved by builders talking to other builders.
