Building customer-facing data products: A builder’s perspective

19 December 2025
Author(s)

Elisabeth Reitmayr

Head of Engineering - AI & Data

At Parloa, we build data products that help our customers understand the performance of their AI agents. Our users need more than the insights themselves: they also need those insights to be reliable and resilient. Our data products are built for customer service teams making high-stakes decisions in production environments, which means they need to be treated as mission-critical production systems, with all the engineering rigor that entails.

At Parloa, our AI agents drive high-stakes customer interactions. This demands a data platform designed for resilience. This post gives an overview of the architecture and governance principles we’ve implemented to meet this challenge: leveraging a modern lakehouse pattern on Databricks, enforcing data quality with frameworks like DQX, adopting a flexible, conversation-specific Activity Streams 2.0 event taxonomy, and maintaining consistency through governance-as-code and a federated model.

Data products help us build trust in our platform

Trust is a prerequisite for our customers to rely on our platform: we manage autonomous systems that interact with their customers on their behalf. Data and insights help us build and maintain that trust by showing our customers how well their agents perform.

Customer service operates in an inherently ambiguous environment: Context is partial, stakes are emotional, speed matters, and surface metrics may not reveal the whole truth.

A 30-second call duration could mean efficient problem resolution or a frustrated customer hanging up. Average handle time tells you nothing about whether the conversation was helpful, on-brand, or even coherent. This is why our data products capture interaction nuance, not just interaction metadata.

We need to surface the story behind the metrics: what's working, what's broken, and what patterns emerge across thousands of conversations.

Building data products for customers

For data and analytics engineers, the shift from building internal business intelligence dashboards to customer-facing data products is not just a change in audience: it is a critical leap in engineering rigor. It means trading the relative safety of an internal data warehouse for the complexity of a mission-critical production system with external-facing SLAs. This has important implications for our work:

  • Data quality, data freshness, performance and availability are business-critical

  • Documentation becomes a product feature

  • Every edge case becomes a support ticket

This shift cascades through the work of our Data Platform and Analytics teams: We don't operate like classical internal data teams but rather like engineering teams building critical applications.

How we capture high-quality data

Design and organizational principles

Parloa serves diverse industries, languages, and workflows. Our data model can't be opinionated about domain-specific logic; it needs to be flexible enough to accommodate use cases we haven't seen yet. It also requires us to give our customers the ability to enrich our data products with custom semantics and context.

Our engineers don't just produce data - they own data outcomes. That means:

  • Understanding customer workflows and working backwards, translating success metrics into the raw events that capture them

  • Adopting our standardized event taxonomy so that system-user interactions are described consistently and comprehensibly

  • Treating documentation as a product

  • Prioritizing reliability and resilience for our data services and infrastructure

Event Schema: Why Activity Streams 2.0

Our platform leverages an event-based approach to capture analytics data. Choosing the right event taxonomy was critical. We needed a standard that could handle conversation and voice-specific semantics (interruptions, intermediate messages, call transfers) while remaining extensible for use cases we haven't seen yet.

We evaluated several options:

  • Custom schema: Maximum flexibility, but every downstream consumer needs custom parsing logic

  • OpenTelemetry: The standard for observability, but optimized for traces and metrics rather than domain events

We chose Activity Streams 2.0 (the W3C standard originally designed for social media apps) because it provides:

  • Strong vocabulary extension mechanisms: We can add conversation-specific activity types while maintaining compatibility with standard tooling. For example:

service:request:toolResponse:conversation, service:recognize:speechStart:conversation, service:end:audioStream:conversation

  • Actor-object-action semantics: Natural fit for "agent calls tool," "customer interrupts agent," "system transfers to human agent"

  • JSON-LD foundation: Enables semantic linking and future integration with knowledge graphs

The tradeoff: increased schema complexity compared to flat JSON events. But this pays off in reduced transformation logic—our analytics queries can leverage the standardized structure rather than custom parsers for each event type.
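To make this concrete, the snippet below sketches what a conversation-specific activity could look like in the actor-object-action shape, written as a Python dict. The extension namespace, IDs, and field values are illustrative, not our actual payload.

# Illustrative only: a conversation-specific Activity Streams 2.0 activity.
# The extension namespace, IDs, and values are hypothetical, not Parloa's real payload.
event = {
    "@context": [
        "https://www.w3.org/ns/activitystreams",
        {"parloa": "https://example.com/ns/conversation#"},  # hypothetical extension namespace
    ],
    "type": "parloa:ToolResponse",            # maps to service:request:toolResponse:conversation
    "actor": {"type": "Service", "name": "voice-agent-runtime"},
    "object": {"type": "parloa:ToolCall", "name": "lookup_order_status"},
    "target": {"type": "parloa:Conversation", "id": "urn:conversation:1234"},
    "published": "2025-11-03T10:15:27.412Z",
}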

Data producers generate events that consistently follow our taxonomy, an AI agent-specific adaptation of Activity Streams 2.0. We collect these events as structured telemetry designed for downstream analysis, leveraging a unified ingestion framework.

Data platform architecture

┌─────────────────────────────────────────────────────────────────┐
│                     EVENT GENERATION LAYER                      │
├─────────────────────────────────────────────────────────────────┤
│  Voice Agent Runtime  →  Activity Streams 2.0 Events            │           
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                     INGESTION LAYER                             │
├─────────────────────────────────────────────────────────────────┤
│  Event Stream (Kafka) → Schema Validation                       │
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                     STORAGE LAYER                               │
├─────────────────────────────────────────────────────────────────┤
│  Azure Data Lake Storage (ADLS) + Unity Catalog                 │
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                   TRANSFORMATION LAYER                          │
├─────────────────────────────────────────────────────────────────┤
│  Spark Declarative Pipeline                                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │ Raw Tables   │→ │ Cleaned      │→ │ Metrics      │           │
│  │ (Bronze)     │  │ (Silver)     │  │ (Gold)       │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└────────────────┬────────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      SERVING LAYER                              │
├─────────────────────────────────────────────────────────────────┤
│  Delta Sharing + Databricks Lakehouse                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │ Customer     │  │ Internal     │  │ Raw Data     │           │
│  │ Dashboards   │  │ Analytics    │  │ Access       │           │
│  └──────────────┘  └──────────────┘  └──────────────┘           │
└─────────────────────────────────────────────────────────────────┘

Our data platform follows a modern lakehouse pattern with five distinct layers:

Event generation

AI agent interactions emit Activity Streams 2.0-compliant events in real-time. Each event (e.g., call started, intermediate message played, tool invoked, transfer initiated) includes semantic context, timing, and relationships to other events. We extended the W3C standard with conversation-specific vocabulary while maintaining compatibility with standard Activity Streams tooling. For example:

agent:create:message:conversation, service:request:routingDestination:conversation

Ingestion (Real-time)

Events flow through Kafka with schema validation at ingestion. Invalid events route to a dead letter queue for investigation - this is our first quality gate.
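As a rough sketch of this first quality gate, the snippet below validates incoming events against a minimal envelope schema and routes failures to a dead letter topic. It assumes the confluent-kafka and jsonschema Python packages; topic names, configuration, and the schema itself are illustrative.

# Sketch of the ingestion-time quality gate: validate each event against a JSON Schema
# and route failures to a dead letter topic. Topics, config, and schema are illustrative.
import json
from confluent_kafka import Consumer, Producer
from jsonschema import validate, ValidationError

ENVELOPE_SCHEMA = {  # simplified stand-in for the real envelope schema
    "type": "object",
    "required": ["type", "actor", "object", "published"],
    "properties": {"published": {"type": "string"}},
}

consumer = Consumer({"bootstrap.servers": "broker:9092", "group.id": "ingestion"})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["conversation-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        validate(event, ENVELOPE_SCHEMA)                      # first quality gate
        producer.produce("conversation-events-validated", msg.value())
    except (json.JSONDecodeError, ValidationError) as err:
        # Invalid events are preserved for investigation, not silently dropped
        producer.produce("conversation-events-dlq", json.dumps(
            {"error": str(err), "payload": msg.value().decode("utf-8", "replace")}
        ))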

Storage (Immutable)

Raw events land in Azure Data Lake Storage (ADLS) partitioned by tenant, date, and event type. This immutable layer serves as our source of truth for investigation and backfills. We use Delta Lake for ACID guarantees and time-travel queries, with Unity Catalog providing centralized governance and metadata management across the platform.
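A minimal PySpark sketch of this landing step is shown below; the storage path, topic, and column names are assumptions for illustration, not our production configuration.

# Sketch of the Bronze landing step: raw, validated events are appended to an
# immutable Delta table, partitioned for tenant-scoped reads and backfills.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "conversation-events-validated")
    .load()
    .withColumn("event", F.col("value").cast("string"))
    .withColumn("tenant_id", F.get_json_object("event", "$.tenantId"))
    .withColumn("event_type", F.get_json_object("event", "$.type"))
    .withColumn("event_date", F.to_date("timestamp"))
)

(
    raw.writeStream.format("delta")
    .option("checkpointLocation", "abfss://lake@account.dfs.core.windows.net/_checkpoints/bronze_events")
    .partitionBy("tenant_id", "event_date", "event_type")
    .toTable("bronze.conversation_events")
)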

Transformation (Batch, Incremental)

We use Spark Declarative Pipelines in a medallion architecture:

  • Bronze: Raw events

  • Silver: Validated, deduplicated, enriched with business context

  • Gold: Pre-aggregated metrics and dimensional models

Declarative transformations are defined as SQL queries with dependency management, enabling Databricks to optimize execution plans. Incremental models refresh periodically for operational metrics, ensuring customers see fresh data throughout the day.
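The sketch below uses the Python flavor of the declarative pipeline API to illustrate the Silver and Gold steps (our own definitions are SQL); table names, columns, and aggregations are illustrative.

# Sketch of the Silver and Gold steps as a Python declarative pipeline definition.
# Table and column names are illustrative, not our production models.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Validated, deduplicated conversation events")
def silver_conversation_events():
    return (
        dlt.read("bronze_conversation_events")
        .dropDuplicates(["event_id"])
        .filter(F.col("tenant_id").isNotNull())
    )

@dlt.table(comment="Daily call metrics served to customer dashboards")
def gold_daily_call_metrics():
    return (
        dlt.read("silver_conversation_events")
        .groupBy("tenant_id", "event_date")
        .agg(
            F.countDistinct("conversation_id").alias("calls"),
            F.avg("call_duration_seconds").alias("avg_handle_time_seconds"),
        )
    )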

Serving (Multi-modal access)

Customer dashboards query pre-aggregated Gold tables from our Data Lakehouse. Customers can get access to event-level data through Delta Sharing, an open protocol for secure data sharing. Internal analysts get direct SQL access to Silver and Gold tables for ad-hoc exploration through Unity Catalog permissions.
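For illustration, the snippet below shows how a recipient could read a shared table with the open delta-sharing connector; the profile file and the share, schema, and table names are placeholders.

# Sketch: reading shared event-level data with the open delta-sharing connector.
# The profile file and share/schema/table names are illustrative.
import delta_sharing

profile = "parloa_share_profile.json"               # credentials file provided to the recipient
table_url = f"{profile}#conversation_share.analytics.silver_conversation_events"

events = delta_sharing.load_as_pandas(table_url)    # or load_as_spark for large volumes
print(events.head())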

Data Warehouse setup

We use Databricks on Azure for both transformation and serving, allowing us to leverage the same compute infrastructure for both data processing and querying. The lakehouse architecture eliminates the need for separate ETL pipelines to move data between storage and compute.

Data sources

Our data products combine two core capabilities:

Event-level instrumentation

We track granular behavioral signals from voice agents in production:

  • Routing and call handover timing and triggers

  • Audio events

  • Standardized and customizable parameters from tool calls

  • Standardized deterministic evaluations for known issues (e.g., checks for tool calls and their sequence, tool call parameters, presence of intermediate messages, etc.)

  • Latency measures

LLM-powered evaluation

We leverage the “LLM as a judge” approach to generate conversation insights via evaluation, like call resolution, intent classification, or callers asking for human assistance. The real power of LLM evaluations lies in their flexibility: the users of our platform can define custom evaluations and use-case-specific topic classifiers (e.g., product extraction or customer segmentation).
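As a rough illustration of the pattern (not our actual evaluation service), the sketch below wraps a hypothetical LLM completion callable in a resolution check; the rubric, function names, and output format are invented for the example.

# Sketch of the LLM-as-a-judge pattern for a custom evaluation. The `complete`
# callable stands in for whichever LLM client is used; the rubric is illustrative.
import json

RESOLUTION_RUBRIC = """You are evaluating a customer-service conversation.
Return JSON: {"resolved": true|false, "reason": "<one sentence>"}.
A call is resolved if the caller's request was fully addressed without a human handover."""

def evaluate_resolution(transcript: str, complete) -> dict:
    """`complete` is a hypothetical callable: prompt string -> model response string."""
    response = complete(f"{RESOLUTION_RUBRIC}\n\nConversation:\n{transcript}")
    return json.loads(response)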

How we ensure high data quality

Shift left: catching issues before they reach customers

Data teams often discover quality issues after data reaches production dashboards - when stakeholders report incorrect metrics. We've inverted this model by shifting validation as far left as possible in our data pipeline:

  • At event generation: Envelope schema enforcement ensures structural consistency before events enter Kafka

  • At ingestion: JSON Schema validation catches malformed data at the pipeline entry point

  • At transformation: DQX quality checks quarantine invalid records before they reach Silver/Gold layers

  • Before release: Test-driven development with unit, integration, and smoke tests validates logic before deployment

This "shift left" approach means we detect and fix issues progressively earlier:

  • Schema violations: Caught at generation (seconds after creation)

  • Data quality issues: Caught at transformation (minutes before serving)

  • Business logic bugs: Caught in testing (days before production)

  • Metric definition problems: Caught in domain expert review (weeks before customer impact)

The further left we catch an issue, the lower the cost - both in engineering time and customer trust.

Data quality via engineering excellence

Data quality = trust. What this means for our data products: Incorrect metrics don't just break dashboards - they erode platform trust.

We implement data quality checks using the DQX framework (Databricks Labs), which is integrated directly into our transformation pipelines. Unlike traditional approaches that detect issues after data reaches production, DQX validates data in real-time as it flows from Bronze to Silver and from Silver to Gold layers.

Records that fail validation - whether due to schema violations, invalid values, duplicated records or broken business rules - are automatically quarantined with detailed error context. This prevents bad data from reaching customer-facing dashboards while preserving it for investigation and potential correction. Valid records proceed through the pipeline to feed metrics and analytics, ensuring customers only see data that meets our quality standards.
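The snippet below is a simplified PySpark stand-in for this validate-and-quarantine split, shown only to illustrate the pattern; it is not the DQX API, and the rules, columns, and table names are illustrative.

# Simplified stand-in for the validate-and-quarantine split that DQX handles for us
# (this is not the DQX API). Rules, columns, and table names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.table("bronze.conversation_events")

rules = {
    "tenant_id_missing": F.col("tenant_id").isNull(),
    "negative_duration": F.coalesce(F.col("call_duration_seconds") < 0, F.lit(False)),
}

annotated, failed_any = events, F.lit(False)
for name, violation in rules.items():
    annotated = annotated.withColumn(name, violation)    # per-rule error context
    failed_any = failed_any | violation

quarantined = annotated.filter(failed_any)                # preserved for investigation
valid = annotated.filter(~failed_any).drop(*rules)        # only clean records move on

quarantined.write.mode("append").saveAsTable("quarantine.conversation_events")
valid.write.mode("append").saveAsTable("silver.conversation_events")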

Data quality check architecture

┌──────────────────────────────────────────────────────────────────┐
│                         BRONZE LAYER                             │
│                   (Raw Events - ADLS/Delta)                      │
└────────────────────────────┬─────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│                    DQX QUALITY VALIDATION                        │
├──────────────────────────────────────────────────────────────────┤
│  • Schema validation (required fields, types)                    │
│  • Value validation (nulls, ranges, enums)                       │
│  • Business rules (e.g. tenant_id exists, duration > 0)          │
│  • Cross-source checks (semantic validation)                     │
└────────────┬──────────────────────────────┬──────────────────────┘
             │                              │
             │ Valid Records                │ Invalid Records
             ▼                              ▼
┌──────────────────────────┐     ┌──────────────────────────────┐
│    SILVER LAYER          │     │   QUARANTINE TABLES          │
│  (Validated & Cleaned)   │     │  (Failed Quality Checks)     │
│                          │     │  • Detailed error context    │
│  → Gold Layer            │     │  • Investigation & re-ingest │
│  → Serving Layer         │     │  • Monitored via dashboard   │
└──────────────────────────┘     └──────────────────────────────┘

We also enforce quality through concrete practices:

  • Upstream validation: Schema validation at ingestion using JSON Schema with custom validators for domain-specific constraints (e.g., tenant_id must exist, call_duration must be positive)

  • Automated data quality checks: We use custom data quality tests in our transformation pipelines that validate completeness (no unexpected nulls), freshness (data arrived within SLA), and consistency (foreign keys resolve, aggregations match expected distributions)

  • Test-driven development: We write data quality tests before implementing pipelines. For example, before adding a new metric calculation, we write tests that verify edge cases (division by zero, empty datasets, late-arriving data) using dbt tests and custom validators; see the test sketch after this list

  • High test coverage: Every transformation has unit tests (mocked data), integration tests (test environment with representative data), and smoke tests (production with automated alerts)

  • Data quality monitoring: We leverage Databricks SQL alerts to notify us when records land in the quarantine tables or when data freshness from upstream tables drops (late-arriving data)

  • Tight feedback loops: Anomaly detection on key metrics with alerts for SLA breaches, automated Slack notifications for data quality failures, and daily summary reports of quarantined records

  • Comprehensive observability: We instrument our pipelines for distributed tracing, custom metrics for data volume and latency, and detailed logging of validation failures with examples of rejected records

  • Incident response runbooks: Documented playbooks for common failure modes (upstream schema change, spike in invalid events, downstream query timeout) with clear escalation paths

  • Post-mortem culture: Every data quality incident gets a blameless post-mortem focused on systemic improvements - we've added validation rules, improved error messages, and enhanced monitoring based on real failures
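As referenced above, the sketch below shows the style of test we might write before implementing a metric; the resolution_rate helper and its edge-case behavior are hypothetical, invented for the example.

# Sketch of a test written before implementing a metric calculation; the
# resolution_rate helper and its edge-case behavior are illustrative.
import pytest

def resolution_rate(resolved_calls: int, total_calls: int) -> float:
    """Hypothetical metric helper: share of calls resolved without a human handover."""
    if total_calls == 0:          # edge case surfaced by the test below
        return 0.0
    return resolved_calls / total_calls

def test_resolution_rate_handles_empty_dataset():
    assert resolution_rate(0, 0) == 0.0

def test_resolution_rate_regular_case():
    assert resolution_rate(30, 40) == pytest.approx(0.75)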

Our major principle: Fix data quality at source

We draw a clear distinction between data quality validation and semantic transformation:

Data quality issues (fix at source, don't patch downstream), for example:

  • Schema violations: Missing required fields, incorrect types

  • Invalid values: Null tenant_id, negative call_duration, invalid enums

  • Malformed data: Unparseable timestamps, broken JSON structures

When we detect quality issues, we quarantine the records and fix them in the source system. We don't patch bad data in transformation pipelines - that creates technical debt and obscures the root cause.

Documentation as user experience

Our data documentation is customer-facing, which means it must explain:

  • What the event or metric measures (definition)

  • Why it matters (business context)

  • When to use it (decision support)

  • How it's calculated (transparency)

We treat docs as a core part of the product experience.

Federated governance: balancing autonomy and consistency

Data generation happens across teams and services, and quality is non-negotiable. Our governance model treats each event as a customer-facing product, with analytics acting as the product owner and data-producing teams implementing it.

Envelope schema for safety

We use an envelope schema that provides structural guarantees (timestamps, actor, action, object) without dictating the specifics of the payload. Teams can't break the pipeline - they can only fail validation, which catches issues before production.
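A minimal sketch of such an envelope is shown below: the structural fields are required, while the payload stays intentionally unconstrained. The field names and the sample event are illustrative, not our real schema.

# Illustrative envelope: the structural contract is fixed, the payload is free-form.
# Field names mirror the actor-object-action shape; details are a sketch, not the real schema.
from jsonschema import validate

ENVELOPE = {
    "type": "object",
    "required": ["published", "actor", "type", "object"],
    "properties": {
        "published": {"type": "string"},     # ISO 8601 timestamp
        "actor": {"type": "object"},
        "type": {"type": "string"},          # e.g. service:request:toolResponse:conversation
        "object": {"type": "object"},
        "payload": {"type": "object"},       # team-specific content, intentionally unconstrained
    },
}

validate(
    {
        "published": "2025-11-03T10:15:27Z",
        "actor": {"name": "voice-agent-runtime"},
        "type": "service:request:toolResponse:conversation",
        "object": {"name": "lookup_order_status"},
        "payload": {"custom_field": "anything the producing team needs"},
    },
    ENVELOPE,
)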

Definition consensus process

  • Analytics proposes new events or changes based on customer needs

  • Representatives from data-producing teams review for feasibility

  • Product (Analytics) has final decision authority to ensure cross-domain consistency

This asymmetry is intentional: analytics sees patterns across all teams that individual producers can't.

Enforcing taxonomy compliance

We validate against the events taxonomy at ingestion time. Invalid events are quarantined with detailed error messages that point to the taxonomy docs. Teams receive immediate feedback in development, not after deployment.

Handling breaking changes

Because the envelope is fixed and payload is flexible, "breaking changes" are actually semantic - redefining what "call transfer" means, not changing data structure. We version the taxonomy itself and require explicit migration plans that show impact analysis across all downstream consumers before approval.

Governance-as-code

Event definitions live in a central repository as JSON Schema and human-readable descriptions. Pull requests trigger automated impact analysis showing which teams' events would fail new validation rules. Analytics reviews conflicts before merging.
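As an illustration of what such an automated check could look like, the sketch below replays each team's sample events against a proposed schema and reports which producers would start failing; the repository layout and file names are hypothetical.

# Sketch of a CI-style impact check: replay each team's sample events against the
# proposed schema and report which producers would start failing. Paths are hypothetical.
import json
from pathlib import Path
from jsonschema import Draft202012Validator

proposed = json.loads(Path("schemas/envelope.schema.json").read_text())
validator = Draft202012Validator(proposed)

impact = {}
for sample in Path("samples").glob("*/events/*.json"):   # samples/<team>/events/*.json
    team = sample.parts[1]
    errors = [e.message for e in validator.iter_errors(json.loads(sample.read_text()))]
    if errors:
        impact.setdefault(team, []).extend(errors)

for team, errors in impact.items():
    print(f"{team}: {len(errors)} sample events would fail the proposed schema")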

This federated model enables us to scale without creating bottlenecks or sacrificing consistency.

The bigger picture

Customer-facing data products aren't just about dashboards - they're about embedding insights into decision-making workflows and building trust through reliability, resilience, and high-quality data. In customer service, the most important voice isn't the AI agent's - it's the customer's. Every data product we ship should help our customers hear them more clearly.