Physical Cluster Replication

Designed for an API-first launch without disrupting existing workflows

Designed the user experience for Physical Cluster Replication (PCR), a disaster recovery protocol that enables customers with two data centers to survive a regional outage with minimal downtime.

PCR addressed a major adoption blocker for enterprise customers unable to support CockroachDB’s three-region deployment model, opening a new market segment (20M+) for CockroachDB Cloud.

I determined what PCR information was needed in the Cloud UI to support an API-first launch without disrupting existing workflows.

My role - Sr. Staff Product Designer leading technical UX strategy, workflow architecture, and end-to-end product design

Team - 1 UX researcher, 2 PMs, 2 engineering teams

Timeline - Jul 2025 - Nov 2025

Platform - CockroachDB Cloud

Status - Multiple customers using this feature in production


Challenge

CockroachDB had lost enterprise opportunities because some prospects operated across two data centers and could not support CockroachDB’s three-region deployment model.

PCR was introduced to address this gap, and the business expected enough demand to justify investing in an API-first release. The team intentionally avoided designing the full UI workflow upfront. We wanted to validate real adoption patterns first and learn whether customers preferred managing replication programmatically or through the Cloud UI.

The challenge was determining what needed to exist in the UI to support the API launch safely and coherently, while avoiding investment in workflows that might evolve significantly post-release.


Defining the UI scope

Because the PM came from a database infrastructure background and had limited experience designing customer-facing workflows, they relied heavily on me to define the UI implications.

I created a comprehensive journey map that helped the team visualize the long-term operational workflow, identify API-release blockers early, and align on a phased rollout strategy.

The PM and I quickly aligned on prioritizing the UI workflow in phases:

  1. Monitoring

  2. Maintenance

  3. Troubleshooting

  4. Creation

  5. Failover

This reframed the work from a narrow API support task into a scalable operational workflow foundation for future UI investment.

Full User Journey map

Monitoring section of User Journey map


Learning from research

I identified early that the most important UX problem was making the relationship between the Primary and Standby clusters immediately understandable.

The standby cluster intentionally behaved differently from a standard cluster. Settings were locked, writes were disabled, and parts of the system were managed automatically. Without clear context, these constraints appeared broken or confusing.

This insight shaped the overall design direction. Rather than presenting the standby as an independent cluster, I designed the experience around the replication pair:

  • labeling the Primary and Standby throughout the UI

  • creating a navigational connection between paired clusters

  • surfacing replication health and status prominently

I validated two organizational models through user interviews:

  • a hierarchical model where paired clusters lived together

  • an independent model where clusters could exist separately

Users consistently understood the hierarchical model more easily and struggled to recognize the relationship in the independent version. I recommended requiring paired clusters to exist within the same folder structure.

However, because the team was still learning how customers would operationalize PCR, Product chose the more flexible independent model to avoid introducing constraints that could limit adoption. I revised the language, hierarchy, and layouts based on the research findings to clarify the relationship within those constraints.

Research also exposed a broader issue: the existing Cluster overview page lacked sufficient hierarchy to support complex operational states like PCR. Rather than forcing the feature into an already overloaded framework, I initiated a parallel redesign of the Cluster overview experience. Read about that project in the Cluster health case study.

Users immediately understood relationship

Independent model increased flexibility but weakened comprehension


Prioritization and scope

Not everything could ship with the API, so I developed a phased roadmap and got alignment on it early.

The first phase, the hard blocker for the API release, covered the Cluster list, Cluster overview, metrics, version upgrade, and cluster edit pages. These were the minimum needed to make the product coherent. Without them, users couldn't understand what they were looking at.

The second phase covered the failover flow itself: the confirmation step, the transition state, and the post-failover experience where the roles reverse. This was critical for the product to be usable end-to-end, but it could follow the API by a short window without creating immediate confusion for early users.

Additional phases covered setup improvements, inline guidance for first-time PCR users, and tooling for monitoring replication health over time. I also proposed a set of metrics to collect at launch, such as drop-offs and failover initiation rates, to show which parts of the experience were causing friction, so that future iterations could be grounded in data rather than assumption.

The PM and engineering leads aligned on this sequencing. It was a practical negotiation: what had to be true for the API to ship without breaking trust, and what could follow once the core was solid.


Outcomes

The PCR API shipped. The UI work is in active development, tracking the phased plan.

Before this project, there was no shared model for how PCR should appear in the console. The journey map and the prioritized mockups gave the team (engineering, PM, and design) a concrete, agreed-upon picture of the feature across its full lifecycle. For a PM working primarily in the API layer, having that visual model changed how the conversations ran. UI implications that had been invisible became part of the planning process.

The PCR work also had a secondary impact: it was the direct catalyst for the Cluster Overview redesign. By stress-testing the overview page as a destination for PCR data, I surfaced a structural problem that the organization had been working around for years. Rather than continue to work around it, I used PCR as the forcing function to address it. Those two projects are now running in parallel and will ship in coordination.

The instrumentation proposal, which covers collecting time-on-page and interaction data for the new PCR surfaces, is also planned, which means we'll have a clean before/after signal to evaluate whether the phased rollout is achieving its goals.


What I’d do differently

If I were doing this project today, I would incorporate AI earlier in the workflow to accelerate exploration and iteration.

I’d use AI-assisted synthesis to summarize technical documentation, identify workflow edge cases, and rapidly generate early journey maps and relationship models. This would allow me to spend less time constructing foundational artifacts manually and more time validating operational assumptions with users and stakeholders.

I would also use AI to expand exploratory design work by quickly generating alternative representations of the Primary/Standby relationship and stress-testing workflows through heuristic analysis before user research.

That said, the most critical parts of the project, namely defining the operational model, identifying the core user confusion around standby behavior, and aligning teams around long-term workflow implications, still depended heavily on human judgment, facilitation, and systems thinking.