Cluster health dashboard
The Cluster Overview page is the most visited surface in the CockroachDB Cloud console, yet for years nobody had defined what it was actually for. Every team added information when it seemed appropriate. The result was a page that had grown without a purpose: a junk drawer that the design team had wanted to fix for years but never had the runway to address.
When two projects arrived needing space on this page, I recognized the tipping point. Adding more to a broken foundation would make the page non-functional. I made the case to pause and turned the moment into a design-led effort to define what a Cluster Overview page is for, a question our team had been asking for years and finally had the leverage to answer.
My role - Lead Designer, set milestones, drove cross-functional alignment, led research end-to-end
Team - 3 designers, 1 UX researcher, data team, 5 PMs, engineering
Timeline - July 2025 - Dec 2025
Platform - CockroachDB Cloud
Status - Development in progress
Challenge
Advanced cluster overview (before): the most visited surface, showing only static configuration and settings data
The Cluster Overview page had grown organically since 2021 with no defined purpose. Teams added information when it seemed appropriate, and the result was different for every cluster type. Advanced clusters showed only static configuration and settings on their most visited page: no health data, and no sense of whether the cluster was running well. Basic and Standard clusters had evolved further to include some health information, but the hierarchy was wrong: too much visual weight on configuration and settings, and no compute information over time. Two cluster types, two broken pages, and no shared framework for what belonged at the overview level.
The Actions menu had grown equally unwieldy, mixing topology changes, settings, and cluster actions with no organizing logic.
In 2025, two new features were heading for this page: a resilience demo and physical cluster replication (PCR). As lead designer for PCR, I recognized that we couldn’t keep adding to a broken foundation. I got buy-in from the Design team and reframed the work: before finding space for two more features, we’d define what this page was actually for and build a decision-making model for everything that came after.
Reframing the problem
The original ask was straightforward: find the right place to add new PCR data to the existing Cluster Overview page. But I paused on that framing, because it assumes that the current page structure is sound. We’d never actually interrogated whether what was already there belonged at the overview level at all.
I recognized that our layout problem was actually an information architecture problem, and that we needed a principled framework first. I ran an object-oriented UX (OOUX) exercise to map the full object hierarchy across the console. The work was hard in a domain this complex, but revealing. We found significant navigation dead ends and child objects that were effectively invisible from the overview. This exercise helped us form an opinion on what belongs here before deciding where it goes.
Learning from research
I partnered with a UX researcher to analyze usage data and conduct user interviews. Usage data showed the Cluster Overview page had significant traffic, but that was misleading. Clicking into any sub-page (databases, networking, backups) required passing through the overview first, so high page views didn't mean anyone was actually looking at it. Traffic to the Database and SQL activity sub-pages, by contrast, did represent interest in that content.
Our hypothesis going into round 1 of user interviews was that the page should serve two jobs: give operators a glimpse of what's inside the cluster (databases, SQL activity), and surface health at a glance. We tested structural information (configuration, settings), health data (throughput, latency, compute), and nested objects (databases, open connections, last backup, active private endpoints).
Users were clear: the health data was what mattered. The nested objects were nice to have, useful in calm moments and irrelevant in urgent ones.
With this learning, we reframed the page’s purpose to focus on health for round 2. What operators needed from this page was just enough signal to understand what’s happening on the managed side and whether it requires action on theirs. We learned that for managed infrastructure, some health indicators needed to come from our system rather than from the customer’s own observability tooling (Datadog, Prometheus).
Calibrating visual weight
Based on the research, I got buy-in on the page's purpose. Now I needed to decide how much space each metric deserved. This required balancing research findings, design judgment, and the needs of two parallel projects.
Throughput and latency were unambiguous. Interview after interview, users described these as the visual embodiment of their workload, the most direct signal that things are running as expected. We combined them into a single chart to make the correlation between them easier to read, and gave it the largest footprint. It helped that a parallel project, a resilience demo feature another designer was building, also needed throughput and latency on this page. This validated the priority and made alignment easy.
Compute surprised us. I originally gave it a small footprint, treating it as secondary to throughput. But as interviews accumulated, it became clear users weighted it almost as heavily, so it moved up. Research, not assumption, made that call.
Configuration and settings were the hardest debate, mostly among the designers. Research showed users rarely touched them; they were set-it-and-forget-it. The case for moving them to the bottom of the page was logical. But they describe what the cluster is, its fundamental building blocks. Removing them from the top felt like removing the label from a piece of equipment. We kept them small, at the top, as a "what am I looking at" anchor.
Storage was straightforward: genuinely set-it-and-forget-it for most users. It earned a small footprint with clear affordance for warning states, since the number matters most when it's almost gone.
Nodes were the most delicate needle to thread. Users were curious about them, but they chose a managed service specifically so they would not have to manage nodes. Showing too much detail would create anxiety about something users can't act on. We gave nodes a medium footprint: enough to satisfy the curiosity, not enough to suggest responsibility. As with storage, having a nodes section provided clear affordance for warning states.