
Hitachi AIOps


Snapshot

  • Company / Product: Hitachi Vantara
  • Timeline: 2019-2021
  • Role: UX Architect
  • Scope: Strategy, Discovery, UX, alignment, execution
  • Team: PM, Engineering, Data Science, Design

The Problem

Business Context

Hitachi Vantara's core business had traditionally been on-premises enterprise storage. As cloud adoption accelerated, most large enterprises shifted toward hybrid environments rather than migrating fully to the cloud. This created operational complexity that existing monitoring tools couldn't handle. Hitachi saw an opportunity to extend its storage monitoring capabilities and build an AIOps platform to help enterprises operate and manage risk across hybrid systems.

User Context

  • Excessive alert noise: Modern distributed systems generate high volumes of alerts across infrastructure, applications, and services, making it difficult for operators to identify meaningful signals.
  • Reactive operations: Monitoring tools primarily surfaced issues after failures occurred, forcing teams into reactive incident response rather than proactive risk management.
  • Fragmented system visibility: Critical signals were spread across multiple tools and teams, requiring manual correlation during incidents and slowing root-cause identification.
  • Ambiguous problem ownership: Alerts often lacked context around impact and causality, making it unclear which team should act and what action would meaningfully reduce risk.

Reframing the Problem

The problem was initially framed as using AI to reduce alert noise and identify actionable alerts. However, reducing alerts alone did not improve operator effectiveness—teams still wanted visibility into all signals and lacked trust in AI-driven alert decisions. As alert logic shifted from user-defined rules to opaque algorithms, confidence eroded further.

We reframed the problem around operator confidence, focusing on explaining how actionable situations were formed, how non-actionable alerts related to them, and enabling feedback-driven retraining. This shifted priorities from alert optimization to correlation, context, and transparency, applying a systems-thinking lens to human decision-making in complex operational environments.

Key Insight

The real problem wasn't too many alerts—it was lack of confidence in deciding which alerts mattered and why. Operators needed to understand the system's reasoning, not just trust its conclusions.

The Solution

Rather than just reducing alert volume, we built a system that helped operators understand, validate, and trust AI-generated insights through five key experience moments:

1. Understanding Why a Situation Exists

The change: Instead of surfacing isolated alerts, the system presented a chronological timeline showing how a situation evolved over time, with clear causal relationships between events.

The impact: Operators could see how signals accumulated ("disk filled → memory pressure → app crash") rather than trusting a black-box AI decision. This reduced skepticism and helped operators quickly validate whether an issue required action.

Situation overview showing summary card and chronological timeline with causal annotations.
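The causal timeline described above can be sketched as a simple data model: each event optionally points back to the upstream event that caused it, and rendering is just a chronological sort plus inline causal annotations. This is an illustrative sketch, not the product's actual schema; all names and fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TimelineEvent:
    """One signal in a situation's chronological timeline."""
    timestamp: datetime
    source: str                      # component that emitted the signal
    message: str
    caused_by: Optional[str] = None  # upstream event, if a causal link is known

def render_timeline(events):
    """Sort events chronologically and annotate causal relationships inline."""
    lines = []
    for e in sorted(events, key=lambda e: e.timestamp):
        causal = f"  <- caused by: {e.caused_by}" if e.caused_by else ""
        lines.append(f"{e.timestamp:%H:%M}  {e.source}: {e.message}{causal}")
    return lines

# The "disk filled -> memory pressure -> app crash" cascade from the text:
events = [
    TimelineEvent(datetime(2021, 3, 1, 9, 14), "app-pod", "app crashed",
                  caused_by="memory pressure"),
    TimelineEvent(datetime(2021, 3, 1, 9, 2), "host-1", "disk filled"),
    TimelineEvent(datetime(2021, 3, 1, 9, 8), "host-1", "memory pressure",
                  caused_by="disk filled"),
]
timeline = render_timeline(events)
```

Even out of order in the raw feed, the events render as a scannable cause-and-effect sequence, which is what made the timeline validate-at-a-glance for operators.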

2. Making Relationships Visible Through Topology

The change: Each situation included a scoped topological view showing only the components involved—hosts, pods, containers, services—and how they were related and impacted.

The impact: Operators could reason about blast radius and dependency chains ("this pod hosts 3 containers, 2 are failing, feeding this service") instead of manually correlating resources across multiple tools. This accelerated root-cause analysis in complex distributed systems.

Topology view showing clean component relationships vs impacted view with alerts toggled on.
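Situation-scoped topology amounts to extracting a small subgraph from the full infrastructure graph: start from the alerting components, then pull in their dependents and dependencies while hiding everything unrelated. A minimal sketch under assumed data shapes (adjacency lists of hypothetical component names, not the real discovery model):

```python
# Full topology as adjacency lists: component -> components it depends on.
# All names are hypothetical; a real system builds this from discovery data.
TOPOLOGY = {
    "api-service": ["pod-a"],
    "pod-a": ["container-1", "container-2", "container-3"],
    "pod-b": ["container-4"],
    "container-1": [], "container-2": [], "container-3": [], "container-4": [],
}

def scope_to_situation(topology, involved):
    """Return only the components connected to the alerting set,
    hiding the rest of the infrastructure graph."""
    scoped = set(involved)
    changed = True
    while changed:
        changed = False
        for node, deps in topology.items():
            if node in scoped:
                # include children of already-included nodes
                for d in deps:
                    if d not in scoped:
                        scoped.add(d)
                        changed = True
            elif any(d in scoped for d in deps):
                # include parents of involved nodes
                scoped.add(node)
                changed = True
    return {n: [d for d in topology[n] if d in scoped] for n in scoped}

# Two containers are failing; the scoped view shows their pod, its third
# container, and the service above it -- pod-b stays hidden.
view = scope_to_situation(TOPOLOGY, {"container-1", "container-2"})
```

This mirrors the blast-radius reasoning in the text: the pod with 3 containers (2 failing) and the service it feeds are visible, while unrelated infrastructure never enters the view.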

3. Reducing Noise Through Situation-Based Grouping

The change: Repetitive alerts were automatically grouped into situations, with clear indicators showing how many individual alerts contributed to each pattern.

The impact: This made large volumes of alerts scannable and meaningful, especially during high-noise incidents. Operators could focus on 5 situations instead of triaging 200 individual alerts, which prevented alert fatigue and cut time spent on repetitive triage.

Comparison of ungrouped alerts view vs situation-grouped view.
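At its simplest, situation-based grouping collapses repeated alerts under a shared correlation key while keeping a count of how many raw alerts each situation absorbed. A hedged sketch, keying on resource + check type (the real correlation logic was AI-driven and richer than this; field names are illustrative):

```python
from collections import defaultdict
from datetime import datetime

# A noisy raw alert stream: 20 repeats of one problem, 5 of another.
# Field names are illustrative, not the product's actual schema.
alerts = [
    {"resource": "host-1", "check": "disk_full", "time": datetime(2021, 3, 1, 9, m)}
    for m in range(0, 20)
] + [
    {"resource": "db-2", "check": "latency", "time": datetime(2021, 3, 1, 9, m)}
    for m in range(5, 10)
]

def group_into_situations(alerts):
    """Collapse repeated alerts into situations keyed by resource + check,
    tracking how many raw alerts each situation absorbed."""
    groups = defaultdict(list)
    for a in sorted(alerts, key=lambda a: a["time"]):
        groups[(a["resource"], a["check"])].append(a)
    return [
        {"key": key, "alert_count": len(group),
         "first_seen": group[0]["time"], "last_seen": group[-1]["time"]}
        for key, group in groups.items()
    ]

# 25 raw alerts collapse into 2 situations, each with a visible alert count.
sits = group_into_situations(alerts)
```

The per-situation `alert_count` is what makes the grouping trustworthy: operators can always see how much noise a situation absorbed rather than wondering what was hidden.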

4. Closing the Loop With Human Feedback

The change: Operators could provide instant feedback—marking false positives, incorrect groupings, or validating root causes—through single-click inline actions directly within the situation view.

The impact: Reinforced trust by making operators active participants in improving the system, not passive recipients of AI decisions. The zero-friction interaction (no forms, no context switching) led to ~90% feedback engagement, enabling continuous learning without requiring manual alert rule maintenance.

Situation detail view showing inline feedback actions and acknowledgment workflow.
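The one-click feedback interaction implies a deliberately tiny event payload: a situation id, one of a fixed set of actions, and a timestamp, with no free text to fill in. A sketch of what such an event might look like (action names and fields are hypothetical, not the shipped API):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# The one-click actions from the situation view; names are illustrative.
FEEDBACK_ACTIONS = {"false_positive", "correct_grouping", "wrong_root_cause"}

@dataclass
class FeedbackEvent:
    situation_id: str
    action: str
    operator: str
    recorded_at: str

def record_feedback(situation_id: str, action: str, operator: str) -> dict:
    """Validate and package a one-click feedback event. The signal is
    deliberately coarse-grained -- no forms, no free text -- so the whole
    interaction stays under a couple of seconds."""
    if action not in FEEDBACK_ACTIONS:
        raise ValueError(f"unknown feedback action: {action}")
    event = FeedbackEvent(
        situation_id=situation_id,
        action=action,
        operator=operator,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
    return asdict(event)  # ready to enqueue for asynchronous retraining

payload = record_feedback("sit-42", "false_positive", "op-7")
```

Keeping the payload this small is the design trade-off described later: simple binary signals that operators actually send beat rich feedback they never would.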

5. Scanning System Health at a Glance

The change: A customizable dashboard provided a 10,000-ft view of overall system health, active situations by severity, and impacted resources across the hybrid environment.

The impact: Operators managing large, complex environments could quickly assess risk and prioritize attention without diving into individual alerts, supporting rapid decision-making when time and attention were limited.

Dashboard overview showing system health, active situations, and impacted resources.
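Under the hood, a dashboard like this is a rollup over the active situations: counts by severity plus the total impacted-resource footprint. A minimal sketch, assuming a hypothetical situation shape:

```python
from collections import Counter

# Active situations as the dashboard might receive them; fields illustrative.
active = [
    {"id": "sit-1", "severity": "critical", "impacted": 12},
    {"id": "sit-2", "severity": "warning",  "impacted": 3},
    {"id": "sit-3", "severity": "critical", "impacted": 7},
    {"id": "sit-4", "severity": "info",     "impacted": 1},
]

def health_summary(situations):
    """Roll active situations up into the at-a-glance numbers the
    dashboard tiles show: counts by severity and impacted resources."""
    return {
        "by_severity": dict(Counter(s["severity"] for s in situations)),
        "impacted_resources": sum(s["impacted"] for s in situations),
        "total_situations": len(situations),
    }

summary = health_summary(active)
```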

Design Process & Key Decisions

Here's how I approached three critical design challenges, the alternatives I explored, and why specific solutions won out.

Decision 1: Designing the Situation Timeline

The Challenge

Operators needed to understand why the AI grouped certain alerts into an actionable situation. Without this understanding, they defaulted to ignoring AI recommendations and manually triaging individual alerts—defeating the purpose of the system.

What I Explored

  • Confidence scores + ranking: Rejected. Operators found numeric confidence scores meaningless without context. "78% confident" doesn't explain why or help validate the decision.
  • Correlation graph (network diagram): Rejected. While technically comprehensive, it required too much cognitive effort during high-pressure incidents. Operators struggled to quickly extract "what happened first" from a web of nodes and edges.
  • Chronological timeline with causal annotations: Selected. Matched operators' mental model of incident progression and made cause-and-effect relationships immediately scannable.

Why the Timeline Worked

  • Matched mental models: Operators naturally think about incidents as "what happened first, then what broke." The timeline made this explicit.
  • Made causality visible: Visual connectors and annotations showed relationships without requiring graph literacy. Operators could quickly validate "yes, disk filled before app crashed."
  • Enabled rapid validation: Operators could scan the timeline in seconds to confirm or reject the AI's reasoning, building trust through transparency.

How I Validated It

I tested paper prototypes with 5 operators during concept validation sessions, using real alert data from their environments. The key insight emerged when operators repeatedly asked to "replay" the incident rather than just see a summary. This reinforced that the timeline needed to support investigation, not just presentation.

Based on feedback, I iterated on visual hierarchy—reducing the prominence of timestamps and increasing the weight of causal annotations, since operators cared more about why than when.

Trade-offs I Made

  • Sacrificed completeness for clarity: Not every alert could fit in the timeline view. I prioritized the causal path (root cause → cascading failures) over showing every related alert. Operators could drill into the full alert list if needed.
  • Linear time over topology: The timeline became the primary view, with topology relegated to a secondary tab. This was deliberate—operators needed to understand sequence before exploring structure.

Evolution of the timeline from a timestamp-heavy design to a causal-annotation-heavy design.

Decision 2: Designing the Topology View

The Challenge

In distributed systems, understanding relationships between components is critical for root-cause analysis. Operators needed to see how a failing container impacted a pod, which service it belonged to, and what downstream dependencies were affected. However, full topology graphs quickly become overwhelming in complex environments with hundreds of nodes.

What I Explored

  • Full infrastructure graph: Rejected. While comprehensive, it created visual noise and made it impossible to focus on what mattered for the current situation.
  • Situation-scoped topology: Selected. Only showed components directly involved in the current situation, with the ability to expand related nodes on demand.
  • Alert overlay toggle: Added as a refinement. Allowed operators to see the "clean" topology first, then toggle alerts on to understand blast radius.

Why This Approach Worked

  • Reduced cognitive load: By scoping to the situation, operators could immediately see what was impacted without manually filtering out irrelevant nodes.
  • Progressive disclosure: The toggle between "topology only" and "topology + alerts" allowed operators to first understand structure, then layer on problems.
  • Supported root-cause reasoning: Operators could trace dependency chains visually: "This pod hosts 3 containers, 2 are failing, they feed this service, which is why the API is down."

Trade-offs I Made

  • Limited exploration depth: Engineering wanted full interactive exploration (zoom, filter, search). I deprioritized this for MVP to preserve clarity. Operators needed to understand this situation's blast radius, not explore the entire infrastructure.
  • Topology as secondary view: Based on validation, I positioned topology as a secondary tab behind the timeline. Operators needed sequence first, structure second.

Decision 3: Designing the Feedback Loop

The Challenge

Even with transparent reasoning, operators remained skeptical of AI decisions unless they could influence the system. However, traditional ML retraining requires data science expertise and long iteration cycles. We needed a feedback mechanism that felt immediate and actionable to operators, even if the actual model updates happened asynchronously.

What I Explored

  • Post-incident survey: Rejected. Operators hated breaking flow to fill out forms. Completion rates in testing were < 20%.
  • Inline feedback actions: Selected. Embedded quick actions ("False positive", "Correct grouping", "Wrong root cause") directly into the situation view, requiring a single click.
  • Integration with existing workbench: Added as a critical enhancement. Worked with PM and engineering to allow operators to surface situations the AI missed by tagging alerts in their existing workflow.

Why This Approach Worked

  • Zero-friction interaction: One-click actions meant operators could provide feedback in under 2 seconds without leaving context.
  • Psychological ownership: Operators felt like active participants in improving the system, not passive recipients of AI decisions. This fundamentally shifted trust dynamics.
  • Closed the loop: By integrating with the workbench, operators could teach the system about situations it missed, not just validate what it found.

How I Validated It

During concept testing, I showed operators mockups with different feedback affordances. The critical insight came when one operator said: "I don't mind giving feedback if it doesn't slow me down, but if you make me fill out a form during an incident, I'll never use this."

This led to the inline action design, which we validated by having operators walk through realistic scenarios using clickable prototypes. Completion rates jumped to ~85% in testing.

Trade-offs I Made

  • Limited feedback granularity: Data science wanted detailed feedback (which specific alerts were wrong, why, etc.). I pushed for simple binary signals to preserve usability. We could add richness later if operators actually used the simple version.
  • Async model updates: Operators wanted to see immediate changes, but ML retraining takes time. I designed clear messaging around "Your feedback will improve future situations" to set correct expectations without breaking trust.

Feedback flow showing inline actions and workbench integration.

Strategy & Approach

Guiding Principles

  • Operator confidence over automation: AI should support human judgment, not replace it. Operators needed to understand why a situation was surfaced in order to trust and act on it.
  • Transparency enables learning: The system had to make relationships between signals visible and allow operators to provide feedback, enabling continuous improvement over time.

My Role

I led end-to-end product discovery and UX strategy for the AIOps initiative, working closely with product, engineering, and data science teams.

Key responsibilities included:

  • Discovery research: Defined the problem space through stakeholder interviews, customer interviews with operators and SREs, and UX workshops that mapped incident workflows and surfaced systemic operational challenges across enterprise environments.
  • UX and product strategy: Reframed the problem from alert monitoring to predictive risk management, shaping guiding principles and experience direction for the platform.
  • MVP definition and validation: Partnered with PM and engineering to prioritize high-leverage capabilities for the initial release, balancing feasibility, data readiness, and user impact, and validated key assumptions through concept testing with operators to ensure the solution reduced noise and improved decision clarity.

Research & Synthesis

  • Research inputs: Synthesized insights from stakeholder sessions, UX workshops, and customer interviews with operators and SREs, alongside analysis of existing alerting and incident workflows.
  • Synthesis and framing: Mapped end-to-end operational workflows to identify breakdowns in signal interpretation, trust, and ownership, and used these patterns to define high-impact opportunity areas.
  • Decision-led prioritization: Evaluated concepts based on their ability to increase operator confidence, explain system behavior, and scale across hybrid environments—intentionally prioritizing correlation and context over surface-level alert management or operator workbench optimization.

Leadership & Influence

The team was broadly aligned on direction, but key decisions remained open: how operators should influence the system, how much of the existing workbench to integrate, and how AI-generated alerts should be presented and understood.

Cross-Functional Alignment

  • Integrating with existing workflows: While the initial plan focused on collecting feedback only through direct user input, I worked with PM, engineering, and data science to integrate feedback into the existing operator workbench. This ensured operators could surface situations the AI missed and allowed the product to fit naturally into established observability workflows.
  • Balancing scope and operator value: Engineering proposed adding workbench-level metrics (e.g., MTTR, MTTA) into the MVP. Based on validation interviews and concept testing, I advocated for deprioritizing these metrics, as they did not meaningfully improve operator confidence or decision-making at this stage.
  • Improving interpretability of AI outputs: Early alerts generated by the data science team were technically accurate but generic. I partnered closely with engineering and DS to refine alert copy and structure, ensuring operators could quickly understand why a situation mattered rather than just what was triggered.

Influence Through Artifacts

I used lightweight but concrete artifacts to align teams and drive shared understanding:

  • Journey maps to validate end-to-end flows and information architecture early on.
  • Concept walkthroughs using real engineering data to ground discussions with PM, engineering, and data science in actual system behavior.
  • Design validation sessions to bring operator feedback directly into cross-functional decision-making.

Navigating Constraints and Compromise

A recurring constraint was pressure to limit MVP scope for faster GTM. Rather than expanding surface-level features, I worked with product, engineering, and data science to make deliberate trade-offs—reducing the scope of topology exploration while ensuring the remaining views clearly explained relationships and root causes.

In parallel, we invested in improving copy, feedback mechanisms, and explainability to preserve operator trust and system clarity within the constrained scope.

Impact & Results

While long-term GA metrics were still evolving, Early Access with a managed services partner provided strong validation of the product direction and experience strategy.

Operational Outcomes (Early Access)

  • Faster incident response: Correlated situations with clear context and root-cause signals led to significant reductions in MTTR and MTTA within existing IT workbenches.
  • Meaningful noise reduction: Situation-based grouping reduced alert noise substantially, allowing operators to focus on emerging risk rather than repetitive triage.

Behavioral Impact

  • More effective escalations: Operators escalated fewer tickets, and when they did, shared concise, actionable context instead of raw alerts or log dumps.
  • Increased operator confidence: Explainable situations enabled operators to make decisions with greater confidence before involving engineering teams.
  • Reduced manual alert maintenance: Engineers relied more on AI-driven detection, reducing the effort spent creating and maintaining manual alerts while retaining oversight.

Trust, Engagement, and Adoption Signals

  • High engagement with feedback loops: ~90% of root-cause analyses received operator feedback, reinforcing trust in the system and enabling continuous improvement.
  • Validation at managed-services scale: Rean Cloud expanded Early Access assessments across multiple customers, confirming the system's ability to support multi-tenant operational environments.
  • Signals of deeper adoption: Operators requested integrations with runbooks, indicating confidence in the system and intent to embed it more deeply into workflows.

Lasting Product Impact

  • Transferable approach: Rean Cloud requested a similar AIOps setup for Hitachi's storage product, validating the broader applicability of the solution.
  • Shared language and mental models: Concepts such as situations, context, and operator confidence became shared vocabulary across product, engineering, and data science teams, shaping future problem framing and evaluation.

Reflections & What I Learned

My Role

I led problem framing, UX and product strategy, and key MVP decisions for the AIOps initiative. My focus was aligning product, engineering, and data science around a shared understanding of operator confidence, explainability, and system-level thinking under real-world constraints.

Collaboration

I worked closely with PMs, engineering, and data science to balance scope, feasibility, and user value—making deliberate trade-offs to preserve clarity and trust while moving quickly toward Early Access.

What I Learned

This project reinforced that meaningful AI systems are built through compromise, not optimization alone. Reducing scope in some areas (like deep topology exploration) while investing in explainability, language, and feedback loops proved far more impactful than adding features—especially when designing for trust in complex operational environments.

The most important lesson: operators don't need perfect AI—they need AI they can understand, validate, and correct. Trust comes from transparency and participation, not accuracy alone.