What are Failed Customer Interactions (FCIs) and why are they important?
Failed Customer Interactions (FCIs) are instances where a user is unable to complete their intended action due to system errors, such as 500 or 404 errors. FCIs are crucial because they directly measure the customer experience and satisfaction, going beyond traditional availability metrics to focus on whether users can successfully accomplish their goals. Reducing FCIs leads to higher customer satisfaction and improved Net Promoter Scores (NPS).
How are FCIs measured in modern cloud environments?
FCIs are measured by dividing the total number of failed interactions by the overall number of interactions, typically at the account or service level. This approach provides a count-based metric that reflects the frequency of failures, offering a more nuanced view of system reliability than traditional time-based availability metrics. Teams often use observability platforms like Datadog to track and visualize FCIs in real time.
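As a rough illustration, the count-based calculation described above can be sketched in a few lines of Python (the function name and sample figures are hypothetical):

```python
def fci_rate(failed_interactions: int, total_interactions: int) -> float:
    """Failed Customer Interaction rate: failed interactions / total interactions."""
    if total_interactions == 0:
        return 0.0
    return failed_interactions / total_interactions

# e.g. 320 failures out of 10,000 interactions is a 3.2% FCI rate
rate = fci_rate(320, 10_000)
print(f"{rate:.2%}")  # 3.20%
```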
Why are traditional availability metrics insufficient for distributed systems?
Traditional availability metrics, which focus on system uptime, were designed for monolithic architectures and do not account for the complexity of distributed systems with microservices, containers, and cloud infrastructure. These metrics may not accurately reflect the customer experience or the impact of localized failures. FCIs provide a more customer-centric and actionable measure of system health in modern architectures.
What is the industry benchmark for FCIs?
According to AWS, the industry benchmark for failed customer interactions is less than 0.025% of total transactions. For example, in a system with one million transactions, fewer than 250 should result in failures. This benchmark helps organizations gauge their performance and set improvement goals.
How do FCIs impact customer satisfaction and NPS scores?
Reducing FCIs has a direct positive correlation with customer satisfaction and Net Promoter Scores (NPS). For example, GoodHire reduced their FCI rate from 3.2% to 0.02% over three quarters, which coincided with an increase in NPS from 63 to 70, demonstrating the value of focusing on FCIs for customer experience improvement.
What types of errors typically count as FCIs?
FCIs include errors such as HTTP 500 (internal server errors), frequent 404 (not found) errors, and any other system failures that prevent users from completing their intended actions. These errors can occur on any platform and are a key indicator of system reliability from the user's perspective.
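A minimal sketch of how a team might classify responses, assuming a simple policy in which all 5xx responses and 404s count as FCIs. Which 4xx codes count is ultimately a per-product decision, so treat this as a starting point rather than a standard:

```python
def is_fci(status_code: int) -> bool:
    """Classify an HTTP response as a Failed Customer Interaction.

    Policy assumed here: all server errors (5xx) and not-found
    responses (404) count. Whether other 4xx codes count is a
    product decision, e.g. a 404 on a valid bookmarked page is
    an FCI, while a genuinely mistyped URL may not be.
    """
    if 500 <= status_code <= 599:   # internal errors, bad gateways, timeouts
        return True
    if status_code == 404:          # user could not reach intended content
        return True
    return False

responses = [200, 500, 404, 301, 503]
print(sum(is_fci(s) for s in responses))  # 3
```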
How can teams use observability platforms to track FCIs?
Teams can leverage observability platforms like Datadog to set up dashboards that monitor and visualize FCIs in real time. This enables scrum teams to track FCIs for their services, prioritize remediation, and set quarterly goals for reduction, fostering accountability and continuous improvement.
What organizational changes help reduce FCIs?
Embedding operations responsibilities within scrum teams, rather than separating them into dedicated SRE teams, empowers teams to own and address FCIs directly. This model, especially in cloud environments, ensures that cost management and operational excellence are integrated into daily workflows, leading to more effective FCI reduction.
How did GoodHire achieve a significant reduction in FCIs?
GoodHire reduced their FCI rate from 3.2% to 0.02% by making FCI reduction a quarterly goal, leveraging observability tools like Datadog, and embedding operations within scrum teams. This focus on FCIs led to improved customer satisfaction and higher NPS scores.
What role does mindset play in reducing FCIs?
Adopting a customer-centric mindset, where solving customer problems is prioritized, is essential for reducing FCIs. Teams that focus on minimizing FCIs and collaborate closely with customer success functions are more likely to achieve significant improvements in customer experience and operational excellence.
Operational & Engineering Excellence
What are the pillars of SaaS success according to the article?
The pillars of SaaS success are engineering excellence and operational excellence. Engineering excellence involves developing, testing, and deploying high-quality software, while operational excellence focuses on maintaining, managing, and optimizing software in production to ensure a seamless customer experience.
Which metrics are important for tracking engineering excellence?
Key metrics for engineering excellence include escapes to production, code coverage, defects out of SLA, and release velocity. These metrics help assess software quality, test coverage, adherence to service level agreements, and the speed of deployment.
What metrics are used to measure operational excellence?
Operational excellence is measured using metrics such as performance, availability, FCIs, and cost management. These metrics evaluate the speed, reliability, and efficiency of software, as well as the ability to deliver value to customers while optimizing resource utilization.
How does embedding operations within scrum teams improve outcomes?
Embedding operations within scrum teams gives teams end-to-end ownership of both development and operational responsibilities. This approach increases accountability, accelerates issue resolution, and ensures that operational goals like FCI reduction and cost management are prioritized alongside feature development.
Why is a customer-centric metric like FCI valuable for SaaS companies?
Customer-centric metrics like FCI provide direct insight into the user experience, helping teams identify and address issues that impact customer satisfaction. By focusing on FCIs, SaaS companies can improve usability, performance, and task completion rates, leading to higher retention and advocacy.
How can autonomous management help reduce FCIs?
Autonomous management platforms like Sedai can proactively detect and resolve issues that cause FCIs, such as scaling problems or resource constraints. By automating routine operations and optimizing system performance, these platforms help minimize failed interactions and improve overall reliability.
What is the benefit of shifting from Kubernetes to serverless infrastructure in terms of FCIs?
Transitioning from Kubernetes to serverless infrastructure, as described in the article, led to a 90% reduction in mundane operational tasks. Serverless platforms, especially when combined with autonomous management, reduce the operational burden and help teams focus on higher-value work, contributing to lower FCI rates.
How does Sedai support operational and engineering excellence?
Sedai supports operational and engineering excellence by providing autonomous optimization, proactive issue resolution, and intelligent metrics that help teams monitor and improve system performance, reduce FCIs, and streamline operations. This enables teams to deliver reliable, high-quality software with minimal manual intervention.
What practical steps can teams take to reduce FCIs?
Teams can reduce FCIs by setting clear reduction goals, using observability tools to track failures, embedding operations within development teams, prioritizing customer issues during on-call rotations, and leveraging autonomous management platforms like Sedai to automate remediation and scaling.
Sedai Platform Features & Capabilities
What is Sedai and what does it do?
Sedai is an autonomous cloud management platform that optimizes cloud operations for cost, performance, and availability using machine learning. It eliminates manual intervention, reduces cloud costs by up to 50%, improves performance by reducing latency by up to 75%, and proactively resolves issues before they impact users. Sedai supports AWS, Azure, GCP, and Kubernetes environments.
What are the key features of Sedai's platform?
Sedai offers autonomous optimization, proactive issue resolution, full-stack cloud coverage, smart SLOs, release intelligence, plug-and-play implementation, multiple modes of operation (Datapilot, Copilot, Autopilot), enhanced productivity, and safety-by-design. These features help businesses optimize costs, improve performance, and ensure reliability.
How does Sedai help reduce cloud costs?
Sedai reduces cloud costs by up to 50% through autonomous optimization, rightsizing workloads, and eliminating resource waste. Customers like Palo Alto Networks have saved $3.5 million, and KnowBe4 achieved 50% cost savings in production using Sedai.
What integrations does Sedai support?
Sedai integrates with monitoring and APM tools (Cloudwatch, Prometheus, Datadog, Azure Monitor), Kubernetes autoscalers (HPA/VPA, Karpenter), IaC and CI/CD tools (GitLab, GitHub, Bitbucket, Terraform), ITSM platforms (ServiceNow, Jira), notification tools (Slack, Microsoft Teams), and various runbook automation platforms.
What security certifications does Sedai have?
Sedai is SOC 2 certified, demonstrating adherence to stringent security and compliance standards for data protection.
How quickly can Sedai be implemented?
Sedai offers a plug-and-play implementation that takes just 5 minutes for general use cases and up to 15 minutes for scenarios like AWS Lambda. The platform connects securely to cloud accounts using IAM, with no need for agents or complex installations.
What support resources are available for Sedai users?
Sedai provides detailed technical documentation, a community Slack channel, email/phone support, and personalized onboarding sessions. Enterprise customers receive a dedicated Customer Success Manager.
What are the modes of operation in Sedai?
Sedai offers three modes of operation: Datapilot (observability), Copilot (one-click optimizations), and Autopilot (fully autonomous execution). This flexibility allows teams to choose the level of automation that fits their needs.
How does Sedai ensure safe and auditable changes?
Sedai integrates with Infrastructure as Code (IaC), IT Service Management (ITSM), and compliance workflows to ensure all changes are safe, validated, and auditable. The platform supports automatic rollbacks and incremental changes for risk-free automation.
Use Cases, Benefits & Customer Success
Who can benefit from using Sedai?
Sedai is designed for platform engineers, IT/cloud operations teams, technology leaders (CTO, CIO, VP Engineering), site reliability engineers (SREs), and FinOps professionals. It is ideal for organizations with significant cloud operations across industries such as cybersecurity, IT, financial services, healthcare, travel, and e-commerce.
What business impact can Sedai deliver?
Sedai delivers up to 50% cloud cost savings, 75% latency reduction, 6X productivity gains, and up to 50% fewer failed customer interactions. Customers like Palo Alto Networks saved $3.5 million, KnowBe4 achieved 50% cost savings, and Belcorp reduced AWS Lambda latency by 77%.
What pain points does Sedai address for cloud teams?
Sedai addresses pain points such as operational toil, cost inefficiencies, performance bottlenecks, lack of proactive issue resolution, complexity in multi-cloud environments, and misaligned priorities between engineering and FinOps teams. The platform automates routine tasks, aligns goals, and provides actionable insights for optimization.
Can you share specific customer success stories with Sedai?
Yes. KnowBe4 achieved 50% cost savings and saved $1.2 million on AWS bills. Palo Alto Networks saved $3.5 million, reduced Kubernetes costs by 46%, and saved 7,500 engineering hours. Belcorp reduced AWS Lambda latency by 77%.
What industries are represented in Sedai's case studies?
Sedai's case studies cover industries such as cybersecurity (Palo Alto Networks), IT (HP), financial services (Experian, CapitalOne Bank), security awareness training (KnowBe4), travel (Expedia), healthcare (GSK), car rental (Avis), retail/e-commerce (Belcorp), SaaS (Freshworks), and digital commerce (Campspot).
How does Sedai compare to other cloud optimization tools?
Sedai differentiates itself with 100% autonomous optimization, proactive issue resolution, application-aware intelligence, full-stack cloud coverage, release intelligence, and rapid plug-and-play implementation. Unlike competitors that rely on manual adjustments or static rules, Sedai operates autonomously and holistically, delivering measurable ROI and productivity gains.
What feedback have customers given about Sedai's ease of use?
Customers praise Sedai for its quick setup (5–15 minutes), agentless integration, personalized onboarding, comprehensive documentation, and risk-free 30-day trial. These features make adoption smooth and accessible for teams of all sizes.
FCIs Are the New Availability
Sedai
Content Writer
April 26, 2024
Note: this article is a transcript of a talk by Siddharth Ram, current CTO of Velocity Global, at autocon/22.
Introduction
This article delves into Failed Customer Interactions (FCIs), a critical metric that deserves our attention. FCIs provide valuable insight into system availability from a customer-centric standpoint: by assessing the success or failure of customer interactions, they offer a way to evaluate the performance and effectiveness of a system. The article aims to demystify FCIs by examining their definition, measurement techniques, and their crucial role in achieving operational excellence. It also explores how Sedai supports reductions in FCIs.
The Pillars of SaaS Success: Engineering and Operational Excellence
Let's talk about running a SaaS company and the importance of operational and engineering excellence. When it comes to the backbone of any SaaS company, these two aspects play a crucial role. Engineering excellence encompasses all the tasks involved in creating and deploying software into production.
When we talk about engineering excellence, it involves a variety of activities. First and foremost, you need to write the code that powers your software. This includes designing and developing the features and functionalities that make your SaaS product unique and valuable. Additionally, you need to consider the user experience and ensure that it aligns with your customers' needs and expectations. Achieving product-market fit is also a critical aspect of engineering excellence, as you want your software to address a specific market's demands effectively.
Moreover, once the software is developed, it needs to go through a continuous integration and continuous deployment (CI/CD) pipeline. This pipeline ensures that the software is thoroughly tested, reviewed, and deployed seamlessly into production. The CI/CD pipeline helps streamline the development process and ensures that any changes or updates to the software are delivered efficiently and with minimal disruptions. On the other hand, Operational Excellence focuses on what happens to your software after it has been deployed in a production environment. It involves the ongoing maintenance, management, and optimization of the software to keep it running smoothly and deliver an excellent customer experience. Operational excellence encompasses tasks such as monitoring the software's performance, addressing any issues or bugs that may arise, and making continuous improvements to enhance the overall quality and efficiency of the product.
By prioritizing operational and engineering excellence, a SaaS company can ensure that its software is not only developed with precision and attention to detail but also maintained and optimized to provide a seamless user experience. These two aspects work hand in hand to create a strong foundation for a successful SaaS business, as they contribute to the overall reliability, performance, and customer satisfaction of the software product.
Tracking Metrics for Engineering and Operational Excellence
To track these aspects, certain metrics are commonly used. For engineering excellence, metrics like escapes to production, code coverage, defects out of SLA, and release velocity are important. These metrics assess the quality, test coverage, adherence to service level agreements, and speed of software deployment.
Operational excellence metrics include performance, availability, FCI, and cost management. Performance measures the speed and efficiency of the software, availability ensures it remains accessible to users, FCI tracks whether customers can successfully complete their intended tasks, and cost management optimizes resource utilization.
Evaluating Availability and Failed Customer Interactions
Let's narrow our focus to two crucial aspects: availability and handling failed customer interactions. Starting with availability, it has long been a well-established metric in the industry. We can find a concise definition of availability, which involves subtracting the impacted time from the total time and then dividing it by the total time. This metric originated during a time when monolithic architectures dominated the technology landscape. It operated on the premise that there was a single machine connected via an Ethernet cable, and availability was determined by whether the network or the machine itself was functioning.
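In code, that classic time-based definition is a one-liner (a minimal sketch; the 30-day example figures are illustrative):

```python
def time_based_availability(total_minutes: float, impacted_minutes: float) -> float:
    """Classic availability: (total time - impacted time) / total time."""
    return (total_minutes - impacted_minutes) / total_minutes

# A 30-day month with 43 minutes of downtime is roughly "three nines"
month = 30 * 24 * 60  # 43,200 minutes
print(f"{time_based_availability(month, 43):.5f}")  # 0.99900
```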
Transitioning from Monoliths to Microservices: A Paradigm Shift in Availability Measurement
The traditional metric of availability, which measures how often a system is up and running, was designed for an era when monolithic systems were dominant. Monolithic systems are a type of software architecture where everything is interconnected and dependent on a single machine. Back then, availability was simply determined by whether that machine or its network connection was functioning.
However, in today's world, where distributed architectures are more common, this traditional metric falls short. Distributed systems involve multiple interconnected components like microservices, containers, and cloud computing. Availability in these systems is not solely dependent on a single machine or network connection but involves various factors. Therefore, it's important to recognize that the old metric of availability was better suited for monolithic systems and mainly focused on the backend system. To evaluate availability in modern architectures, alternative metrics that consider individual microservices, network communication, and other relevant factors need to be explored.
Let's delve into the world of system architectures and the challenges they bring to measuring availability. Picture your system transforming from a single dot to a multi-dimensional cube. On the X-axis, you have read replicas, distributing load and enabling concurrency. The Y-axis represents the concept of microservices, separating functionality into individual services for scalability and specialized teams. Finally, the Z-axis introduces swim lanes, scaling resiliency by segregating customers or functionality.
With this intricate system structure, calculating availability becomes complex. What if a microservice goes down, impacting a specific function? How do we measure availability in such cases? Consider the scenario where your US customers reside in one swim lane and Canadian customers in another. If the Canadian cluster experiences an outage, how should availability be measured? It's a challenging puzzle that demands careful thought.
In light of these complexities, a more modern and effective approach is to shift our perspective from time-based metrics to count-based metrics. Instead of solely focusing on the duration of outages, we consider the frequency of service failures. This count-based approach provides a more nuanced understanding of availability, allowing us to identify patterns, isolate faults, and take proactive measures. By embracing this shift, we gain greater insight into system resilience and can make targeted improvements. Rather than waiting for the entire system to be affected, we can address specific service failures promptly. Count-based metrics enable us to navigate the intricacies of modern system architectures and ensure their availability.
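To make the contrast concrete, consider a hypothetical version of the swim-lane scenario above: a one-hour outage confined to the Canadian lane barely registers in a naive fleet-wide time-based view, while count-based FCI rates expose it immediately. All figures below are illustrative:

```python
# Hypothetical traffic: US lane unaffected, Canadian lane down for 1 hour.
us_requests, us_failures = 900_000, 0
ca_requests, ca_failures = 100_000, 8_000  # every Canadian request in the window failed

# A fleet-wide time-based check can still report "up" because the US lane
# kept answering. Count-based metrics tell a different story, per lane
# and overall: failures / total requests.
overall_fci = (us_failures + ca_failures) / (us_requests + ca_requests)
ca_fci = ca_failures / ca_requests

print(f"overall FCI: {overall_fci:.2%}")   # 0.80%
print(f"Canadian-lane FCI: {ca_fci:.2%}")  # 8.00%
```

The per-lane number is what the Canadian customers actually experienced, which is exactly the fault-isolation insight the count-based approach buys you.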
Why time-based metrics aren’t good enough
Availability doesn't always reflect the severity of an incident accurately, nor does it measure the customer experience. It also lacks clear ownership, as different teams may be responsible for implementing changes. Additionally, availability doesn't capture task completion rate, which is crucial for assessing if customers can easily accomplish their goals on a webpage.
To truly improve customer experience, we need to go beyond availability metrics. Factors like usability, performance, and task completion rates should be considered for a holistic understanding. Remember, the impact on customers doesn't always align with the severity of an incident. So, relying solely on availability can lead to a skewed understanding of the user experience. By considering factors like usability, performance, and task completion rates, we can gain a more comprehensive view of the customer journey and make informed decisions to enhance their overall satisfaction.
Customer-Centric Metric: Measuring and Enhancing the Customer Experience for Success
It is imperative that we redirect our attention towards a customer-centric metric that not only gauges but also enhances the customer experience and fosters team ownership. Such a metric also has a direct impact on NPS scores, as it measures the happiness and satisfaction of our customers. Surprisingly, a metric like this already exists in the industry, although it is not as widely utilized as it should be.
What is a Failed Customer Interaction?
Understanding the concept of a failed customer interaction is straightforward. It involves calculating the total number of unsuccessful interactions and dividing it by the overall number of interactions. This measurement focuses on the account level rather than considering time as a factor. However, it is possible to incorporate time-based analysis to determine the occurrence rate of these incidents within a given period.
What exactly constitutes a Failed Customer Interaction (FCI)? Let's delve into it. FCIs encompass instances such as encountering 500 errors or frequent occurrences of 404 errors. These errors, which many of us have encountered ourselves, can happen on various platforms, including popular ones like Google, Amazon, and even Bank of America (which seems to be a particular subject of disdain for someone!). FCIs are an inherent part of any system that serves a considerable user base, as there will inevitably be situations where users unintentionally trigger actions leading to failed interactions, resulting in internal server errors.
To illustrate this, let me share a personal experience. Just a couple of weeks ago, while accessing my credit card bills on chase.com, I was taken aback to encounter an unhandled Apache error: an unexpected 500. It was surprising to witness such a flaw at a well-established company like Chase. However, the incident emphasizes the inherent complexity of software and the likelihood of occasional glitches. Whenever a 500 error occurs within the system, tracing it back to the browser reveals a frustrated customer who was unable to accomplish their intended task.
Elevating Customer Experience: Unveiling FCIs Through Observability Platforms
Let me share an important insight, a secret that deserves widespread attention, perhaps even displayed on a prominent billboard along Highway 101. The observability platform you already have at your disposal can measure FCIs effortlessly. Surprisingly, it is already equipped to do so; we just haven't been paying enough attention to it. To give an example from my experience at GoodHire (Inflection), where we utilized Datadog: in a matter of minutes, we set up a dashboard that provided us with valuable insights into our failed customer interaction rate. Unfortunately, there aren't many established benchmarks available for comparison in this area. However, through conversations with colleagues and friends at AWS, I discovered that their benchmark for failed customer interactions is less than 0.025% of the total transactions within the system. To put this into perspective, if you have a million transactions, according to Amazon, the number of failures should be fewer than 250. This benchmark serves as helpful context for gauging your own performance.
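That benchmark translates directly into a budget of allowable failures. A quick sketch (the helper names are ours, not AWS's):

```python
AWS_FCI_BENCHMARK = 0.00025  # 0.025% of total transactions

def fci_budget(total_transactions: int) -> int:
    """Maximum failures allowed while staying under the benchmark."""
    return round(total_transactions * AWS_FCI_BENCHMARK)

def meets_benchmark(failures: int, total: int) -> bool:
    return failures / total < AWS_FCI_BENCHMARK

print(fci_budget(1_000_000))            # 250 failures allowed per million
print(meets_benchmark(200, 1_000_000))  # True
print(meets_benchmark(320, 10_000))     # False: a 3.2% rate is far outside it
```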
GoodHire's Experience Reducing FCIs
Let me present a compelling case study from my previous company. At the outset, our failed customer interaction (FCI) rate stood at 3.2%. Considering my extensive experience in system analysis, I must emphasize that actively monitoring and addressing these numbers is crucial. Starting at 3.2% is actually quite commendable. Subsequently, we incorporated FCI reduction as a key component of our quarterly plan, aiming for a target of 0.025%. Remarkably, we surpassed that goal, reaching an impressive 0.02% FCI rate after nearly three quarters of concerted effort. The team didn't stop there; they went above and beyond, diligently working towards rendering FCIs virtually undetectable within our system.
Now, let's explore the value derived from this endeavor. The true worth lies in the Net Promoter Scores (NPS) – an essential metric for assessing customer satisfaction. While I cannot establish a direct causation, it is worth noting the positive correlation between our FCI reduction efforts and the NPS scores of our product. Over time, we witnessed a notable increase in NPS scores, progressing from 63 to 68 and eventually reaching a remarkable 70. This remarkable improvement further validates the importance of investing efforts into reducing FCIs and ultimately enhancing customer satisfaction.
Implementing FCIs
Naturally, there were other contributing factors that influenced the outcomes and propelled progress. However, these findings align with my expectations when customers can effectively utilize the system, perform their tasks seamlessly, and experience satisfactory system performance. It's no surprise that such improvements lead to a higher NPS score.
Let's explore the practical implementation of FCIs. A significant aspect of our success was the adoption of Lambda, which allowed us to easily deploy Sedai on top of it. Moreover, once the system was up and running, our SRE team could divert their attention elsewhere as Lambda handled the operational aspects smoothly. However, it's important to note that 500 errors can stem from various causes. Scaling issues, where inadequate resources lead to dropped requests or timeouts, can be problematic. In our case, autonomous management techniques helped us make substantial progress in addressing these issues. It's worth mentioning that when running C# and .NET stacks on Lambda, we encountered the highest latency and struggled with cold starts, often exceeding 15 seconds. Clearly, expecting customers to wait that long is a significant problem that needs attention.
Furthermore, mindset played a crucial role. It became a priority to emphasize that solving customer problems was our core objective. As part of this commitment, we set a goal to minimize FCIs, and we successfully achieved that. For more insights on bridging the customer-engineering gap, I have provided a detailed write-up on cio.com that you may find interesting. Additionally, implementing an on-call program proved beneficial. Instead of engineers focusing solely on their backlog items during their on-call week, their responsibility shifted towards collaborating with customer success teams. They actively worked on understanding customer concerns, investigating error logs through tools like Datadog or Splunk, and resolving issues promptly. This change in approach ensured a customer-centric mindset and drove continuous improvement.
Driving Success: How Sedai's Autonomous System Transformed Operations and Delighted Customers
Lastly, let's delve into how we used Sedai in our operations. Our implementation of Sedai was fully autonomous. While we didn't have access to the remarkable Lambda extensions that could have further mitigated cold starts, we managed to address the minor cold start challenges by leveraging provisioned concurrency. As a result, scaling issues, often associated with errors, were significantly reduced. The majority of the errors we encountered stemmed from coding-related matters, such as overlooked code paths.
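For readers who want to try the same mitigation, here is a hedged sketch of setting provisioned concurrency through Lambda's PutProvisionedConcurrencyConfig API. The function name, alias, and concurrency level are hypothetical, and the boto3 call is shown commented out because it requires live AWS credentials:

```python
def provisioned_concurrency_config(function_name: str, alias: str, executions: int) -> dict:
    """Build the parameters for Lambda's PutProvisionedConcurrencyConfig API."""
    return {
        "FunctionName": function_name,
        "Qualifier": alias,  # provisioned concurrency attaches to a version or alias
        "ProvisionedConcurrentExecutions": executions,
    }

# Hypothetical function and alias; tune the level to your traffic.
params = provisioned_concurrency_config("billing-api", "live", 10)

# With boto3 installed and AWS credentials configured, apply it with:
# import boto3
# boto3.client("lambda").put_provisioned_concurrency_config(**params)
print(params["ProvisionedConcurrentExecutions"])  # 10
```

Provisioned concurrency keeps a pool of initialized execution environments warm, which is what removes the multi-second cold-start penalty described above, at the cost of paying for the idle pool.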
Consequently, the need for a dedicated team of SREs focused solely on serverless operations diminished. We had only two SREs handling the remaining components within the monolith, while the serverless architecture required minimal attention. Weekly reviews of Sedai's performance and dashboards ensured efficient vertical and horizontal scaling. Sedai's release intelligence metrics were particularly intriguing, alerting us to any irregularities or concerns that warranted immediate attention. This empowered us to make well-informed decisions about release deployments.
On the serverless side of our workload, mundane tasks dropped by approximately 90%. This progress coincided with our move away from Kubernetes and complete adoption of serverless infrastructure, a strategic shift that brought significant gains in our operations and overall productivity.
To summarize, FCIs play a vital role in today's landscape. The rise of autonomous management, regardless of infrastructure choices, signifies a progressive approach. By prioritizing failed customer interactions and implementing autonomous strategies, businesses can unlock remarkable enhancements in customer satisfaction and overall experience.
Q&A
Q: Could you provide an overview of how FCIs are defined and measured in terms of customer experience at a single step level?
A: Each microservice is owned by a scrum team, and they are responsible for tracking FCIs for their respective services. Each scrum team has a dedicated dashboard showing the number of FCIs they've had in a given week. Based on their priorities and plans for the next iteration, the team decides which FCIs to focus on and address. As a leader, I set quarterly goals for reducing FCIs, and everyone works towards achieving those targets. Datadog played a significant role in helping us effectively manage FCIs.
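The per-team accounting described here can be approximated with a few lines (the team names, log format, and FCI policy below are hypothetical; in practice an observability platform such as Datadog does this aggregation for you):

```python
from collections import Counter

# Hypothetical request log: (owning scrum team, HTTP status)
requests = [
    ("payments", 200), ("payments", 500), ("payments", 200),
    ("search", 404), ("search", 200), ("search", 200), ("search", 503),
]

def weekly_fcis(log):
    """Count Failed Customer Interactions (here: 5xx or 404) per owning team."""
    fcis = Counter()
    for team, status in log:
        if status >= 500 or status == 404:
            fcis[team] += 1
    return fcis

print(weekly_fcis(requests))  # Counter({'search': 2, 'payments': 1})
```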
Q: You mentioned a 90% reduction in busy work. How was the saved time and resources reinvested?
A: Some of our Kubernetes specialists, realizing that Kubernetes wasn't the right fit for our needs, chose to pursue other opportunities, and we supported them in finding new roles. Instead of filling those positions, we hired additional development engineers for our scrum teams. I believe in empowering scrum teams with end-to-end ownership, including operations, and we embedded operations responsibilities within the teams. This allowed us to utilize the saved resources in enhancing team capabilities and overall efficiency.
Q: So, a shift towards embedding operations within the scrum teams?
A: Yes, precisely. Embedding operations within scrum teams is the right model, especially in a cloud environment. Operations, including aspects like cost management, became the responsibility of the scrum teams.