Every SRE has heard this at least once in their career: trim the flab, cut the bill, but don’t let performance slip.
That’s easier said than done, especially when you’re dealing with Google Compute Engine (GCE) VMs & persistent disks. Optimization within Google Cloud Platform (GCP) is notoriously complex and can cause operational havoc, and I’ve seen it firsthand.
My GCP chaos started as a routine GCE optimization task on a Friday night that quickly became a lesson in humility. From that experience, I’ll share:
- Why GCP VM & Disk Optimization Is Operationally Risky
- Why VM & Disk Optimization in GCP Is So Complex
- How to Manually Optimize GCP
Why GCP VM & Disk Optimization Is Operationally Risky
When my team first ran into GCP’s unpredictability, we were targeting some obvious cost savings by right-sizing an underutilized VM. On paper, it was the safest change imaginable: resize, restart, and pack my bag for the weekend.
Instead, it kicked off a six-hour firefighting session. The VM never came back up because the persistent disk didn’t reattach cleanly, tripping over an obscure regional lock that none of us had seen before.
What followed was half a night spent restoring the disks from snapshots, validating data integrity, and explaining to stakeholders why the maintenance window had to be extended.
By the time the dust settled, my batteries were drained and my weekend was shot.
We all know this isn’t an edge case. SREs and cloud infrastructure engineers are constantly asked to optimize for cost while preserving performance and stability — the core challenge of GCP cost optimization.
But changes to GCP’s VMs & disks raise the risk of downtime enough that many teams choose to delay optimization altogether.
It results in a familiar paradox: the cost of a potential outage feels higher than the cost of inefficiency.
Why VM & Disk Optimization in GCP Is So Complex
When you rightsize GCE VMs and tune their block storage, there’s no single API call that reconfigures the environment. Optimization is an orchestrated series of events, each with the potential for failure.
The issue is simple but dangerous if not understood correctly: GCP VM instances & storage are separate resources, but they behave like a single system. So a change to one can impact the other.
To change a VM’s instance type or modify disk properties like size or type, you are almost always required to:
- Shut down the VM
- Detach the disk
- Apply the change
- Reattach the disk
- Bring the VM back up
In theory, this sequence is straightforward. In reality, it’s orchestrated with the precision of a circus clown fired out of a cannon, hoping to land on a trampoline.
Between shutdown and startup, you can face:
- Mount misconfigurations
- Quota limits
- Regional locks that refuse to clear
- Stale disk attachments
- Replica sync delays
- Mysteriously “busy” cloud APIs
No matter how many times you’ve done this before, one truth remains constant: With GCP, anything can go wrong at any time.
But, there are safe ways to optimize GCP and reduce your risk of downtime.
How to Manually Optimize GCP
I’ve worked (and fought) with GCP long enough to have developed a safe optimization process that avoids breaking anything fragile. It’s manual, but I’ve found it effective at reducing downtime risk.
Assess the Fleet & Identify Optimization Candidates
To find the best optimization candidates, collect Cloud Monitoring data over an observation window of at least 14 days, ideally 30. Use this data to:
- Understand how workloads behave within normal operating limits and seasonal patterns.
- Identify VMs that are consistently underutilized across CPU, memory, and disk resources.
At the same time, analyze disk performance by comparing actual IOPS and throughput usage against the disk’s provisioned capacity. This helps identify over-provisioned storage.
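One low-risk way to surface candidates is GCP’s built-in recommenders. Below is a minimal sketch, assuming a hypothetical project my-project and zone us-central1-a; the recommender IDs are Google’s standard ones, but check that the Recommender API is enabled and you have the right IAM roles before relying on the output:

```
# Inventory the VMs in scope (name, zone, machine type, status).
gcloud compute instances list --project=my-project

# Machine-type (rightsizing) recommendations for a zone.
gcloud recommender recommendations list \
  --project=my-project \
  --location=us-central1-a \
  --recommender=google.compute.instance.MachineTypeRecommender

# Idle persistent disk recommendations (over-provisioned or unattached disks).
gcloud recommender recommendations list \
  --project=my-project \
  --location=us-central1-a \
  --recommender=google.compute.disk.IdleResourceRecommender
```

Treat the recommendations as input to your own analysis rather than a to-do list; they don’t know about seasonal peaks you haven’t observed yet.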
VM Rightsizing
To start rightsizing, pick the right time to make the change. Validate the potential downtime against real workload signals and coordinate with stakeholders outside of engineering to agree on a maintenance window. Traffic telemetry or CPU utilization data can confirm when a window is genuinely quiet.
For the sake of this article, we’ll focus on a single-instance VM application. However, even applications with multiple VMs behind a load balancer benefit from a defined maintenance window.
Next, select a target machine type (SKU) that closely matches the workload’s required capacity with minimal waste. Verify that the VM’s automount configuration is set up correctly, especially for non-boot persistent disks.
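On Linux, the common automount gotcha is an /etc/fstab entry that references a device name instead of a UUID, or that lacks nofail, so the VM hangs at boot if the disk attaches late or not at all. A quick check, with a hypothetical data disk at /dev/sdb and mount point /mnt/data:

```
# Find the UUID of the non-boot data disk (device name is an example).
sudo blkid /dev/sdb

# /etc/fstab should reference that UUID and include nofail, e.g.:
#   UUID=aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee /mnt/data ext4 defaults,nofail 0 2
grep -n "nofail" /etc/fstab
```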
Before applying any changes, create a Persistent Disk snapshot as a baseline. This is your most critical rollback mechanism against data loss or corruption.
Finally, apply the new machine type and restart the VM. This implicitly triggers the required disk detach/re-attach cycle.
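Putting those last steps together, here’s a minimal gcloud sketch. The instance name web-1, data disk web-1-data, zone, and target machine type are all hypothetical; adjust them to your environment:

```
ZONE=us-central1-a

# 1. Baseline snapshot of the data disk (your rollback point).
gcloud compute disks snapshot web-1-data \
  --zone="$ZONE" \
  --snapshot-names=web-1-data-pre-resize

# 2. Stop the instance; the machine type can only change while it is stopped.
gcloud compute instances stop web-1 --zone="$ZONE"

# 3. Apply the new machine type.
gcloud compute instances set-machine-type web-1 \
  --zone="$ZONE" \
  --machine-type=e2-standard-4

# 4. Start the instance again; attached disks come back with it.
gcloud compute instances start web-1 --zone="$ZONE"
```

If the instance doesn’t come back cleanly, the snapshot from step 1 is what you restore from.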
Disk Tuning
Disk tuning matters because in GCP, storage changes are rarely isolated. Even small adjustments to disk size or type are tightly coupled to VM behavior and can introduce real operational risk.
There are two reasons to tune disks: cost optimization and performance & capacity management.
Cost Optimization
Persistent disks are often over-provisioned, which quietly drives up cloud spend. You can identify which disks are over-provisioned by comparing their actual IOPS and throughput usage against their provisioned capacity. Right-sizing those disks can reduce cost without impacting performance.
Performance & Capacity Management
As workloads change, teams must tune disks to adjust capacity or performance. This is done by:
- Upsizing: Increasing the disk size when more space is needed.
- Downsizing or changing type: Reducing the disk size, or switching the disk type (e.g., between pd-standard and pd-ssd), to balance performance, stability, and cost
You can often upsize while the VM is running, using the Google Cloud console or the gcloud CLI. Once upsized, the guest OS must expand the filesystem to use the new space.
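For example, growing a hypothetical data disk to 500 GB while the VM stays up (disk name and zone are placeholders):

```
# Online resize of a persistent disk; this only grows the block device.
gcloud compute disks resize web-1-data \
  --zone=us-central1-a \
  --size=500GB

# The guest OS still has to extend the partition and filesystem
# (see the next section) before the new space is usable.
```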
Downsizing or changing type is more disruptive: GCP does not support shrinking a persistent disk or converting its type in place. A type change typically means shutting down the VM, snapshotting the disk, and creating a new disk of the desired type from that snapshot; a true size reduction typically means creating a new, smaller disk and copying the data over at the filesystem level. In either case, you swap the attachment and then start the VM back up.
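A hedged sketch of a type change via snapshot, again with hypothetical names; note that a disk created from a snapshot cannot be smaller than the source disk, which is why a real downsize needs a file-level copy (e.g., rsync to a fresh smaller disk) instead:

```
ZONE=us-central1-a

# 1. Stop the VM so the disk contents are quiescent.
gcloud compute instances stop web-1 --zone="$ZONE"

# 2. Snapshot the existing disk.
gcloud compute disks snapshot web-1-data \
  --zone="$ZONE" \
  --snapshot-names=web-1-data-migrate

# 3. Create a replacement disk of the new type from the snapshot.
gcloud compute disks create web-1-data-ssd \
  --zone="$ZONE" \
  --type=pd-ssd \
  --source-snapshot=web-1-data-migrate

# 4. Swap the attachment and start the VM.
gcloud compute instances detach-disk web-1 --disk=web-1-data --zone="$ZONE"
gcloud compute instances attach-disk web-1 --disk=web-1-data-ssd --zone="$ZONE"
gcloud compute instances start web-1 --zone="$ZONE"
```

Keep the old disk around until the workload has been verified on the new one; it’s a cheap insurance policy.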
Extending the Partition and Filesystem
To make the new space usable, it must be extended at both the partition and filesystem layers; the exact process depends on your VM’s operating system.
For Linux, you can use tools like fdisk, parted, or gparted to inspect and resize the partition table. In environments using Logical Volume Manager (LVM), cloud-init or distribution-specific tools may assist with resizing, but manual steps are still commonly required.
Once the partition is extended, the filesystem must also be resized (a combined example follows this list):
- Use resize2fs for ext2, ext3, or ext4 filesystems
- Use xfs_growfs for XFS
- Use lvextend for LVM volumes, followed by the appropriate filesystem resize command
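Here’s a combined example for a hypothetical non-LVM ext4 data disk on /dev/sdb (growpart comes from the cloud-guest-utils / cloud-utils-growpart package; device names and mount points are placeholders):

```
# Grow partition 1 of /dev/sdb to fill the newly resized disk.
sudo growpart /dev/sdb 1

# Extend the ext4 filesystem to use the enlarged partition.
sudo resize2fs /dev/sdb1

# For XFS, resize via the mount point instead:
#   sudo xfs_growfs /mnt/data

# Confirm the new capacity is visible.
df -h /mnt/data
```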
For Windows, both partition and filesystem expansion are usually handled together through the disk management utility (diskmgmt.msc), making the process more straightforward.
It’s important to note that the exact commands and steps can vary significantly depending on the VM's OS, whether it's a boot disk or a secondary data disk, and the specific filesystem used.
Verify & Finalize
Once optimization is complete, confirm that both the infrastructure and application are behaving as expected.
Restart the VM if required and verify the following (a quick command-line check follows this list):
- The new disk size/type is reflected in the Google Cloud console
- The guest OS can access the disk and all application data
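A quick way to confirm both points, using the same hypothetical names as before:

```
# Confirm the disk's size and type as GCP sees them.
gcloud compute disks describe web-1-data \
  --zone=us-central1-a \
  --format="value(sizeGb, type)"

# From inside the guest: confirm the filesystem sees the space
# and the application data is still readable.
df -h /mnt/data
ls /mnt/data | head
```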
Once normal traffic returns to the virtual machine, analyze the telemetry and application behavior to confirm nothing has regressed.
Cleanup
After a successful change and stability period, delete the old, stale snapshots and any decommissioned VM configurations to prevent accruing unnecessary storage costs.
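For example, to review and then remove the pre-change baseline snapshot once you’re confident in the result (the snapshot name is the hypothetical one used earlier):

```
# Review existing snapshots and their creation timestamps first.
gcloud compute snapshots list \
  --format="table(name, diskSizeGb, creationTimestamp)"

# Delete the baseline snapshot once the change has proven stable.
gcloud compute snapshots delete web-1-data-pre-resize
```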
From Manual Toil to Safe, Autonomous GCP Optimization
This kind of manual optimization does work. But the truth is, it’s not a scalable approach for most companies, which have hundreds or even thousands of VMs & disks. Your SRE team just can’t keep pace.
Through my own frustration with GCP, I realized that the process can and should be handled by an autonomous system.
My team and I have been working on this problem for a while, and we recently released a feature that can now right-size and tune your VMs & disks for you, all under your supervision.
Now, Sedai can analyze real CPU utilization, evaluate disk IOPS and throughput usage, identify waste without impacting workloads, and ultimately apply those optimizations for you.
And you still have control with the ability to approve or deny any change.
Autonomy allows you to optimize continuously across your entire environment without burning out your team. If you want to see how you can autonomously scale, see what our team has built for GCE VM & disk optimization in GCP.
The mechanics stay the same; you just don’t have to babysit them anymore. Get started and see how Sedai does it.
