A feed by Google Cloud Platform
Permalink - Posted on 2021-01-15 15:05
The issue with Cloud Interconnect has been resolved for all affected projects as of Friday, 2021-01-15 06:53 US/Pacific. We thank you for your patience while we worked on resolving the issue.
Permalink - Posted on 2021-01-15 14:14
Description: Mitigation work is currently underway by our engineering team. We do not have an ETA for mitigation at this point. We will provide more information by Friday, 2021-01-15 07:25 US/Pacific. Diagnosis: None at this time. Workaround: None at this time.
Permalink - Posted on 2021-01-15 13:34
Description: We are experiencing an issue with Cloud Interconnect beginning at Friday, 2021-01-15 04:36:30 PST. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2021-01-15 06:15 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: None at this time. Workaround: None at this time.
Permalink - Posted on 2021-01-14 20:15
Description: Mitigation work is still underway by our engineering team. The mitigation is expected to complete by Tuesday, 2021-01-19 12:00 US/Pacific. Please see the workaround section below for more details.

Diagnosis: The command "gcloud components update" fails for Cloud SDK versions 321, 322 and 323 installed on Windows.

Workaround: Please run the following commands in a PowerShell window:

    $gcloudDir = Get-Command gcloud | Select -ExpandProperty "Source" | Split-Path | Split-Path
    attrib -r "$gcloudDir\platform\kuberun_licenses\*.*" /s
    attrib -r "$gcloudDir\lib\kuberun\*.*" /s
    attrib -r "$gcloudDir\..\google-cloud-sdk.staging\platform\kuberun_licenses\*.*" /s
    attrib -r "$gcloudDir\..\google-cloud-sdk.staging\lib\kuberun\*.*" /s
    Remove-Item "$gcloudDir\..\google-cloud-sdk.staging" -Recurse

If any of the commands fail, proceed with running the remaining commands. After running the PowerShell script, run the following in a regular Command Prompt (not PowerShell):

    gcloud components update --version 320.0.0

Please note, after applying this workaround, do not run 'gcloud components update' as this will re-trigger the issue. Please wait until the fix is released before updating components.
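
Before applying the workaround, it may help to confirm that an installed SDK version is one of those named in the diagnosis. The following is a minimal sketch, assuming the version is already available as a string such as "322.0.0"; the function name and the set constant are hypothetical and are not part of the Cloud SDK.

```python
import re

# Major versions named in the diagnosis above; no other versions are assumed affected.
AFFECTED_MAJOR_VERSIONS = {321, 322, 323}


def is_affected_sdk_version(version: str) -> bool:
    """Return True if a version string such as '322.0.0' has an affected major version."""
    match = re.match(r"^(\d+)(?:\.|$)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)) in AFFECTED_MAJOR_VERSIONS


if __name__ == "__main__":
    for candidate in ("320.0.0", "322.0.0"):
        print(candidate, is_affected_sdk_version(candidate))
```

The check only looks at the major version, since the diagnosis lists versions 321, 322 and 323 without minor or patch detail.
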
Permalink - Posted on 2021-01-08 21:20
The issue with Cloud L7 (HTTP) External Load Balancer components has been resolved for all affected users as of Friday, 2021-01-08 12:21 US/Pacific. We thank you for your patience while we worked on resolving the issue.
Permalink - Posted on 2021-01-08 20:31
Description: We are experiencing an intermittent issue with Cloud L7 (HTTP) External Load Balancer components beginning at Friday, 2021-01-08 11:39 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2021-01-08 14:00 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Cloud L7 (HTTP) External LB updates appear stalled - customers will likely see delays to updating their projects. Workaround: None at this time.
Permalink - Posted on 2021-01-08 20:27
Description: We are experiencing an intermittent issue with Google Cloud configuration infrastructure components beginning at Friday, 2021-01-08 11:39 US/Pacific. Our engineering team continues to investigate the issue. We will provide an update by Friday, 2021-01-08 14:00 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Cloud customer updates appear stalled - customers will likely see delays to updating their projects. Workaround: None at this time.
Permalink - Posted on 2020-12-23 00:49
The following is a correction to the previously posted ISSUE SUMMARY, which after further research we determined needed an amendment. All services that require sign-in via a Google Account were affected with varying impact. Some operations with Cloud service accounts experienced elevated error rates on requests to the following endpoints: www.googleapis.com or oauth2.googleapis.com. Impact varied based on the Cloud Service and service account. Please open a support case if you were impacted and have further questions.
Permalink - Posted on 2020-12-18 19:37
# ISSUE SUMMARY

On Monday 14 December, 2020, for a duration of 47 minutes, customer-facing Google services that required Google OAuth access were unavailable. Cloud Service accounts used by GCP workloads were not impacted and continued to function. We apologize to our customers whose services or businesses were impacted during this incident, and we are taking immediate steps to improve the platform’s performance and availability.

# ROOT CAUSE

The Google User ID Service maintains a unique identifier for every account and handles authentication credentials for OAuth tokens and cookies. It stores account data in a distributed database, which uses Paxos protocols to coordinate updates. For security reasons, this service will reject requests when it detects outdated data.

Google uses an evolving suite of automation tools to manage the quota of various resources allocated for services. As part of an ongoing migration of the User ID Service to a new quota system, a change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0. An existing grace period on enforcing quota restrictions delayed the impact; when it eventually expired, automated quota systems decreased the quota allowed for the User ID Service, triggering this incident.

Safety checks exist to prevent many unintended quota changes, but at the time they did not cover the scenario of zero reported load for a single service:

• Quota changes to a large number of users, since only a single group was the target of the change,
• Lowering quota below usage, since the reported usage was inaccurately being reported as zero,
• Excessive quota reduction to storage systems, since no alert fired during the grace period,
• Low quota, since the difference between usage and quota exceeded the protection limit.

As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups.

# REMEDIATION AND PREVENTION

The scope of the problem was immediately clear as the new quotas took effect. This was detected by automated alerts for capacity at 2020-12-14 03:43 US/Pacific, and for errors with the User ID Service starting at 03:46, which paged Google Engineers at 03:48 within one minute of customer impact. At 04:08 the root cause and a potential fix were identified, which led to disabling the quota enforcement in one datacenter at 04:22. This quickly improved the situation, and at 04:27 the same mitigation was applied to all datacenters, which returned error rates to normal levels by 04:33. As outlined below, some user services took longer to fully recover.

In addition to fixing the underlying cause, we will be implementing changes to prevent, reduce the impact of, and better communicate about this type of failure in several ways:

1. Review our quota management automation to prevent fast implementation of global changes
2. Improve monitoring and alerting to catch incorrect configurations sooner
3. Improve reliability of tools and procedures for posting external communications during outages that affect internal tools
4. Evaluate and implement improved write failure resilience into our User ID service database
5. Improve resilience of GCP Services to more strictly limit the impact to the data plane during User ID Service failures

We would like to apologize for the scope of impact that this incident had on our customers and their businesses. We take any incident that affects the availability and reliability of our customers extremely seriously, particularly incidents which span multiple regions.
We are conducting a thorough investigation of the incident and will be making the changes which result from that investigation our top priority in Google Engineering.

# DETAILED DESCRIPTION OF IMPACT

On Monday 14 December, 2020 from 03:46 to 04:33 US/Pacific, credential issuance and account metadata lookups for all Google user accounts failed. As a result, we could not verify that user requests were authenticated and served 5xx errors on virtually all authenticated traffic. The majority of authenticated services experienced similar control plane impact: elevated error rates across all Google Cloud Platform and Google Workspace APIs and Consoles. Products continued to deliver service normally during the incident except where specifically called out below. Most services recovered automatically within a short period of time after the main issue ended at 04:33. Some services had unique or lingering impact, which is detailed below.

#### Cloud Console

Any users who hadn't already authenticated to Cloud Console were unable to log in. Users who had already authenticated may have been able to use Cloud Console but may have seen some features degraded.

#### Google BigQuery

During the incident, streaming requests returned ~75% errors, while BigQuery jobs returned ~10% errors on average globally.

#### Google Cloud Storage

Approximately 15% of requests to Google Cloud Storage (GCS) were impacted during the outage, specifically those using OAuth, HMAC or email authentication. After 2020-12-14 04:31 US/Pacific, the majority of impact was resolved; however, there was lingering impact for <1% of clients that attempted to finalize resumable uploads that started during the window. These uploads were left in a non-resumable state; the error code GCS returned was retryable, but subsequent retries were unable to make progress, leaving these objects unfinalized.
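
The resumable-upload behavior described above, where an error is reported as retryable but retries never advance, can be guarded against on the client side by tracking progress rather than only counting errors. The following is a minimal sketch under stated assumptions: `attempt` is a hypothetical callable that returns the number of bytes the server has committed so far, and nothing here is the Cloud Storage client API.

```python
from typing import Callable


def finalize_with_progress_check(
    attempt: Callable[[], int],
    total_bytes: int,
    max_stalled_attempts: int = 3,
) -> bool:
    """Retry until the upload completes or it stops making progress.

    attempt() is a hypothetical callable returning the bytes committed so far.
    Returns True once total_bytes are committed. Returns False after
    max_stalled_attempts consecutive attempts commit nothing new, which is
    the caller's signal to abandon the session and start a fresh upload.
    """
    committed = 0
    stalled = 0
    while committed < total_bytes:
        now_committed = attempt()
        if now_committed > committed:
            committed = now_committed
            stalled = 0  # progress was made, so reset the stall counter
        else:
            stalled += 1
            if stalled >= max_stalled_attempts:
                return False
    return True


if __name__ == "__main__":
    # A session that never advances, like the non-resumable uploads described above.
    print(finalize_with_progress_check(lambda: 0, total_bytes=100))
```

A production client would typically also add backoff between attempts; that is omitted here to keep the progress check itself visible.
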
#### Google Cloud Networking

The networking control plane continued to see elevated error rates on operations until it fully recovered at 2020-12-14 05:21 US/Pacific. Only operations that made modifications to the data plane VPC network were impacted. All existing configurations in the data plane remained operational.

#### Google Kubernetes Engine

During the incident, ~4% of requests to the GKE control plane API failed, and nearly all Google-managed and customer workloads could not report metrics to Cloud Monitoring. We believe ~5% of requests to Kubernetes control planes failed but do not have accurate measures due to unreported Cloud Monitoring metrics. For up to an hour after the outage, ~1.9% of nodes reported conditions such as StartGracePeriod or NetworkUnavailable which may have had an impact on user workloads.

#### Google Workspace

All Google Workspace services rely on Google's account infrastructure for login, authentication, and enforcement of access control on resources (e.g. documents, Calendar events, Gmail messages). As a consequence, all authenticated Google Workspace apps were down for the duration of the incident. After the issue was mitigated at 2020-12-14 04:32 US/Pacific, Google Workspace apps recovered, and most services were fully recovered by 05:00. Some services, including Google Calendar and Google Workspace Admin Console, served errors up to 05:21 due to a traffic spike following initial recovery. Some Gmail users experienced errors for up to an hour after recovery due to caching of errors from identity services.

#### Cloud Support

Cloud Support's internal tools were impacted, which delayed our ability to share outage communications with customers on the Google Cloud Platform and Google Workspace Status Dashboards. Customers were unable to create or view cases in the Cloud Console. We were able to update customers at 2020-12-14 05:34 US/Pacific after the impact had ended.
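
The ROOT CAUSE section of the report above notes that the safety checks did not cover zero reported load for a single service. The following is a minimal sketch of what such a pre-change guard could look like. It is an illustration only and not Google's quota system: the `QuotaChange` fields, the historical-peak comparison and the 50% reduction limit are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuotaChange:
    """Hypothetical description of a proposed quota change for one service."""
    service: str
    current_quota: float
    proposed_quota: float
    reported_usage: float
    historical_peak_usage: float


def reasons_to_block(change: QuotaChange, max_reduction_fraction: float = 0.5) -> List[str]:
    """Return the reasons a proposed change should be held for review (empty if none)."""
    reasons = []
    # Zero usage from a service that has used quota before suggests broken reporting.
    if change.reported_usage == 0 and change.historical_peak_usage > 0:
        reasons.append("reported usage is zero for a service with nonzero history")
    # Compare against history, not only against the possibly wrong current report.
    if change.proposed_quota < change.historical_peak_usage:
        reasons.append("proposed quota is below historical peak usage")
    # Limit how much quota can be removed in a single step.
    if change.current_quota > 0:
        reduction = (change.current_quota - change.proposed_quota) / change.current_quota
        if reduction > max_reduction_fraction:
            reasons.append("single-step reduction exceeds the allowed fraction")
    return reasons


if __name__ == "__main__":
    change = QuotaChange("user-id-service", 1000.0, 10.0, 0.0, 800.0)
    for reason in reasons_to_block(change):
        print(reason)
```

The design choice worth noting is that the guard distrusts a usage report of zero rather than treating it as a fact, which is the scenario the report says the existing checks missed.
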
Permalink - Posted on 2020-12-15 19:03
# ISSUE SUMMARY

On Wednesday 9 December, 2020, Google Cloud Platform experienced networking unavailability in zone europe-west2-a, resulting in some customers being unable to access their resources, for a duration of 1 hour 24 minutes. The following Google services had degraded service that extended beyond the initial 1 hour 24 minute network disruption:

- 1.5% of Cloud Memorystore Redis instances were unhealthy for a total duration of 2 hours 24 minutes
- 4.5% of Classic Cloud VPN tunnels in the europe-west2 region experienced unavailability after the main disruption had recovered and these tunnels remained down for a duration of 8 hours and 10 minutes
- App Engine Flex experienced increased deployment error rates for a total duration of 1 hour 45 minutes

We apologize to our Cloud customers who were impacted during this disruption. We have conducted a thorough internal investigation and are taking immediate action to improve the resiliency and availability of our service.

# ROOT CAUSE

Google’s underlying networking control plane consists of multiple distributed components that make up the Software Defined Networking (SDN) stack. These components run on multiple machines so that failure of a machine or even multiple machines does not impact network capacity. To achieve this, the control plane elects a leader from a pool of machines to provide configuration to the various infrastructure components. The leader election process depends on a local instance of Google’s internal lock service to read various configurations and files for determining the leader. The control plane is responsible for Border Gateway Protocol (BGP) peering sessions between physical routers connecting a cloud zone to the Google backbone. Google’s internal lock service provides Access Control List (ACL) mechanisms to control reading and writing of various files stored in the service.
A change to the ACLs used by the network control plane caused the tasks responsible for leader election to no longer have access to the files required for the process. The production environment contained ACLs not present in the staging or canary environments due to those environments being rebuilt using updated processes during previous maintenance events. This meant that some of the ACLs removed in the change were in use in europe-west2-a, and the validation of the configuration change in testing and canary environments did not surface the issue.

Google's resilience strategy relies on the principle of defense in depth. Specifically, despite the network control infrastructure being designed to be highly resilient, the network is designed to 'fail static' and run for a period of time without the control plane being present as an additional line of defense against failure. The network ran normally for a short period - several minutes - after the control plane had been unable to elect a leader task. After this period, BGP routing between europe-west2-a and the rest of the Google backbone network was withdrawn, resulting in isolation of the zone and inaccessibility of resources in the zone.

# REMEDIATION AND PREVENTION

Google engineers were automatically alerted to elevated error rates in europe-west2-a at 2020-12-09 18:29 US/Pacific and immediately started an investigation. The configuration change rollout was automatically halted as soon as the issue was detected, preventing it from reaching any other zones. At 19:30, mitigation was applied to roll back the configuration change in europe-west2-a. This completed at 19:55, mitigating the immediate issue. Some services such as Cloud Memorystore and Cloud VPN took additional time to recover due to complications arising from the initial disruption. Services with extended recovery timelines are described in the “detailed description of impact” section below.

We are committed to preventing this situation from happening again and are implementing the following actions:

In addition to rolling back the configuration change responsible for this disruption, we are auditing all network ACLs to ensure they are consistent across environments.

While the network continued to operate for a short time after the change was rolled out, we are improving the operating mode of the data plane when the control plane is unavailable for extended periods.

Improvements in visibility to recent changes will be made to reduce the time to mitigation.

Additional observability will be added to lock service ACLs allowing for additional validation when making changes to ACLs.

We are also improving the canary and release process for future changes of this type to ensure these changes are made safely.

# DETAILED DESCRIPTION OF IMPACT

On Wednesday 9 December, 2020 from 18:31 to 19:55 US/Pacific Google Cloud experienced unavailability for some Google services hosted in zone europe-west2-a as described in detail below. If impact time differs significantly, it will be mentioned specifically.

## Compute Engine

~60% of VMs in europe-west2-a were unreachable from outside the zone. Projects affected by this incident would have observed 100% of VMs in the zone being unreachable. Communication within the zone had minor issues, but largely worked normally. VM creation and deletion operations were stalled during the outage. VMs on hosts that had hardware or other faults during the outage were not repaired and restarted onto healthy hosts during the outage.

## Persistent Disk

VMs in europe-west2-a experienced stuck I/O operations for 59% of standard persistent disks located in that zone. 27% of regional persistent disks in europe-west2 briefly experienced high I/O latency at the start and end of the incident. Persistent Disk snapshot creation and restore for 59% of disks located in europe-west2-a failed during the incident.
Additionally, snapshot creation for Regional Persistent Disks with one replica located in zone europe-west2-a was unavailable.

## Cloud SQL

~79% of HA Cloud SQL instances experienced <5 minutes of downtime due to autofailover with an additional ~5% experiencing <25m of downtime after manual recovery. ~13% of HA Cloud SQL instances with legacy HA configuration did not fail over because the replicas were out of sync, and were unreachable for the full duration of the incident. The remaining HA Cloud SQL instances did not fail over due to stuck operations. Overall, 97.5% of Regional PD based HA instances and 23% of legacy MySQL HA instances had <25m downtime with the remaining instances being unconnectable during the outage. Google engineering is committed to improving the successful failover rate for Cloud SQL HA instances for zonal outages like this.

## Google App Engine

App Engine Flex apps in europe-west2 experienced increased deployment error rates between 10% and 100% from 18:44 to 20:29. App Engine Standard apps running in the europe-west2 region experienced increased deployment error rates of up to 9.6% that lasted from 18:38 to 18:47. ~34.7% of App Engine Standard apps in the region experienced increased serving error rates between 18:32 and 18:38.

## Cloud Functions

34.8% of Cloud Functions served from europe-west2 experienced increased serving error rates between 18:32 and 18:38.

## Cloud Run

54.8% of Cloud Run apps served from europe-west2 experienced increased serving error rates between 18:32 and 18:38.

## Cloud Memorystore

~10% of Redis instances in europe-west2 were unreachable during the outage. Both standard tier and basic tier instances were affected. After the main outage was mitigated, most instances recovered, but ~1.5% of instances remained unhealthy for 60 minutes before recovering on their own.

## Cloud Filestore

~16% of Filestore instances in europe-west2 were unhealthy.
Instances in the zone were unreachable from outside the zone, but access within the zone was largely unaffected.

## Cloud Bigtable

100% of single-homed Cloud Bigtable instances in europe-west2-a were unavailable during the outage, translating into 100% error rate for customer instances located in this zone.

## Kubernetes Engine

~67% of cluster control planes in europe-west2-a and 10% of regional clusters in europe-west2 were unavailable for the duration of the incident. Investigation into the regional cluster control plane unavailability is still ongoing. Node creation and deletion operations were stalled due to the impact to Compute Engine operations.

## Cloud Interconnect

Elevated packet loss for zones in europe-west2. Starting at 18:31 packets destined for resources in europe-west2-a experienced loss for the duration of the incident. Additionally, interconnect attachments in europe-west2 experienced regional loss for 7 minutes at 18:31 and 8 minutes at 19:53.

## Cloud Dataflow

~10% of jobs in europe-west2 failed or got stuck in cancellation during the outage. ~40% of Dataflow Streaming Engine jobs in the region were degraded over the course of the incident.

## Cloud VPN

A number of Cloud VPN tunnels were reset during the disruption and were automatically relocated to other zones in the region. This is within the design of the product, as the loss of one zone is planned. However once zone europe-west2-a reconnected to the network, a combination of bugs in the VPN control plane were triggered by some of the now stale VPN gateways in the zone. This caused an outage to 4.5% of Classic Cloud VPN tunnels in europe-west2 for a duration of 8 hours and 10 minutes after the main disruption had recovered.

## Cloud Dataproc

~0.01% of Dataproc API requests to europe-west2 returned UNAVAILABLE during the incident. The majority of these requests were read-only requests (ListClusters, ListJobs, etc.).

# SLA CREDITS

If you believe your paid application experienced an SLA violation as a result of this incident, please submit the SLA credit request: https://support.google.com/cloud/contact/cloud_platform_sla

A full list of all Google Cloud Platform Service Level Agreements can be found at https://cloud.google.com/terms/sla/
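
The report above attributes the missed validation to production containing ACLs that were absent from the staging and canary environments, and lists an ACL consistency audit among the follow-up actions. The following is a minimal sketch of such a comparison using plain set differences. The environment names and ACL entry strings are hypothetical, and this is not the lock service's interface.

```python
from typing import Dict, Set


def acls_missing_from_test_envs(
    acls_by_env: Dict[str, Set[str]],
    production: str = "production",
) -> Dict[str, Set[str]]:
    """For each non-production environment, return the production ACL entries it lacks.

    Environments that already contain every production entry are omitted.
    """
    prod_acls = acls_by_env[production]
    return {
        env: prod_acls - acls
        for env, acls in acls_by_env.items()
        if env != production and prod_acls - acls
    }


if __name__ == "__main__":
    # Hypothetical ACL entries, for illustration only.
    snapshot = {
        "production": {"leader-election:read", "config:read", "legacy-path:read"},
        "staging": {"leader-election:read", "config:read"},
        "canary": {"leader-election:read", "config:read"},
    }
    for env, missing in sorted(acls_missing_from_test_envs(snapshot).items()):
        print(env, sorted(missing))
```

Any entry this reports is one whose removal could not be exercised by a staging or canary rollout, which is the gap the report describes.
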
Permalink - Posted on 2020-12-15 12:13
The issue with Cloud Router metrics has been resolved for all affected projects as of Tuesday, 2020-12-15 04:00 US/Pacific and metrics from Cloud Router should now be flowing to Metrics Explorer. We thank you for your patience while we worked on resolving the issue.
Permalink - Posted on 2020-12-15 10:20
Description: We are experiencing an issue with Cloud Router monitoring data (metrics) missing beginning at Monday, 2020-12-14 12:00 US/Pacific. Routers themselves are not impacted and should be working with no issues. Our engineering team continues to investigate the issue. We will provide an update by Tuesday, 2020-12-15 04:15 US/Pacific with current details. We apologize to all who are affected by the disruption. Diagnosis: Visible decrease in Cloud Router data in Metrics Explorer. Workaround: None at this time.
Permalink - Posted on 2020-12-14 20:17
Preliminary Incident Statement while full Incident Report is prepared. (All Times US/Pacific)

Incident Start: 2020-12-14 03:45
Incident End: 2020-12-14 04:35
Duration: 50 minutes

### Affected:

- Services: Google Cloud Platform, Google Workspace
- Features: Account login and authentication to all Cloud services
- Regions/Zones: Global

### Description:

Google Cloud Platform and Google Workspace experienced a global outage affecting all services which require Google account authentication for a duration of 50 minutes. The root cause was an issue in our automated quota management system which reduced capacity for Google's central identity management system, causing it to return errors globally. As a result, we couldn’t verify that user requests were authenticated and served errors to our users.

### Customer Impact:

- GCP services (including Cloud Console, Cloud Storage, BigQuery, Google Kubernetes Engine) requiring authentication would have returned an error for all users.
- Google Workspace services (including Gmail, Calendar, Meet, Docs and Drive) requiring authentication would have returned an error for all users.

### Additional Details:

- Many of our internal users and tools experienced similar errors, which added delays to our outage external communication.
- We will publish an analysis of this incident once we have completed our internal investigation.
Permalink - Posted on 2020-12-14 14:23
As of 4:32 PST the affected system was restored, and all services recovered shortly afterwards.
Permalink - Posted on 2020-12-14 13:34
Google Cloud services are experiencing issues, and we will provide another update at 5:30 PST.
Permalink - Posted on 2020-12-10 05:04
The issue with Cloud Memorystore has been resolved for all affected projects as of Wednesday, 2020-12-09 21:00 US/Pacific. We thank you for your patience while we worked on resolving the issue.
Permalink - Posted on 2020-12-10 04:59
Description: The underlying infrastructure issue in europe-west2-a has been mitigated, and we are seeing recoveries in most Cloud Memorystore instances. We will continue to monitor for full recovery, and provide more information by Wednesday, 2020-12-09 22:30 US/Pacific. Diagnosis: None at this time. Workaround: None at this time.
Permalink - Posted on 2020-12-10 04:43
The issue with Google Cloud infrastructure components is believed to be resolved for all services; however, a small number of Compute resources may still be affected, and our Engineering Team is working on it. If you have questions or are impacted, please open a case with the Support Team and we will work with you until this issue is resolved. No further updates will be provided here. We thank you for your patience while we work on resolving the issue.
Permalink - Posted on 2020-12-10 04:42
The issue with Cloud Dataflow in europe-west2-a has been resolved for all affected projects as of Wednesday, 2020-12-09 20:41 US/Pacific. We thank you for your patience while we worked on resolving the issue.