Google claims that a flaw in the automatic quota management system that affects the Google User ID Service was the global authentication system failure that most consumer-facing series impacted on Monday.
This global device malfunction stopped users of all Cloud providers from signing in to their accounts and authenticating.
As a consequence, for about an hour on Monday, December 14th, users have not had access to Gmail, YouTube, Google Drive, Google Maps, Google Calendar, and other Google services.
Users did not send emails to desktop clients by using Gmail mobile applications or email via POP3, and YouTube visitors saw error messages saying, “There was a problem with the server (503) – Tap to retry.”
Outage impact and root cause
“On Monday 14 December 2020 from 03:46 to 04:33 US/Pacific, credential issuance and account metadata lookups for all Google user accounts failed,” Google said. “We were therefore unable, in almost all authenticated traffic, to confirm that the user requests were authenticated and served 5xx errors.
“The majority of authenticated services experienced similar control plane impact: elevated error rates across all Google Cloud Platform and Workspace APIs and Consoles.”
Due to a bug in the automatic quote management system, the root cause of the malfunction was a replacement of Google’s core identity management system.
This resulted in difficulties in checking the authentication of Google account requests and failures in all authentication attempts.
Global identity management system
The Google User ID Service, which was at the center of Monday’s big Google failure, stores single identifiers for all Google users and handling both OAuth tokens and cookies for authentication.
It also stored user account data on a distributed database that uses Paxos protocols to authenticate updates.
As the User ID Program refuses demands for security purposes for the detection of obsolete info, all of the Google services customer facing Google OAuth access specifications became inaccessible right after complications started to arise and outdated recognition was released.
“Google uses an evolving suite of automation tools to manage the quota of various resources allocated for services,” the company said in a problem overview report that was released today.
“In October, a change was made in the new quota system to register a new service for user ID, but parts of the previous quota system remained in place, which misreported that the usage for the service was 0. “When the service was continually transferred to a new quota system.
“An existing grace period on enforcing quota restrictions delayed the impact, which eventually expired, triggering automated quota systems to decrease the quota allowed for the User ID service and triggering this incident.”
Although security checks are in effect to avoid unplanned adjustments in quotas, they could not respond appropriately to the zero recorded loads single-service scenario.
“As a result, the quota for the account database was reduced, which prevented the Paxos leader from writing,” added Google. “Shortly after, the majority of reading operations became outdated which resulted in errors on authentication lookups.”
Google said this big setback also impacted the internal customers and the tools of the organization and triggered delays during a stop-up inquiry and status change monitoring.