11.3.2014

Outage Report and Postmortem

By Jake Vargas

On October 26th, a number of engine services went offline for around eight hours. During that time, developers could not download engine binaries or Marketplace content, and could not subscribe to the service.

OUTAGE ONSET

2014/10/26 4:45AM ET

OUTAGE RESOLVED

2014/10/26 12:26PM ET

AFFECTED SERVERS AND SERVICES

Unreal Engine Client Launcher
buildinfo-public-service-prod06 (Buildinfo services)

OUTAGE IMPACT

Live launcher unable to download or load properly.
UE4 state: 'subscribe' 
Fortnite and Unreal Tournament clients state: 'Unavailable'
Marketplace items state: 'Syncing'

DATA LOSS

None

CAUSE

The Java version running on the servers was u45. This version has a known memory leak, which caused the OOM killer to stop the Buildinfo service.

RESOLUTION

Update the Java version on the Buildinfo servers and restart the service.

FUTURE MITIGATION

1- Alerting on ELB (load balancer) issues to properly notify as P0 (Critical) when *all* instances are offline|unhealthy.
2- NOC to watch for instances that are online while their services remain unavailable. Standard Operating Procedure (SOP) is to restart services and determine root cause.
3- Fixed issue with invalid characters in VictorOps distribution list.
4- For Emergencies: Our internal support staff will make a phone call when in doubt! Texting isn't the preferred way to communicate critical issues. We will get acknowledgement when handing off issues, to ensure a clear chain of ownership and accountability from start to finish.
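
The distinction in item 1 between a partial outage and a total one can be sketched as a small severity rule. This is a minimal illustration in Python, not the actual Zabbix configuration. The function name, the `Severity` levels, and the input shape (a mapping of instance ID to a health flag) are all assumptions made for the example.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"  # some instances unhealthy (the "ELB (Partial)" case)
    P0 = "p0"            # critical: nothing healthy behind the load balancer


def classify_elb_health(instance_states):
    """Map per-instance health flags to an alert severity.

    instance_states is a hypothetical dict of instance ID -> True if healthy.
    An empty pool is treated as P0, since no server can take traffic.
    """
    healthy = sum(1 for is_healthy in instance_states.values() if is_healthy)
    if healthy == 0:
        return Severity.P0
    if healthy < len(instance_states):
        return Severity.WARNING
    return Severity.OK
```

With a rule of this shape, auto-scaling events and production pushes that briefly take one instance out of rotation would raise at most a warning, while the situation in this outage (every instance unhealthy) would page as P0.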

DETAILS

At 4:45 AM EDT buildinfo services went offline. Between approximately 6:38 AM EDT and 7:00 AM EDT emails began to circulate regarding this issue. At 7:07 AM EDT TechOps confirmed receipt of the notification. Ticket LIVE-1104 was created to document the events of this issue. At 12:21 PM EDT DevOps started the buildinfo service to bring services back online. DevOps determined that a problematic Java version was running on the systems. At 12:29 PM EDT DevOps confirmed that the new Java version, 1.7.0_67, was now in use for the buildinfo service. At 12:30 PM EDT confirmation was received from the Senior Programmer of our Engine Team that services were restored.

RESPONSE DELAY ANALYSIS

Our monitoring system, Zabbix, sent out an alert that a buildinfo service was offline on one instance prior to the secondary. However, due to problems with alerts occurring during auto-scaling events and production pushes, it is difficult to immediately distinguish between valid and invalid alerts. Problematic alerts are "ELB (Partial)" and "Port Down" alerts. An ELB (all instances down) alert has been created. 

A problem was found in the TechOps on-call distribution list. During the conversion from Exchange to Gmail, the email address used to send an alert to the VictorOps integration was mangled. As a result the email failed to send and VictorOps was not notified. In addition, the Epic Games Gmail domain by default does not allow addresses from external domains to be included in Gmail groups.
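
A periodic sanity check over the distribution list could catch this class of problem before an incident. The sketch below is only an illustration: the report does not say how the address was mangled, so the pattern is a deliberately conservative assumption, and the function names and `example.*` addresses are hypothetical.

```python
import re

# Conservative pattern for illustration only; it accepts common address
# shapes and rejects stray characters such as '<', spaces, or quotes.
ADDRESS_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_invalid_addresses(addresses):
    """Return the addresses that do not match the conservative pattern."""
    return [a for a in addresses if not ADDRESS_PATTERN.fullmatch(a)]


def find_external_addresses(addresses, internal_domain):
    """Return well-formed addresses whose domain is not the internal one.

    These are the members a group that disallows external domains
    would silently fail to deliver to.
    """
    external = []
    for address in addresses:
        if not ADDRESS_PATTERN.fullmatch(address):
            continue  # malformed entries are reported by find_invalid_addresses
        domain = address.rsplit("@", 1)[1].lower()
        if domain != internal_domain.lower():
            external.append(address)
    return external
```

Running both checks after any mail migration, and sending a test alert through the list, would confirm that the paging integration is still reachable.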

TechOps is currently down to an on-call rotation of two site reliability engineers. One was preparing for a flight; the other was unaware of the issue until a phone call was made. There was confusion about who had assumed responsibility once the first engineer was in transit. The TechOps list did not adequately notify DevOps personnel of the responsibility change, yet it did notify the Producer. Only after the call to the secondary TechOps engineer did it become clear that this miscommunication had caused a significant delay. DevOps then promptly took action to resolve the issue. TechOps should not have assumed it to be a Producer's role to pass on responsibility during a technical issue unless that was mutually agreed to under the circumstances.

RESPONSE MITIGATION

Alerting logic was the primary reason for the delayed response. The secondary contributing factor was that TechOps relies on source-based alerting from VictorOps (which includes the TechOps on-call DL). Because preliminary notification was via email or text, there was a significant delay.

- VictorOps uses multiple forms of alerting. 1- Android|iOS based app alerting, 2- SMS, 3- Phone Call.
- The need for verbal communication during crisis situations has been reiterated.

TechOps on call DL should have sent an alert to the TechOps team. - This has been fixed
Zabbix should have sent a Critical alert when no servers were behind the ELB. - This has been fixed
Zabbix WebChecks should have alerted critical during a service outage. - A Zabbix upgrade has completed and we are adding webchecks
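
A web check differs from a port or instance check in that it exercises the service itself, which is what would have flagged instances that were online while the buildinfo service was down. The following is a rough Python sketch of the idea, not the Zabbix web scenario actually being deployed; the function names, the expected-text convention, and the timeout are assumptions.

```python
import urllib.error
import urllib.request


def response_is_healthy(status_code, body, expected_text):
    """A response counts as healthy only if it is a 200 with the expected text."""
    return status_code == 200 and expected_text in body


def web_check(url, expected_text, timeout=5.0):
    """Fetch url and report whether the service answered correctly.

    Any connection failure, timeout, HTTP error status, or malformed URL
    is treated as unhealthy rather than raised to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response_is_healthy(response.status, body, expected_text)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A check like this fails when the process behind the port has been killed or is returning errors, even if the host still responds to pings.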

We apologize for the delay in notifying everyone and are working to get this information out more quickly. 
