What the hell is happening to PSN?


All day yesterday, I watched my husband trying to log into FF XIV on the PS4. All day, the PSN sign in servers remained down. They’re still listed as offline now. Xbox Live has been back up since yesterday afternoon. So what’s taking Sony so long?

Keeping production servers online is a large part of what I do professionally, so… I know this problem domain pretty well. And I’ve seen a lot of… speculation that is deeply misinformed. Here are my thoughts on the problem.

First: the cause of the outage. All evidence points to this being a DDoS (Distributed Denial of Service) attack. This is when a whole lot of computers in a lot of different locations send as much traffic as they can at a service, in an attempt to overwhelm it and knock it offline. The most common tool used to send all this traffic is a botnet. Building and maintaining a botnet requires a large amount of technical expertise. Using a botnet, on the other hand, just requires money and connections, because the people who take the time to build a botnet often want to make money from it, so they sell time on it.

Which brings us to the culprits of the DDoS: a group calling themselves Lizard Squad has taken credit for the attack. Whether they have any technical expertise is unknown, but they certainly seem to have access to one or more reasonably effective botnets. However, they claim to have stopped their attack yesterday, and PSN remains offline. Mitigating DDoS attacks is a tricky problem; there are things that work pretty well, but there’s always an upper bound on how much traffic you can mitigate.
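To make the "upper bound" point concrete, here's a toy model in Python. It is only a sketch under loud assumptions: the function name, the parameters, and the idea that excess traffic gets dropped uniformly at random are all mine for illustration, and none of it describes how Sony or Microsoft actually handle traffic.

```python
def legit_success_fraction(capacity_rps, legit_rps, attack_rps, filter_efficiency):
    """Estimate the fraction of legitimate requests that get served.

    Toy model (all names and numbers are hypothetical):
    - capacity_rps: requests per second the service can handle
    - legit_rps: legitimate requests per second
    - attack_rps: attack requests per second
    - filter_efficiency: share of attack traffic removed by mitigation (0.0 to 1.0)
    Assumes that once capacity is exceeded, requests are dropped uniformly
    at random, so legitimate and attack traffic suffer equally.
    """
    surviving_attack = attack_rps * (1.0 - filter_efficiency)
    total = legit_rps + surviving_attack
    if total <= capacity_rps:
        return 1.0
    return capacity_rps / total


if __name__ == "__main__":
    # Even with 99% of the attack filtered out, a big enough attack still
    # pushes total traffic past capacity and legitimate users start failing.
    for attack in (0, 100_000, 1_000_000, 10_000_000):
        fraction = legit_success_fraction(10_000, 5_000, attack, 0.99)
        print(f"attack={attack:>10} rps -> {fraction:.1%} of legit requests served")
```

The takeaway from the sketch is just that filtering buys you a multiplier, not immunity: whatever slips past the filter still competes with real users for a fixed amount of capacity.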

So there are a few possibilities.

  1. Lizard Squad is lying, and is still attacking PSN. If they have some vested interest in making Microsoft look more competent than Sony, this is pretty plausible. Mitigating a DDoS is a real challenge, and Sony and Microsoft both clearly can’t cope with these attacks. The usual solution would be to bring up more instances of the signin server; if that isn’t mitigating the issue then the network infrastructure may not be able to cope either. Which doesn’t say great things about Sony’s or Microsoft’s network infrastructure. But then, this whole scenario doesn’t say great things about the infrastructure of either service.
  2. Another group is also attacking PSN. Not much to add here; if Sony is still overwhelmed with traffic there’s little they can do.
  3. Sony intentionally kept PSN offline to do some sort of emergency upgrades. This seems really unlikely to me; there’s simply too much demand during the holidays to justify this. Sony would surely bring the servers back up and work on patches in parallel with that.
  4. The attack exposed a software bug in Sony’s signin servers. If the signin server software is crash-looping or inexplicably serving errors now, it may be down despite engineers working on a fix as hard as they can. This would suggest that they’re relying pretty heavily on some sort of stateful information that has entered a bad state, possibly a cache of some kind (which can’t be invalidated for some reason). Another possibility, which would suck for everyone involved, is that some bug caused user authentication data to be corrupted when the server was overloaded. If Sony is having to restore username/password hash data from a backup, that would explain why they are still offline. It would also explain why PSN seems to be working for some users but not others right now.
Personally, I suspect #4. It best fits the evidence and the observed behavior of the system. If so, we can only hope that no authentication data has been permanently lost, because that could mean broken, unrecoverable login accounts.
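For anyone wondering what "crash-looping on bad state" looks like, here's a minimal sketch in Python. Everything in it is hypothetical: `start_signin_server`, `supervise`, and the dictionary standing in for a persisted cache are made up for illustration, and I have no knowledge of how Sony's signin servers are actually built.

```python
class CorruptCacheError(Exception):
    """Raised when a persisted cache entry fails validation (hypothetical)."""


def start_signin_server(persisted_cache):
    # Hypothetical startup step: validate every cached entry before serving.
    # A value of None stands in for a record corrupted during the overload.
    for user, entry in persisted_cache.items():
        if entry is None:
            raise CorruptCacheError(user)
    return "serving"


def supervise(persisted_cache, max_restarts=3):
    # A supervisor restarts the server whenever it crashes. Because the bad
    # state lives outside the process, every restart hits the same failure.
    attempts = 0
    for _ in range(max_restarts):
        attempts += 1
        try:
            return start_signin_server(persisted_cache), attempts
        except CorruptCacheError:
            continue
    return "crash-looping", attempts


if __name__ == "__main__":
    print(supervise({"alice": "hash-a"}))               # healthy state
    print(supervise({"alice": "hash-a", "bob": None}))  # poisoned state
```

The point of the sketch is that restarting doesn't help when the bad state survives the restart; someone has to repair or replace the state itself, which is where a slow restore from backup would come in.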