We experienced some issues relating to login behaviour on the 9th and 10th November 2020. This article details what happened and what steps were taken. We're really sorry for the trouble caused by this.
To keep everyone's logins secure at Accurx, we have a two-phase process. Imagine two keys: the first key is used to check that you are you and should be using Accurx at all, but that first key can itself only be accessed using a second key (a bit like storing the key for your safe inside a locked filing cabinet). For an extra level of safety, these keys are rotated every 90 days, with an overlap so that no one gets logged out. However, a small part of the Accurx infrastructure didn’t use the two-key process and tried to use just one key. The end of the 90-day period was reached on the night before the first incident, which meant that when we rotated our keys, the wrong process (one key, not two) was used. As a result, people who were correctly logged in were logged out and their Accurx desktop stopped working.
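The rotation-with-overlap idea described above can be sketched in a few lines of Python. This is only an illustration and not Accurx's actual implementation: the KeyRing class, the 7-day overlap and the "user-123" session are all assumptions made for the example, and the second key that protects the first is left out for brevity. It shows why a service that knows about only one key ends up rejecting sessions that are still valid.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass, field

ROTATION_DAYS = 90  # rotation period, taken from the article
OVERLAP_DAYS = 7    # assumed; the article does not state the overlap length


@dataclass
class KeyRing:
    """Hypothetical ring of signing keys, indexed by the day each was created."""
    keys: dict = field(default_factory=dict)  # created_day -> key bytes

    def rotate(self, today: int) -> None:
        # Add a fresh key; older keys stay in the ring for the overlap window.
        self.keys[today] = secrets.token_bytes(32)

    def sign(self, message: bytes) -> tuple:
        # Always sign with the newest key and return its id with the signature.
        key_id = max(self.keys)
        digest = hmac.new(self.keys[key_id], message, hashlib.sha256).digest()
        return key_id, digest

    def verify(self, message: bytes, key_id: int, signature: bytes, today: int) -> bool:
        # Accept any known key that is still inside rotation period + overlap.
        key = self.keys.get(key_id)
        if key is None or today - key_id > ROTATION_DAYS + OVERLAP_DAYS:
            return False
        expected = hmac.new(key, message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, signature)


# A session signed before the rotation, and one signed after it.
ring = KeyRing()
ring.rotate(today=0)
old_session = ring.sign(b"user-123")
ring.rotate(today=90)
new_session = ring.sign(b"user-123")

# A service using the full ring keeps the older session alive during the overlap.
still_valid = ring.verify(b"user-123", *old_session, today=91)
expired = ring.verify(b"user-123", *old_session, today=98)

# A service that only holds the newest key rejects the same session,
# which to the user looks like being logged out.
single_key_service = KeyRing(keys={90: ring.keys[90]})
rejected = single_key_service.verify(b"user-123", *old_session, today=91)
```

In this sketch, still_valid is True while expired and rejected are both False. The overlap only protects users if every service checks against the whole ring, not a single key.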
Unfortunately, EMIS Web suffered an issue that same morning, which masked this problem and meant we were slightly slower to act on it than we would have hoped.
None of this affected the security of our system. We are putting in short-term measures to ensure this problem can’t recur, along with medium-term steps to isolate each piece of infrastructure so that they cannot affect each other.
A similar issue occurred on the 10th November. It was caused by some legacy data, which meant that some of our services had not updated to use the new keys. As a result, many users were repeatedly logged out of their accounts and prompted to log in again.
Unfortunately, resolving this meant restarting various services, which added more disruption to the Accurx service.
What was our timeline and plan of attack?
9th November
09:15 First log messages indicating issues with our key
09:45 Users first start reporting issues, raised with engineering team
09:55 Restart Accurx server
10:15 Regenerate encryption keys
10:30 Re-release latest update and restart Chain server
10:50 Issue appears resolved
Around 23,000 of our users were affected by this outage. We emailed them later in the day to acknowledge and apologise for the issue.
10th November
07:50 Users report issues again relating to logging in
09:00 Regenerate encryption keys
09:30 Re-release Florey, Chain, Ticket servers to roll out new keys
10:15 Confirmed resolved
Around 21,000 users found that they had to log in three times or more. We followed up with those users by email to acknowledge and apologise for this second issue.
Our engineering team are working on a full root cause analysis (RCA) and report, and on further improvements to stop this happening again. We'll also be monitoring very closely over the next few days so that we catch any potential issues much more quickly.