Incident Response Story
I had an incident recently where I saw an unexpected user login and then my logs were deleted. On a live production system.
Basically, this meant holy shit. The sky was falling, we were hacked! Luckily, I stayed calm, kept an open mind, and tracked down the issue. I didn't panic, and I didn't break anything. A lot of the time when I am writing crisis response documentation, I am encouraged to shut down the system and take the downtime as soon as something like this happens (you see a hacker appear, for example). That would have led me to taking down the server, which would have done more damage to the company than leaving it on. Hindsight is 20/20, of course.
Backing up - what happened. Firstly, the good. I got a Slack message, into a general alerts channel, that a user not on the approved list had logged in. Specifically the Amazon-provided "ec2-user", which we never use, and which only continues to exist to own some files as uid 1000 rather than root. I'd prefer to have users ack their own logins, but for now we have a whitelist. So, the first pivot was to look at the sshd logs for that server in our central log store. So far so good. It seemed really weird. I saw a login, from our office IP, into a server (one of a fairly large farm too, not a unique server), followed by a logout within a second. Next step: log in to the server and make sure that user doesn't have a key associated. Or a password. Good. No user is currently logged in. No process is running that I can see (I'm not diving into rootkits and other stuff yet, I just want a timeline).

Here is where I go check /var/log/syslog, and notice no events between Feb 7th and today, shortly after the incident. This is where we decided it was probably a compromise. But from our own office. And logs deleted so sloppily. It didn't feel right. One of the important parts of this is that I could afford to lose the server; it's load balanced, so the only thing that could have been damaged was traffic. But if someone got in, then there was an earlier compromise, so better to find out what had happened than to try to clean up. Next step is to go to our SIEM, which isn't watching logs but system events. Nothing had connected on port 22 to the server within the hour the event happened in. (It happened at 15:58, and my own login wasn't until 16:10ish, so I didn't even see myself.)
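As an aside, the kind of allowlist check that fired here can be sketched as a small filter over sshd log lines. This is just an illustration, not our actual alerting pipeline; the usernames, channel, and sample log lines are made up, and real auth log formats vary by distro:

```python
import re

# Users we expect to log in; "ec2-user" is deliberately absent.
# (Hypothetical list for illustration.)
ALLOWED_USERS = {"deploy", "ansible"}

# Matches sshd's "Accepted <method> for <user> from <ip>" success lines.
LOGIN_RE = re.compile(
    r"sshd\[\d+\]: Accepted \S+ for (?P<user>\S+) from (?P<ip>\S+)"
)

def unexpected_logins(log_lines):
    """Yield (user, ip) for successful logins by non-allowlisted users."""
    for line in log_lines:
        m = LOGIN_RE.search(line)
        if m and m.group("user") not in ALLOWED_USERS:
            yield m.group("user"), m.group("ip")

sample = [
    "Feb 14 15:58:01 web1 sshd[4242]: Accepted publickey for ec2-user from 203.0.113.7 port 51000 ssh2",
    "Feb 14 16:10:22 web1 sshd[4311]: Accepted publickey for deploy from 203.0.113.7 port 51012 ssh2",
]
print(list(unexpected_logins(sample)))  # [('ec2-user', '203.0.113.7')]
```

In a real setup the matching lines would be routed to a chat webhook rather than printed; the point is only that the allowlist lives outside the server whose logs you are judging.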
This is the point where I started thinking about incident response. Tearing everything down. Beginning to pull in people. Taking down production. Preparing client statements. But it still felt stupid. Nothing seemed to be running that shouldn't be. External performance monitoring didn't disagree with internal performance monitoring. I didn't see any users. I didn't see anything weird operating on the worker. Why didn't I see the login in my SIEM? And if someone has done any hacking before, why delete exactly 7 days? Now, 7 days ago was when the server had been provisioned, meaning every log since the server started was gone. More than that, I could see myself logging in earlier that day to deploy new code to the server using Ansible. So our SIEM was good.
So, finally, I began to look at the logs directly: what did our log store show that the server itself didn't? Remember, I'd only looked at sshd logs. Well. It turns out the rest of the logs were a system reboot, systemd-journald discarding a corrupted journal, and the log file being reopened after the first boot lines. It also dumped a record of the last successful boot (when the image was created, the only time we log in as ec2-user). And so - the server had simply crashed ungracefully, and systemd had "helped" by deleting all the logs of the event and dumping old logs to rsyslog. It took me about 45 minutes to track all of this down, which was pretty good, and I quickly wrote something to explain the incident to the team, to make sure people could see the value of our systems.
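The tell, in the end, was the content of the surviving lines in the central store. A minimal sketch of flagging crash-vs-tampering evidence, assuming journald-style messages (the marker strings are approximations of what our logs showed, not canonical systemd output):

```python
# Substrings that suggest an ungraceful reboot rather than deliberate log
# deletion. Exact wording varies by systemd/rsyslog version; these are
# illustrative approximations, not guaranteed message text.
CRASH_MARKERS = (
    "corrupted or uncleanly shut down",  # journald discarding a bad journal
    "Journal started",                   # journald coming back up after boot
)

def crash_evidence(log_lines):
    """Return lines hinting at a crash/reboot, in original order."""
    return [line for line in log_lines
            if any(marker in line for marker in CRASH_MARKERS)]

sample = [
    "Feb 14 15:58:03 web1 systemd-journald[201]: File /var/log/journal/x/system.journal corrupted or uncleanly shut down, renaming and replacing.",
    "Feb 14 15:58:04 web1 systemd-journald[201]: Journal started",
    "Feb 14 16:10:22 web1 sshd[4311]: Accepted publickey for deploy from 203.0.113.7 port 51012 ssh2",
]
for line in crash_evidence(sample):
    print(line)
```

If a query like this comes back with corruption and journal-restart markers bracketing the gap, an ungraceful reboot becomes a much better hypothesis than an attacker with a delete key.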
I don't know what the moral of the story is, but it's probably the most fun you'll get to have with DFIR that doesn't involve a sleepless night. I do think there's something to say about not panicking in incident response, though. Obviously, if this had been a different server, or a more obvious login, it would have been a good idea to shut it down. But in this case, it was easy to jump to a scary hacker and build a giant problem where there was none. One of the things I appreciate is that the SIEM quickly showed me the "login" wasn't real, which kept me suspicious of the event. I am glad my central log store alerted me; it's great to know that if there's a real issue of this kind, I'll know. And I am glad I had central logs to preserve the evidence that got destroyed by badly behaved logging tools.
All in all, this was a triumph of tools being available, and a good lesson to me in not panicking.