Log management best practices: auditing production systems

Why would I need to audit my production systems?

First Reason: Legal Requirements

Some regulated environments require that access to a database, and actions taken on it, be tracked.

The image below is a capture of version 3.2.1 of the PCI DSS standard:

For health data, the Nationwide Privacy and Security Framework for Electronic Exchange of Individually Identifiable Health Information is a bit less prescriptive, but meeting the obligation still requires a good audit system to be in place:

“Persons and entities should take reasonable steps to ensure that individually identifiable health information is complete, accurate, and up-to-date to the extent necessary for the person’s or entity’s intended purposes and has not been altered or destroyed in an unauthorized manner.”

This one is interesting because it brings up an important reason to audit your system queries: ensuring data integrity. It would be easy to assume that data is safe if access is restricted to staff in clearly defined roles. After all, you only hire professional and trustworthy people.

In this day and age, it’s critical to trust, but verify. That requires forensic evidence. If someone claims their data has been improperly accessed or tampered with, you need a proper log management solution to show whether that claim holds up. To do that, it’s essential that your system logs every action, not just security events. For example, application logs and operating system logs may contain security-related information alongside messages about events that do not initially appear security related. It's important to consider the potential value of each log source and event type. Furthermore, it’s not sufficient for log entries to demonstrate only one of the following:

  • access to applications, databases or servers is restricted to specific people or roles
  • only these staff had sessions on a given day
  • these commands were executed, but under a shared credential, so there is no clear authorship

Your log management setup needs to cover all three, and tie each command to an individual, in order to answer who did what, where, and when.
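To make "who did what, where, and when" concrete, here is a minimal sketch of what a single audit record could carry. The class name, field names, and helper functions are my own illustration, not a standard schema or any product's format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One audit record: who did what, where, and when (hypothetical schema)."""
    user: str       # who: an individual identity, not a shared credential
    role: str       # the role that granted the access
    resource: str   # where: the database, server, or application
    action: str     # what: the query or command that was run
    timestamp: str  # when: ISO 8601 timestamp in UTC


def make_event(user, role, resource, action, now=None):
    """Build an event, stamping it with the current UTC time by default."""
    now = now or datetime.now(timezone.utc)
    return AuditEvent(user, role, resource, action, now.isoformat())


def to_json_line(event):
    """Serialize an event as one JSON line, ready to append to a log file."""
    return json.dumps(asdict(event), sort_keys=True)
```

One JSON object per line keeps the records easy to ship to whatever log store you already use. The important design choice is that `user` is always a person, so a shared service account never ends up as the author of a command.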

Second Reason: Data Integrity

Ensuring data integrity means doing a lot of things, A LOT! This doesn’t just mean you have to back up data and set proper access control to prove it hasn’t been tampered with. You also need to track all changes to records to demonstrate that nothing was modified after ingestion from an external data source (client input such as a form, email, or upload, for example).

You must be able to prove that no system administrator or developer has modified the data from the original input. To support that kind of log analysis, you need to log data from both human and machine interactions.
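One simple way to back up that proof is to fingerprint each record at ingestion and compare it later. The sketch below is only an illustration under my own assumptions: records are JSON-serializable dictionaries, the function names are hypothetical, and the digest is stored somewhere the people who can edit the data cannot also edit.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a SHA-256 hex digest of the record in canonical JSON form."""
    # Sorting keys and fixing separators makes the digest independent of
    # key order and whitespace, so only real content changes alter it.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_unmodified(record: dict, digest_at_ingestion: str) -> bool:
    """Check a record against the digest captured when it was ingested."""
    return fingerprint(record) == digest_at_ingestion
```

A mismatch tells you that something changed, not who changed it. That second half is what the audit logs are for.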

When humans interact with data, sometimes that occurs in your application. In those cases, activities should be tracked in the application logs themselves. Other times, humans might query a database or SSH into a web server containing sensitive data. In those cases you will need another approach to log information from those sessions, queries, and commands.

We can all agree that in an ideal world no one would access the DB directly, and all changes would run through a deployment pipeline and be subject to version control. In reality, that is not always true. Sometimes just finding what went wrong in code requires connecting to the database to investigate. Without a record of the queries run during that session, you would be unable to prove what that developer did.
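As a small illustration of recording the queries in a session, the sketch below uses Python's built-in sqlite3 module, whose connections accept a trace callback that receives each SQL statement as it runs. SQLite is only a stand-in here, and `open_audited` and the shape of the log entries are hypothetical. A production database would normally need server-side or proxy-level logging, because a client-side hook like this is easy to bypass.

```python
import getpass
import sqlite3
from datetime import datetime, timezone


def open_audited(path, audit_log, user=None):
    """Open a SQLite connection that records every statement it runs.

    audit_log is any list-like object; each entry notes who ran the
    statement, when, and the SQL text itself.
    """
    user = user or getpass.getuser()
    conn = sqlite3.connect(path)

    def record(statement):
        # Called by sqlite3 for each statement the connection executes,
        # including transaction statements it issues implicitly.
        audit_log.append({
            "user": user,
            "when": datetime.now(timezone.utc).isoformat(),
            "statement": statement,
        })

    conn.set_trace_callback(record)
    return conn
```

Calling `open_audited(":memory:", log, user="alice")` and running a few queries leaves one entry per statement in `log`, which is exactly the record you would want after a debugging session on a live system.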

Third Reason: Forensic Analysis

This is the most important reason to create audit logs, especially for databases and servers. While most engineering teams claim to do “blameless postmortems”, it is impossible to conduct a postmortem without an event log of who issued each query. With that log, you know what happened and how to roll back.

One way to achieve that is to force all developers to query through an IDE or SQL interface. What that misses, however, is errors in code that uses an ORM framework on a developer workstation. Queries generated this way are hard to guess from the object code and can be a headache to reverse engineer. Maybe a workstation was pointed at the production DB instead of QA by mistake, or a bug fix had an overlooked side effect. There are too many cases to name them all, and the usual quote applies: “If it can happen, it will happen, the question is when.” Then you must ask, “When it happens, how do you plan to recover?”

Some version of these problems occurs pretty regularly. Sometimes the answer is just to restore from backup, even if that means losing some sensitive data. In the best case, this leads to a useful postmortem, as GitLab did a few years back.

Fourth Reason: Because You Can :)

Now I know we all should follow log management best practices, but my mother also said I should eat spinach (spoiler alert, I did not). Why? Because best practices are hard. I’ve insisted that queries and SSH commands should be logged because they’re the simplest cases to argue for. But the list doesn’t stop there. It also includes system settings; tinkering with the system clock or configuration could cause a fair amount of problems as well.

There are several ways to create that audit trail, including:

These DIY approaches take some work to build and maintain, but they’ll do the trick. If you have the budget, try strongDM. strongDM eliminates the PAM and VPN hell with a protocol-aware proxy that secures access to any database, Linux or Windows server, k8s, or internal web application.

In my experience, strongDM provides a straightforward and secure approach to gateway audit systems. It doesn’t solve all problems, of course, but it does a good job covering the bases I mentioned above with JSON logs that are easy to parse and consolidate. Another benefit to logging via strongDM is that it allows you to identify long-running queries which may have impacted application performance. Once you’ve figured out which queries are causing performance degradation, you can refactor them to be more efficient or schedule them in a low-activity window.
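Once query logs are in JSON, finding the slow ones takes only a few lines. The sketch below assumes one JSON object per line with a `duration_ms` field. That field name and the function are hypothetical and are not strongDM's actual log schema, so adjust them to whatever your logs really contain.

```python
import json


def slow_queries(lines, threshold_ms=1000):
    """Yield parsed log entries whose duration meets the threshold.

    Assumes one JSON object per line with a numeric "duration_ms"
    field (a hypothetical name, not any product's real schema).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        if not isinstance(entry, dict):
            continue  # ignore valid JSON that is not an object
        if entry.get("duration_ms", 0) >= threshold_ms:
            yield entry
```

Because it is a generator, you can feed it an open file object directly and it will stream through large log files without loading them into memory.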

There are also other benefits to using strongDM. Using it to secure your access gets you not only comprehensive log files, but also one-click user onboarding and offboarding, audits of access permissions at any point in time, real-time streams of queries in the web UI, and fully replayable server and k8s sessions. It’s a comprehensive suite of tools to manage access to your internal resources.