This war story retells a time we at strongDM engineering did something interesting (worth writing an article about, even). Articles like this are deep code investigations, incident breakdowns, glimpses into the matrix, and so on.
Act 1: The Problem Report
March 2020, a new support ticket: Oracle connections through strongDM are failing. The database in question has an interesting configuration: it’s hosted on a SPARC server, which means our usual tests against Oracle instances hosted locally or on a cloud service like AWS might not cover exactly how these dedicated machines run the software.
Our support team takes up the ticket, sees the failure live on a call with the customer, and signs off to reproduce the failure themselves and escalate it up the line for a backend engineer (or two) to fix. We get an easy reproduction: the client the customer is using (DBVis) simply never works with strongDM. We’ve got the client configuration, we’ve got reproducible error messages, and we can fix it in our custom Oracle driver.
Act 0: What’s Our Driver?
strongDM was built with a lofty, or crazy, thought: what if we rewrote every single database driver to live inside a common protocol that would support two things:
- Rewriting the authentication requests clients send, so that connections authenticate the way the customer configured the database in strongDM, and
- Tracking the requests sent to the target database, to audit everything clients did and report those requests back to administrators via audit logs.
On the one hand, database drivers are incredibly complex, specific software with decades of history and edge cases to support, so this shouldn’t be a practical task for a handful of engineers who, for the most part, had never done anything like it before. On the other hand, database protocols are almost all publicly documented and easy to find, and there are many open source drivers we can use to validate our implementations. Additionally, because we don’t need to parse everything about requests and responses (we only need to do the two things above), we can skip implementing half or more of most drivers’ features and hook into the connection only at the specific points we care about.
Oracle is one of our hardest protocols to implement—every client seems to send differently structured requests, and every server seems to respond differently to different clients. Our Oracle driver contains code supporting at least three different servers and at least seven different clients, and we often need to do specific things for specific client/server pairs.
One saving grace is our hard stop: we do not support servers or clients that are formally no longer supported by the database maintainers. We won’t delete support for older versions once they’re deprecated, but we don’t need to go all the way back to the beginning of Oracle’s history and support every server version ever released. For Oracle, we only need to support 11.2 (aka 11g Release 2) and greater as of March 2020 (11.2 support was dropped by Oracle in December 2020).
Act 2: We Solved It!
This specific problem was a connection between Oracle 11g and DBVis, according to the report. The error message, however, suggested a TLS parsing issue. We fixed how we padded AES packets, and while we were there, noticed that JDBC6 connections to Oracle 12c could work using an existing code path—so we enabled that client/server pair.
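To illustrate the class of bug, here is a minimal PKCS#7-style padder. Oracle’s native network encryption has its own padding rules, so treat this as a generic sketch of where such bugs live, not our actual fix.

```go
package main

import (
	"bytes"
	"fmt"
)

// pkcs7Pad pads plaintext out to a multiple of blockSize. Padding bugs
// usually live here: padding to the wrong boundary, or forgetting the
// mandatory full block of padding when the input is already aligned.
func pkcs7Pad(b []byte, blockSize int) []byte {
	n := blockSize - len(b)%blockSize // always 1..blockSize, never 0
	return append(b, bytes.Repeat([]byte{byte(n)}, n)...)
}

func main() {
	fmt.Println(len(pkcs7Pad(make([]byte, 14), 16))) // 16
	fmt.Println(len(pkcs7Pad(make([]byte, 16), 16))) // 32: a full extra block
}
```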
During and after this fix, we underwent the grueling process of iterating the seven Oracle clients we knew were in use against the three Oracle servers we officially supported—11g, 12c, and 19c—21 different manual tests. Our results:
We knew the customer was using DBVis, so we sent off this fix, opened new issues for the known failing versions, and everyone was happy.
Act 3: But Did We Solve It?
. . . Until the customer tried the new version, and it was still failing with the same error message. We extract some more information—the customer is using Sqoop and Hadoop to transfer the data for ETL jobs. We retest against our own 11g server with this setup, and everything succeeds.
It was April by this point, and we were running out of ideas. We still had one thread to pull—what if the error is something specific to how SPARC runs Oracle 11g? Should we make the multi-thousand dollar purchase of a SPARC server, or rent one, and learn how to configure it to match the customer’s configuration for this bug? Emails are exchanged:
The support team concluded last night that the installation of Oracle on SPARC would be prohibitively complex. It's not impossible, but likely requires many days of dedicated attention with an uncertain outcome.
We’d like to formally request access to a test user or instance in %CUSTOMER’s environment.
Act 4: The Mystery is Revealed
May 2020. Several emails, account setups, VPN setups, and internal tunnels later, we’ve got a direct Oracle connection to the customer’s hardware failing with the error the customer mentioned. We dig up Wireshark and start inspecting the differences between a raw connection, a connection through a strongDM driver, and a strongDM connection to a non-SPARC instance. First thing we notice: we’re parsing an internal key:value structure incorrectly. The driver assumes there will always be a value if there is a key, but in this case the structure changes and there’s no value to match some keys passing through the connection.
Instead of `01 1d 1d <key value> 00 02 09 39 <key value>`, we get `01 1d 1d <key> 00 02 09 39 <key value>`.
It’s an easy fix. The error persists.
The error tells us that when we attempted to send our forged authentication to the Oracle server, it rejected our credentials. That unfortunately means we don’t know what code is detecting the problem—it’s just something inside of Oracle that doesn’t like how we forged the authentication. The packets do differ, though, when we compare a raw connection to SPARC with a raw connection to an AWS-hosted 11g.
One parameter in the authentication is `AUTH_VFR_DATA`. This is a password salt and a mandatory component of 11g authentication. As part of our forgery, we generate cryptographically random bytes and replace the salt the client sent with new data before sending it off to the server. When our client talks to the SPARC server, however, it does not send anything for `AUTH_VFR_DATA`. We’re sending a salt when the server does not expect a salt, and so our forgery is rejected.
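The conditional we were missing looks roughly like this. The map representation, the function name, and the 10-byte salt length are all assumptions for illustration; the only load-bearing part is not injecting a salt the server never asked for.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// rewriteAuthVfrData swaps the client's AUTH_VFR_DATA salt for fresh
// random bytes -- but only when the server's handshake actually carried
// one. Forging a salt the server does not expect gets the whole
// authentication rejected.
func rewriteAuthVfrData(params map[string][]byte, serverSentSalt bool) error {
	if !serverSentSalt {
		// Older-style auth: no salt in the exchange, so we must not inject one.
		delete(params, "AUTH_VFR_DATA")
		return nil
	}
	salt := make([]byte, 10) // assumed length for illustration
	if _, err := rand.Read(salt); err != nil {
		return err
	}
	params["AUTH_VFR_DATA"] = salt
	return nil
}

func main() {
	params := map[string][]byte{"AUTH_VFR_DATA": []byte("client-salt")}
	rewriteAuthVfrData(params, false)
	_, present := params["AUTH_VFR_DATA"]
	fmt.Println("salt forwarded:", present) // salt forwarded: false
}
```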
When we paste the differing raw bytes into a search engine, it points us to this issue in an open source Erlang Oracle driver. SPARC isn’t a factor. The customer’s 11g instance is configured to use older, Oracle 10g authentication, which our driver doesn’t have code for.
Act 5: The Resolution
The customer is justifiably expecting a response, so engineering writes up the diagnosis and sends it off to more customer-facing parts of the company to communicate. Meanwhile, we start coding and can’t easily get 10g authentication working in our driver—it’ll take more than a couple of hours for one engineer to implement, and once we’ve got it implemented we’ll need to retest the 21 old client/server pairs to ensure backwards compatibility as well as the 7 new pairs connecting to Oracle 10g. To make matters worse, we can’t find an affordable way to spin up a 10g instance to test against.
There’s some back and forth with the customer. Can you upgrade to 12c auth? Well, documentation implies that will cause all passwords in the system to be reset, and we’ve got a lot of them to manage. Can you upgrade to 11g auth? No, this 11g server is talking with a 10g server behind the scenes, and the two servers need a compatible auth method. We’re not Oracle DBAs and there are Oracle DBAs on their side, so it takes a little work to communicate how we imagine their instance is configured and what we think will happen if they change their settings.
A resolution is reached, however. We successfully conclude together they are using 10g auth, that we don’t support 10g auth, that it would take us more than a trivial amount of time to support 10g auth, and that technically this is an auth method that is not supported by Oracle anymore . . . . The customer, again knowing more about Oracle than we do, changes some secret settings on their end and the connection goes green. The ticket is closed, we don’t support Oracle 10g (yet), and we’ve succeeded at decoding something interesting.
There’s just one big problem remaining: as we support more databases, clients, and server types, the count of client/server pairs to test keeps escalating. The QA cost of diving in to fix a driver is increasing to the point of taking multiple days. And we wonder: how hard would it be to test all of these configurations in an automated fashion? I guess time will tell.
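A first sketch of what that automation might enumerate (the names are placeholders; a real harness would launch each client against each server and record pass/fail):

```go
package main

import "fmt"

// pairs expands the client/server matrix that we currently walk by hand.
func pairs(clients, servers []string) []string {
	var out []string
	for _, s := range servers {
		for _, c := range clients {
			out = append(out, c+" -> "+s)
		}
	}
	return out
}

func main() {
	clients := []string{"sqlplus", "JDBC6", "DBVis"} // 3 of the 7 we track
	servers := []string{"11g", "12c", "19c"}
	fmt.Println(len(pairs(clients, servers)), "pairs to test") // 9 pairs to test
}
```

With all seven clients and three servers the matrix is the 21 manual tests above; every new client or server version multiplies it again, which is the whole argument for automating.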