DAD eliminates top DB CPU consumer with a single environment variable
The Database Admin (DAD) team is responsible for DB performance, and let’s just say Procore’s growth is keeping us pretty busy. We recently noticed that
SET application_name = ? has been climbing the ranks of RDS Performance Insights (PI) until it became the #1 consumer of CPU on our DB cluster. We will share our story of how we found the originator of the top consumer and identified a simple one line fix.
What's going on?
This top consumer was quite a surprise to us. After all, a simple
SET statement takes less than a millisecond:
procore=> SET application_name = 'this_is_a_test'; SET Time: 0.822 ms
How often were we executing this SET statement for it to rise to the top of PI? As usual, we began with testing locally using a Rails server and a PostgreSQL server with
log_min_duration_statement = 0 to log 100% of queries. What we saw is that Rails would set the application name on the initial connection and that time alone. It looked like:
SET application_name = 'puma: cluster worker 0: 1 [procore]';.
Find the culprit
After witnessing Rails seemingly do the correct thing, we expanded our testing to include PgBouncer which we run in transaction mode to pool DB connections. For reference, our current architecture uses a PgBouncer sidecar container running alongside our Rails application containers on the same EC2 instance. After inserting PgBouncer into our local test setup, we found what was wrong. We started seeing things like so:
LOG: duration: 0.087 ms statement:
SET application_name = 'puma: cluster worker 0: 1 [procore]';
LOG: duration: 0.064 ms statement: begin;
LOG: duration: 0.059 ms statement: select 1;
LOG: duration: 0.033 ms statement: commit;
LOG: duration: 0.093 ms statement:
SET application_name = 'puma: cluster worker 0: 2 [procore]';
* Note that the application_name is different in each SET statement
We did some googling, and it turned up this PgBouncer issue on GitHub. Realizing that our application is setting a different application_name for each puma thread, we wondered whether by forcing the application names to be the same, we would stop PgBouncer from issuing all the
SET statements or whether we’d have to fork it and make other changes.
Test the Fix
Quickly, we reached out in the org to some of the folks who are more familiar with C. With their help, we learned that when running PgBouncer in `transaction` pooling mode, PgBouncer compares the previous setting of application_name with the current one, and happily issues a
SET application_name = ? statement on the app’s behalf if they are different. You can see the comparison here: https://github.com/pgbouncer/pgbouncer/blob/pgbouncer_1_15_0/src/varcache.c#L94-L96
So, we knew that if we set application_name to something with low cardinality, we could save the DB cluster a fair bit of work. Next, we looked at the pg gem to see how it was determining the application_name to use and whether it could be easily overridden. After perusing the code from the gem, we noticed that pg sets fallback_application_name to the process name, which is used when application_name is unset. We further theorized that we could simply use the
PGAPPNAME environment variable since the pg gem uses libpq under the hood. After a quick test of injecting the environment variable into our containers, we saw the undesirable behavior cease.
We then made a PR into our deployment system to set reasonable values for
PGAPPNAME that didn’t include the process name but were informative enough about the application server to be useful. And, voila! That SET statement dropped off the Performance Insights top 10 list! In fact, we believe that we may have dropped it out of the top 1000 :).
If you find that your database load by waits analysis includes a
SET application_name = ? and you use PgBouncer with the Ruby pg gem, then simply reduce the number of possible application_names. We hope this helps anyone who encounters this situation.
Visit our career site to start your career journey at Procore today!