rails_configuration

MySQL
could not obtain a database connection within 5 seconds. The max pool size is currently 5;
Passenger Concurrency optimization * COMPLETED - 7th July *

MySQL

MySql server has gone away - Chapter TWO

The issue has returned, morphed - stemming from a different root cause.

We still cannot pinpoint a single source for this issue but reasons on why it is different:

Occurs equally likely on both the servers (vs. it only happened on prodapp01 before the previous fix).
It has a different error handling (previously it was silent. Now it is a flash message – Code fix for the same committed).
It occurs rarely (vs. multiple times a day previously). App Usage has not changed much.

Other Observations

It occurs with the exact same query (company table, select subdomain - with different values)
Load on the MySQL server is around 1-2% which is comparatively (to historical data) high when the error occurs.

Current leads

reconnect: statement in database.yml

With this setting on, ActiveRecord will re-establish connection when it detects a failed connection.
Cannot use this due to no support for UTF in new connections.

Code change which will reconnect broken connections.

http://coderrr.wordpress.com/2009/01/08/activerecord-threading-issues-and-resolutions/ Possible issues: Transaction data with reconnects and rollbacks.

Increase MySQL Packet Size.

Currently at 16M.
Have not seen such big queries so far nor any large packet log messages.

Investigate aborted_clients.

—

Optimization that will help

Stop dns lookups.

https://tracker.exphosted.com/view.php?id=5744

Increase wait_timeout duration.

It is currently at 600 seconds. Increasing it to 24 hours recommended as our 'clients' are trusted and we have a pooled mechanism on Rails end.

 wait_timeout = 28800

Install NewRelic MySQL gem

Will help dig deeper and record more information. Installation steps after the following para.

MySql server has gone away

Fix checklist # 1 COMPLETE

Verify that no users are on prodapp01 (although there session will be migrated seamlessly, scenario like file upload has not been tested).
Disable prodapp01 in haproxy

#on lb server 
echo "disable server learnexa/prodapp01"  | socat stdio /var/run/haproxy/haproxy.sock

Verify using admin stats page
Stop GOD and Apache.

#as expprodl
/deploy/systasks/god.sh stop
~/bin/apache_sss.sh stop

Uninstall mysql gem, Mysql-Devel package.

gem uninstall mysql
yum erase Mysql-Devel

Install mysql-devel 5.0.95-5.el5_9

yum install mysql-devel

install mysql gem

gem install mysql -v 2.8.1

Start God and Apache
Test.

Notes

Mostly observed on App1.
MySQL documentation states that certain server-client pair version differences might cause this issue.
It is observed that this issue is more frequent on prodapp01.
It is observed that the mysql-devel library on prodapp01 is different from prodapp02.

** Schedule to upgrade the gem, asap.
No downtime necessary.

Identify performance bottlenecks

Using NewRelic MySQL Plugin

https://tracker.exphosted.com/view.php?id=5620
Installation steps (requires login) : https://rpm.newrelic.com/accounts/710727/plugins/directory/52
Also, add long_query_time either sing plugin conf or MySQL conf.
Key Points:

We already have JRE on db slave so install not needed.
Installation steps do not explicitly state whether MySQL server will be affected after install. Hence scheduling downtime for the first time install.
On Monday, long_query_time will be reset to 5.

This is a manual method

Recurring Replay checklist

Schedule downtime
Stop all God processes and apache on both app server
Backup DB.
Add the code below in /etc/my.cnf

sync_binlog=1
slow-query-log = 1 
slow-query-log-file = /var/log/mysql/mysql-slow.log
long_query_time = 0
log-queries-not-using-indexes

Stop Slave, Stop Master, Start Master, Start Slave.
Verify:
1. Replication status. If not, re-establish replication.
2. Slow queries are logged
Start GOD processes and apache
Test
Generate load over period of next 24 hours from each feature functionality.
Also, monitor the log file size over the “test” period.
Based on the Notes below (in the next Notes section), I do not expect this change to overwhelm current systems.

Rollback

Rollback will be 24 hours or earlier.
Steps for rollback will be the same, except instead of adding the “code” to my.cnf, we will remove it.

Fix checklist # 2 COMPLETE

Schedule downtime
Stop all God processes and apache on both app server
Backup DB.
Add the code below in /etc/my.cnf

sync_binlog=1
slow-query-log = 1 
slow-query-log-file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log-queries-not-using-indexes

Stop Slave, Restart each, Start Master, Start Slave.
Verify:
1. Replication status. If not, re-establish replication.
2. Slow queries are logged
Start GOD processes and apache
Test
Generate load over period of next 24 hours from each feature functionality.
Also, monitor the log file size over the “test” period.
Based on the Notes below (in the next Notes section), I do not expect this change to overwhelm current systems.

Rollback

Rollback will be 24 hours or earlier.
Steps for rollback will be the same, except instead of adding the “code” to my.cnf, we will remove it.

Notes

Slow running queries & errors.

|Master||
|  Aborted_clients                         | 8924         |
| Aborted_connects                         | 14   |
| Slow_queries                             | 121          |
|Slave||
| Aborted_clients                          | 7360         |
| Aborted_connects                         | 4            |
| Slow_queries                             | 0          |

Disk IO.

|Master||
| Created_tmp_disk_tables                  | 41209        |
| Created_tmp_files                        | 49           |
| Created_tmp_tables                       | 1138451      |
|Slave||
| Created_tmp_disk_tables                  | 17750        |
| Created_tmp_files                        | 15           |
| Created_tmp_tables                       | 1403820      |

Other optimizations to research into in future.

Concurrent threads
Buffers (On slave)
Max connections
IO wait time for mysql process - if any.

could not obtain a database connection within 5 seconds. The max pool size is currently 5;

https://tracker.exphosted.com/view.php?id=5696

Notes

A quick string match through the code base to find any use of class connection_pool's checkout returned zero hits.
Requested Udaya to ensure that no connections were manually checked out in the codebase.
It is very likely that we need to increase the number of max connections.
Udaya mentioned that the pool was increased to 10 connections on QA environment and occurrence of the error message was reduced. Need to verify the associated resource cost.
Report back with resource cost of increasing the pool by 25.

Test results ( on dev )

Factors to consider:

Limited resources.
Multiple roles on same machine.
MySQL RAM usage is not displayed. MySQL CPU and RAM usage was consistently linear with a very small step up corresponding to increase in number of requests.

5	6.95+940.22	~38-52%	2000-50	15.2	61.333	7	12	Not sure why Passenger is consuming ~200M more here
15	13.82+766.20	same	same	17.17	56.949	3	1
23	13.82+798.08	same	same	18.19	58.190	0	0
30	20.25+798.98	same	same	18.17	58.2	0	0

Recommendations and next steps

Wait for a few days to see if Rufus complains on DEV environment. However, we are good to increase the pool to 23, monitor for a few days and then to 30.

Prod checklist

0) Check that no sessions are on the server.
1) Bring app server out of rotation.
2) Increase max connections in database.yml - 23
3) Restart passenger.
4) Quick test and put server back in rotation.
5) Repeat for the other app server.
6) Monitor app server and db logs from this point on.

Passenger Concurrency optimization * COMPLETED - 7th July *

1) Bring app server out of pool.
2) Edit /opt/apache2/conf/httpd.conf, add

PassengerMaxPoolSize 13

3) Restart Passenger
4) Test.
5) If no issues:
6) Put back app server in pool and loop with the other app server.

Rollback:
1) Bring app server out of pool
2) Remove PassengerMaxPoolSize line.
3) Restart passenger (and make the first request).
https://tracker.exphosted.com/view.php?id=5683

Table of Contents

MySQL

MySql server has gone away - Chapter TWO

Other Observations

Current leads

Optimization that will help

MySql server has gone away

Fix checklist # 1 *COMPLETE*

Notes

Identify performance bottlenecks

Using NewRelic MySQL Plugin

This is a manual method

Recurring Replay checklist

Rollback

Fix checklist # 2 *COMPLETE*

Rollback

Notes

could not obtain a database connection within 5 seconds. The max pool size is currently 5;

Notes

Test results ( on dev )

Recommendations and next steps

Prod checklist

Passenger Concurrency optimization * COMPLETED - 7th July *

Fix checklist # 1 COMPLETE

Fix checklist # 2 COMPLETE