Table of Contents

MySQL

MySql server has gone away - Chapter TWO

We still cannot pinpoint a single source for this issue but reasons on why it is different:

  1. Occurs equally likely on both the servers (vs. it only happened on prodapp01 before the previous fix).
  2. It has a different error handling (previously it was silent. Now it is a flash message – Code fix for the same committed).
  3. It occurs rarely (vs. multiple times a day previously). App Usage has not changed much.
Other Observations
  1. It occurs with the exact same query (company table, select subdomain - with different values)
  2. Load on the MySQL server is around 1-2% which is comparatively (to historical data) high when the error occurs.
Current leads

With this setting on, ActiveRecord will re-establish connection when it detects a failed connection.
Cannot use this due to no support for UTF in new connections.

http://coderrr.wordpress.com/2009/01/08/activerecord-threading-issues-and-resolutions/ Possible issues: Transaction data with reconnects and rollbacks.

Currently at 16M.
Have not seen such big queries so far nor any large packet log messages.

Optimization that will help

https://tracker.exphosted.com/view.php?id=5744

It is currently at 600 seconds. Increasing it to 24 hours recommended as our 'clients' are trusted and we have a pooled mechanism on Rails end.

 wait_timeout = 28800

Will help dig deeper and record more information. Installation steps after the following para.

MySql server has gone away

Fix checklist # 1 *COMPLETE*
#on lb server 
echo "disable server learnexa/prodapp01"  | socat stdio /var/run/haproxy/haproxy.sock
#as expprodl
/deploy/systasks/god.sh stop
~/bin/apache_sss.sh stop
gem uninstall mysql
yum erase Mysql-Devel 
yum install mysql-devel 
gem install mysql -v 2.8.1
Notes
  1. Mostly observed on App1.
    MySQL documentation states that certain server-client pair version differences might cause this issue.
  2. It is observed that this issue is more frequent on prodapp01.
  3. It is observed that the mysql-devel library on prodapp01 is different from prodapp02.

** Schedule to upgrade the gem, asap.
No downtime necessary.

Identify performance bottlenecks

Using NewRelic MySQL Plugin

https://tracker.exphosted.com/view.php?id=5620
Installation steps (requires login) : https://rpm.newrelic.com/accounts/710727/plugins/directory/52
Also, add long_query_time either sing plugin conf or MySQL conf.
Key Points:

  1. We already have JRE on db slave so install not needed.
  2. Installation steps do not explicitly state whether MySQL server will be affected after install. Hence scheduling downtime for the first time install.
  3. On Monday, long_query_time will be reset to 5.

This is a manual method

Recurring Replay checklist
sync_binlog=1
slow-query-log = 1 
slow-query-log-file = /var/log/mysql/mysql-slow.log
long_query_time = 0
log-queries-not-using-indexes
Rollback

Rollback will be 24 hours or earlier.
Steps for rollback will be the same, except instead of adding the “code” to my.cnf, we will remove it.

Fix checklist # 2 *COMPLETE*
sync_binlog=1
slow-query-log = 1 
slow-query-log-file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log-queries-not-using-indexes
Rollback

Rollback will be 24 hours or earlier.
Steps for rollback will be the same, except instead of adding the “code” to my.cnf, we will remove it.

Notes
|Master||
|  Aborted_clients                         | 8924         |
| Aborted_connects                         | 14   |
| Slow_queries                             | 121          |
|Slave||
| Aborted_clients                          | 7360         |
| Aborted_connects                         | 4            |
| Slow_queries                             | 0          |
|Master||
| Created_tmp_disk_tables                  | 41209        |
| Created_tmp_files                        | 49           |
| Created_tmp_tables                       | 1138451      |
|Slave||
| Created_tmp_disk_tables                  | 17750        |
| Created_tmp_files                        | 15           |
| Created_tmp_tables                       | 1403820      |
  1. Concurrent threads
  2. Buffers (On slave)
  3. Max connections
  4. IO wait time for mysql process - if any.

could not obtain a database connection within 5 seconds. The max pool size is currently 5;

https://tracker.exphosted.com/view.php?id=5696

Notes
  1. A quick string match through the code base to find any use of class connection_pool's checkout returned zero hits.
  2. Requested Udaya to ensure that no connections were manually checked out in the codebase.
  3. It is very likely that we need to increase the number of max connections.
  4. Udaya mentioned that the pool was increased to 10 connections on QA environment and occurrence of the error message was reduced. Need to verify the associated resource cost.
  5. Report back with resource cost of increasing the pool by 25.
Test results ( on dev )

Factors to consider:

  1. Limited resources.
  2. Multiple roles on same machine.
  3. MySQL RAM usage is not displayed. MySQL CPU and RAM usage was consistently linear with a very small step up corresponding to increase in number of requests.

| Pool Size | Apache + All Passenger processes (private_dirty)MB | CPU | No. of Requests - Concurrent Users |RPS| Response Time Mean|Aborted_clients| Aborted_connects | Notes |

5 6.95+940.22~38-52%2000-5015.261.333712 Not sure why Passenger is consuming ~200M more here
15 13.82+766.20 same same17.1756.94931
2313.82+798.08 same same18.1958.19000
3020.25+798.98 same same 18.17 58.200
Recommendations and next steps

Wait for a few days to see if Rufus complains on DEV environment. However, we are good to increase the pool to 23, monitor for a few days and then to 30.

Prod checklist

0) Check that no sessions are on the server.
1) Bring app server out of rotation.
2) Increase max connections in database.yml - 23
3) Restart passenger.
4) Quick test and put server back in rotation.
5) Repeat for the other app server.
6) Monitor app server and db logs from this point on.

Passenger Concurrency optimization * COMPLETED - 7th July *

1) Bring app server out of pool.
2) Edit /opt/apache2/conf/httpd.conf, add

PassengerMaxPoolSize 13

3) Restart Passenger
4) Test.
5) If no issues:
6) Put back app server in pool and loop with the other app server.

Rollback:
1) Bring app server out of pool
2) Remove PassengerMaxPoolSize line.
3) Restart passenger (and make the first request).
https://tracker.exphosted.com/view.php?id=5683