==== MySQL ====
=== MySQL server has gone away - Chapter TWO ===
* The issue has returned in a new form, stemming from a different root cause. \\
We still cannot pinpoint a single source for this issue, but here is why it is different:
- It now occurs with roughly equal likelihood on both servers (previously it happened only on prodapp01, before the earlier fix).
- The error handling is different: previously it failed silently; now it surfaces as a flash message (a code fix for this has been committed).
- It occurs rarely (vs. multiple times a day previously), while app usage has not changed much.
== Other Observations ==
- It occurs with the exact same query (a subdomain select on the company table, with different values each time).
- Load on the MySQL server is around 1-2% when the error occurs, which is high compared with historical data.
== Current leads ==
* reconnect: setting in database.yml
With this setting on, ActiveRecord re-establishes the connection when it detects a failed one. \\
Cannot use this: connections re-established this way do not get our UTF-8 settings applied.
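For reference, the setting we are rejecting would look roughly like this in database.yml (a sketch; the adapter/encoding/pool values are illustrative, not copied from our config):

```yaml
# database.yml sketch -- NOT adopted: reconnected sessions lose our UTF-8 settings.
production:
  adapter: mysql
  encoding: utf8
  reconnect: true   # ActiveRecord re-establishes dropped connections automatically
  pool: 5
```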
* Code change that reconnects broken connections.
http://coderrr.wordpress.com/2009/01/08/activerecord-threading-issues-and-resolutions/
Possible issues: transaction data interacting badly with reconnects and rollbacks.
* Increase MySQL packet size (max_allowed_packet).
Currently at 16M. \\ We have not seen queries that large so far, nor any large-packet log messages.
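If we ever do see large-packet errors, the change would be one my.cnf line plus a restart (the value below is illustrative, not a recommendation); the current value can be checked with SHOW VARIABLES LIKE 'max_allowed_packet';:

```ini
# /etc/my.cnf sketch -- only if large-packet errors actually appear
[mysqld]
max_allowed_packet = 32M
```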
* Investigate aborted_clients.
----
== Optimizations that will help ==
* Stop dns lookups.
https://tracker.exphosted.com/view.php?id=5744
* Increase wait_timeout duration.
It is currently at 600 seconds. Increasing it to 8 hours (28800 seconds, the MySQL default) is recommended, since our 'clients' are trusted and we have a pooled mechanism on the Rails end.
wait_timeout = 28800
* Install the NewRelic MySQL plugin.
Will help us dig deeper and record more information. Installation steps are in the "Using NewRelic MySQL Plugin" section below.
=== MySQL server has gone away ===
== Fix checklist # 1 *COMPLETE*==
* Verify that no users are on prodapp01 (although their sessions would be migrated seamlessly, scenarios like an in-flight file upload have not been tested).
* Disable prodapp01 in haproxy
#on lb server
echo "disable server learnexa/prodapp01" | socat stdio /var/run/haproxy/haproxy.sock
* Verify using admin stats page
* Stop GOD and Apache.
#as expprodl
/deploy/systasks/god.sh stop
~/bin/apache_sss.sh stop
* Uninstall the mysql gem and the mysql-devel package.
gem uninstall mysql
yum erase mysql-devel
* Install mysql-devel 5.0.95-5.el5_9
yum install mysql-devel-5.0.95-5.el5_9
* Install the mysql gem
gem install mysql -v 2.8.1
* Start God and Apache
* Test.
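To put prodapp01 back into rotation after testing, the mirror of the disable command above should work (assuming the same haproxy socket path); a small sketch, with a helper name of our own invention:

```shell
# Build the haproxy admin-socket command for a given action ("enable"/"disable").
# The helper is hypothetical; the backend/server names and socket path are the
# ones used in the disable step of this checklist.
haproxy_cmd() {
  echo "$1 server learnexa/prodapp01"
}

# On the lb server, pipe the command into the admin socket:
#   haproxy_cmd enable | socat stdio /var/run/haproxy/haproxy.sock
haproxy_cmd enable
```

As with the disable step, verify the result on the haproxy admin stats page afterwards.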
==Notes==
- Mostly observed on App1. \\ MySQL documentation states that certain server-client version mismatches can cause this [[http://dev.mysql.com/doc/refman/5.0/en/gone-away.html|issue]].
- It is observed that this issue is more frequent on prodapp01.
- It is observed that the mysql-devel library on prodapp01 differs from the one on prodapp02.
** Schedule an upgrade of the gem ASAP. \\ No downtime necessary.
=== Identify performance bottlenecks ===
=== Using NewRelic MySQL Plugin ===
https://tracker.exphosted.com/view.php?id=5620 \\
Installation steps (requires login) : https://rpm.newrelic.com/accounts/710727/plugins/directory/52 \\
Also, set long_query_time either via the plugin conf or the MySQL conf. \\
Key Points: \\
- We already have a JRE on the db slave, so no separate install is needed.
- The installation steps do not explicitly state whether the MySQL server is affected during install, hence we are scheduling downtime for the first-time install.
- On Monday, long_query_time will be reset to 5 seconds.
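Assuming the running server version supports dynamic update of this variable, the Monday reset can be done at runtime without a restart (it applies to new connections); a sketch:

```sql
-- Runtime change; verify our MySQL version treats this variable as dynamic.
SET GLOBAL long_query_time = 5;
SHOW GLOBAL VARIABLES LIKE 'long_query_time';
```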
=== This is a manual method ===
== Recurring Replay checklist ==
* Schedule downtime
* Stop all God processes and apache on both app servers
* Backup DB.
* Add the code below in /etc/my.cnf
sync_binlog=1
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log
long_query_time = 0
log-queries-not-using-indexes
* Stop Slave, Stop Master, Start Master, Start Slave.
* Verify:
- Replication status; if replication is broken, re-establish it.
- Slow queries are being logged.
* Start GOD processes and apache
* Test
* Generate load over the next 24 hours, exercising each feature/functionality.
* Also, monitor the log file size over the "test" period. \\ Based on the Notes section below, I do not expect this change to overwhelm the current systems.
== Rollback ==
Rollback will happen within 24 hours, or earlier. \\ Rollback steps are the same, except that instead of adding the lines to my.cnf, we remove them.
== Fix checklist # 2 *COMPLETE* ==
* Schedule downtime
* Stop all God processes and apache on both app servers
* Backup DB.
* Add the code below in /etc/my.cnf
sync_binlog=1
slow-query-log = 1
slow-query-log-file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log-queries-not-using-indexes
* Stop Slave, restart each MySQL server, Start Master, Start Slave.
* Verify:
- Replication status; if replication is broken, re-establish it.
- Slow queries are being logged.
* Start GOD processes and apache
* Test
* Generate load over the next 24 hours, exercising each feature/functionality.
* Also, monitor the log file size over the "test" period. \\ Based on the Notes section below, I do not expect this change to overwhelm the current systems.
== Rollback ==
Rollback will happen within 24 hours, or earlier. \\ Rollback steps are the same, except that instead of adding the lines to my.cnf, we remove them.
==Notes==
* Slow running queries & errors.
^ Master ^^
| Aborted_clients | 8924 |
| Aborted_connects | 14 |
| Slow_queries | 121 |
^ Slave ^^
| Aborted_clients | 7360 |
| Aborted_connects | 4 |
| Slow_queries | 0 |
* Disk IO.
^ Master ^^
| Created_tmp_disk_tables | 41209 |
| Created_tmp_files | 49 |
| Created_tmp_tables | 1138451 |
^ Slave ^^
| Created_tmp_disk_tables | 17750 |
| Created_tmp_files | 15 |
| Created_tmp_tables | 1403820 |
* Other optimizations to research in the future.
- Concurrent threads
- Buffers (On slave)
- Max connections
- IO wait time for mysql process - if any.
==== could not obtain a database connection within 5 seconds. The max pool size is currently 5; ====
https://tracker.exphosted.com/view.php?id=5696
== Notes ==
- A quick string match through the code base for any use of ConnectionPool's checkout method returned zero hits.
- Requested Udaya to ensure that no connections are manually checked out anywhere in the codebase.
- It is very likely that we need to increase the maximum number of pooled connections.
- Udaya mentioned that the pool was increased to 10 connections on the QA environment and the error occurred less often. Need to verify the associated resource cost.
- Report back with the resource cost of increasing the pool by 25.
== Test results ( on dev )==
Factors to consider:
- Limited resources.
- Multiple roles on the same machine.
- MySQL RAM usage is not displayed. MySQL CPU and RAM usage were consistently linear, with a very small step up corresponding to each increase in the number of requests.
^ Pool Size ^ Apache + all Passenger processes (private_dirty) MB ^ CPU ^ No. of Requests - Concurrent Users ^ RPS ^ Response Time Mean ^ Aborted_clients ^ Aborted_connects ^ Notes ^
|5| 6.95+940.22|~38-52%|2000-50|15.2|61.333|7|12| Not sure why Passenger is consuming ~200M more here|
|15| 13.82+766.20| same| same|17.17|56.949|3|1||
|23|13.82+798.08| same |same|18.19|58.190|0|0||
|30|20.25+798.98| same | same | 18.17 | 58.2|0|0||
== Recommendations and next steps ==
Wait a few days to see if Rufus complains on the DEV environment. If all is well, we can increase the pool to 23, monitor for a few days, and then move to 30.
== Prod checklist ==
0) Check that no sessions are on the server. \\
1) Bring app server out of rotation. \\
2) Increase the connection pool size in database.yml to 23. \\
3) Restart passenger. \\
4) Quick test and put server back in rotation. \\
5) Repeat for the other app server. \\
6) Monitor app server and db logs from this point on. \\
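For step 2 above, the change is the pool value in database.yml (a sketch; keys other than pool are illustrative, not copied from our config):

```yaml
production:
  adapter: mysql
  encoding: utf8
  pool: 23   # was 5; chosen per the dev test results above
```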
==== Passenger Concurrency optimization * COMPLETED - 7th July * ====
1) Bring app server out of pool. \\
2) Edit /opt/apache2/conf/httpd.conf, add \\
PassengerMaxPoolSize 13
3) Restart Passenger \\
4) Test. \\
5) If there are no issues: \\
6) Put the app server back in the pool and repeat with the other app server. \\
Rollback: \\
1) Bring app server out of pool \\
2) Remove PassengerMaxPoolSize line. \\
3) Restart passenger (and make the first request). \\
https://tracker.exphosted.com/view.php?id=5683