You will need VPN and server access.
Configuration: LB→Apache→Passenger
13 Passenger instances each.
126 sessions.
Total: 4 cores, 8G RAM per server.
Key tasks: Video & Doc conversion, Passenger. (Memcached and Apache have very small footprints.)
Optimal state: reserve 25% of RAM for the OS + 25% for (burstable) conversion tasks.
That leaves us with 4G of RAM for Passenger, which is good.
4096 / 220 (per-process size, adjusted for bloat) ~= 18.61 processes.
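The sizing math above can be sketched as a quick shell check (all numbers are from this doc; integer division truncates the 18.61 figure):

```shell
# Passenger sizing math: 8G total, half reserved (25% OS + 25% conversion).
TOTAL_MB=8192                          # 8G RAM per server
AVAIL_MB=$((TOTAL_MB / 2))             # left for Passenger: 4096 MB
PER_PROC_MB=220                        # per-process size, adjusted for bloat
MAX_PROCS=$((AVAIL_MB / PER_PROC_MB))
echo "$MAX_PROCS"                      # 4096 / 220 -> 18 (integer division)
```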
Load balancer stats page - number of sessions currently logged on to each app server.
passenger-status - watch the global queue and the number of instances in use.
passenger-memory-stats - watch for bloating processes.
~/bin/apache_sss.sh stop|start
/deploy/systasks/god.sh stop|start
Papertrail includes App Logs and System Logs.
1) Check whether users are on this server. See: load balancer stats page in key commands.
2) ~/bin/apache_sss.sh stop
3) ~/bin/apache_sss.sh start
No downtime unless all app servers are down; if they are, MAINT mode will be triggered via the LB hook.
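The user check in step 1 can gate the restart; a minimal sketch, with the session count hard-coded for illustration (in practice it comes from the load balancer stats page):

```shell
# Gate the restart on the LB-reported session count (hard-coded here).
SESSIONS=0   # in practice: read from the load balancer stats page
if [ "$SESSIONS" -eq 0 ]; then
  echo "safe to restart"
  # ~/bin/apache_sss.sh stop && ~/bin/apache_sss.sh start
else
  echo "users still connected: drain or wait"
fi
```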
Zabbix / New Relic alerts.
Passenger / Apache restart: < 1 min.
Server restart: < 5 min.
~/bin/toggle_server_state.sh enable|disable
1) Log on to OpSource CloudUI.
2) Start prodapp03; it will automatically be added to the LB once started. This should take about 15 minutes, the majority of which is brief app testing once it comes up. *
3) Gracefully shut down the app server, checking sessions via the LB stats page first.+
* This will not scale well during a huge burst, but in the very near future (2-3 weeks), app servers will autoscale: a monitoring service will notify an endpoint when total sessions hit 75% of capacity, triggering the provisioning and boot of an image.
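The 75% trigger described above, sketched with this doc's 126-sessions-per-server capacity (threshold math only; the actual provisioning call is out of scope):

```shell
CAPACITY=126          # sessions per app server (from this doc)
CURRENT=95            # example value; supplied by the monitoring service
PCT=$((100 * CURRENT / CAPACITY))
if [ "$PCT" -ge 75 ]; then
  echo "scale-out"    # notify the provisioning endpoint here
else
  echo "ok"
fi
```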
Integration with the deploy script to spawn a detached process that will deploy the new version to all dormant app servers, incl. DR.
Downtime imminent.
Take a backup.
Zabbix / New Relic alerts.
MySQL restart: < 1 min.
Restart Replication (proper, non-broken): < 3 min.
Restart Replication (broken): 15-20 min (DB size: ~60M).
Server restart: < 5 min (plus the above).
~/bin/mysql_sss.sh stop|start
With the MySQL client: show master status;
Papertrail includes App Logs and System Logs.
Restart Replication:
* Downtime imminent. *
1) Verify that the slave is stopped. Critical.
2) ~/bin/mysql_sss.sh stop
3) ~/bin/mysql_sss.sh start
* Downtime imminent. *
1) Stop Slave Thread on Slave server.
2) ~/bin/mysql_sss.sh stop
3) ~/bin/mysql_sss.sh start
TBD - wiki article: https://wiki.exphosted.com/doku.php/db_optimization
~/bin/mysql_sss.sh stop|start
With the MySQL client: show slave status;
With the MySQL client: stop slave;
sync
Papertrail includes App Logs and System Logs.
Downtime imminent.
1) Verify that the slave is up to date: issue 'show master status' on the master and 'show slave status' on the slave, and cross-check the log positions.
2) With the MySQL client: stop slave;
3) With the MySQL client: show slave status; - verify that the slave threads are stopped.
4) ~/bin/mysql_sss.sh stop
5) ~/bin/mysql_sss.sh start
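The check in step 3 can be scripted by parsing `show slave status` output. A sketch with hard-coded sample output so it can be tried offline; a real run would feed it `mysql -e 'show slave status\G'`:

```shell
# Parse the two replication-thread flags out of sample SHOW SLAVE STATUS\G
# output (illustrative values, not from a live server).
STATUS='Slave_IO_Running: No
Slave_SQL_Running: No
Exec_Master_Log_Pos: 107'
IO=$(echo "$STATUS" | awk -F': ' '/Slave_IO_Running/ {print $2}')
SQL=$(echo "$STATUS" | awk -F': ' '/Slave_SQL_Running/ {print $2}')
if [ "$IO" = "No" ] && [ "$SQL" = "No" ]; then
  echo "slave threads stopped: safe to proceed"
fi
```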
* Downtime imminent. *
1) Stop Slave Thread on Slave server.
2) ~/bin/mysql_sss.sh stop
3) ~/bin/mysql_sss.sh start
Loss of active LiveEvents is imminent.
There is no known recovery mechanism for them, or for any events scheduled during the downtime.
Zabbix / New Relic alerts.
BBB restart: < 5 min.
Server restart: < 5 min.
* Supervise *
Zabbix / New Relic alerts.
bbb-conf --check
bbb-conf --restart
Papertrail includes App Logs and System Logs.
Warning: This will end all active meetings.
bbb-conf --restart
TBD - sharding seems a possible option, or extending the MConf library.
2:00 am restart every week.
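The weekly 2:00 am restart above could be scheduled via cron; a hypothetical crontab entry (the day-of-week and the bbb-conf path are assumptions, pick the lowest-traffic night):

```
# m h dom mon dow  command
0 2 * * 0 /usr/bin/bbb-conf --restart
```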
Soft down. Hard down recommended.
* Supervise *
* Zabbix / New Relic alerts. *
Tomcat restart: < 1 min.
Reindex: < 5 min.
On Slave Server
~/bin/tomcat_sss.sh stop|start
Papertrail includes App Logs and System Logs.
* Downtime imminent. *
~/bin/tomcat_sss.sh stop
~/bin/tomcat_sss.sh start
Discuss on wiki: https://wiki.exphosted.com/doku.php/solr_configuration
Issue optimize statement nightly.
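The nightly optimize above could also run from cron; a hypothetical entry (the Solr URL and time are assumptions, not taken from this deployment):

```
# m h dom mon dow  command
30 2 * * * curl -s 'http://localhost:8983/solr/update?optimize=true'
```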
* Supervise *
Zabbix / New Relic alerts.
GlusterFS restart: < 1 min.
On BBB server
service glusterd start|stop|restart
mount
chown
Papertrail includes App Logs and System Logs.
service glusterd restart
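After a glusterd restart, a quick sanity check that the volume came back before app traffic resumes; a sketch, assuming a mount point path (not taken from this deployment):

```shell
# Post-restart sanity check: is the Gluster volume actually mounted?
MOUNT_POINT=/mnt/gluster   # assumption: adjust to the real mount point
if grep -qs "$MOUNT_POINT" /proc/mounts; then
  STATE="mounted"
else
  STATE="not mounted"
fi
echo "$STATE"   # if not mounted: remount, then re-check ownership (chown)
```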
GlusterFS's core is its distributed file system. Scaling the IOPS is a matter of adding more servers and routing the requests intelligently.
Wiki doc: TBD
On DB server
Papertrail includes App Logs and System Logs.
Shared between app servers. Hosted on App1.
“God” controlled.
“God” alerts.
God restart: < 1.5 min.
/deploy/systasks/god.sh start
Papertrail includes App Logs and System Logs.
* Critical: run as the regular user (not root). *
/deploy/systasks/god.sh stop
/deploy/systasks/god.sh start
Dalli upgrade.
* Supervise * Zabbix / New Relic alerts. Restart: < 1 min.
service haproxy reload
service haproxy restart
service haproxy start
echo "COMMAND" | socat stdio /var/run/haproxy/haproxy.sock
E.g., to disable app server 2: echo "disable server learnexa/prodapp02" | socat stdio /var/run/haproxy/haproxy.sock
To enable app server 2: echo "enable server learnexa/prodapp02" | socat stdio /var/run/haproxy/haproxy.sock
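The two one-liners above can be wrapped in a small helper (backend name learnexa and the socket path are from this doc). The sketch only builds the command string so it can be tried without a live HAProxy socket:

```shell
SOCK=/var/run/haproxy/haproxy.sock
toggle_app() {  # usage: toggle_app enable|disable prodapp02
  printf '%s server learnexa/%s\n' "$1" "$2"
}
# live use: toggle_app disable prodapp02 | socat stdio "$SOCK"
toggle_app disable prodapp02   # prints: disable server learnexa/prodapp02
```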
Papertrail includes App Logs and System Logs.
service haproxy reload → This will ensure that currently connected clients are not interrupted.
DNS level, Active-Active instances.
All servers are backed up nightly. See the dedicated wiki article for the backup policy and step-by-step instructions.
Do an out-of-place upgrade first. (Can't stress this enough:) don't go the in-place route unless:
1) You backed up the server just before it went down.
2) You don't care (you should, by the way) about data loss.
Backups are delivered over to VMWare Server in SC as well. Details included in backup policy.
We are also working towards a DR site, so that is a better option.