
Learnexa Infra guidebook:

You will need VPN and server access.

App Role:

Configuration: LB→Apache→Passenger

Thresholds:

13 Passenger Instances each.
126 sessions

How?

Each server has 4 cores and 8G of RAM.
Key tasks: video & doc conversion, Passenger. (Memcached and Apache have very small footprints.)
Optimal state: reserve 25% of RAM for the OS and 25% for conversion tasks (burstable).
That leaves 4G of RAM for Passenger, which is good.
4096 MB / 220 MB per process (adjusted for bloat) ~= 18.61 processes.
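The sizing math above can be sketched with shell arithmetic; the 8G total, the 50% reservation, and the 220 MB per-process figure are the numbers from this section, and integer division floors the result:

```shell
# Passenger pool sizing, using the figures from this section.
total_mb=8192               # 8G RAM per server
usable_mb=$((total_mb / 2)) # 25% reserved for OS + 25% for conversion tasks
per_proc_mb=220             # approximate Passenger process RSS, adjusted for bloat
max_procs=$((usable_mb / per_proc_mb))
echo "max Passenger processes: $max_procs"   # prints: max Passenger processes: 18
```

The configured 13 instances per server sit comfortably under this ~18-process ceiling.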

Key Commands:

Load balancer stats page – shows the number of sessions currently active on each app server.
passenger-status – watch the global queue and the number of instances in use.
passenger-memory-stats – watch for bloating processes.
~/bin/apache_sss.sh stop|start
/deploy/systasks/god.sh stop|start

Logs:

Papertrail includes App Logs and System Logs.

Restart:

1) Check whether users are still on this server (see the load balancer stats page under Key Commands).
2) ~/bin/apache_sss.sh stop
3) ~/bin/apache_sss.sh start

Recovery:

No downtime unless all app servers are down; in that case the MAINT page is triggered via the LB hook.
Zabbix / New relic alerts.
Passenger / Apache restart: < 1 min.
Server restart: < 5 min.

Disable in pool:

~/bin/toggle_server_state.sh enable|disable

Scale:

1) Log on to OpSource CloudUI.
2) Start prodapp03; it will be added to the LB automatically once started. This takes about 15 minutes, most of which is brief app testing once it comes up.*
3) To scale back down, gracefully shut down an app server after confirming via the LB stats page that its sessions have drained.
* This will not cope well with a large traffic burst, but in the very near future (2-3 weeks) app servers will autoscale: a monitoring service will notify an endpoint when total sessions hit 75% of capacity, triggering provisioning and boot of an image.
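The planned 75% trigger could look roughly like the sketch below. The 126 sessions/server figure comes from Thresholds above; the server count, the sample session count, and the notification mechanism itself are assumptions:

```shell
# Hypothetical autoscale check: fire when total sessions reach 75% of capacity.
servers=2                         # assumed number of active app servers
per_server=126                    # max sessions per server (from Thresholds)
capacity=$((servers * per_server))
threshold=$((capacity * 75 / 100))
sessions=200                      # would come from the LB stats page in practice
if [ "$sessions" -ge "$threshold" ]; then
  echo "scale-up: provision and boot an app server image"
fi
```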

Recurring tasks:

Integrate with the deploy script to spawn a detached process that deploys the new version to all dormant app servers, including DR.

DB Role:

DB Recovery:

Downtime imminent.
Take a backup. Zabbix / New relic alerts.
MySQL restart: < 1 min.
Restart replication (healthy, not broken): < 3 min.
Restart replication (broken): 15-20 min (DB size: ~60M).
Server restart: < 5 min. (+ the above)

Master

Key Commands:

~/bin/mysql_sss.sh stop|start
With the MySQL client: show master status;

Logs:

Papertrail includes App Logs and System Logs.

Restart Replication:

* Downtime imminent. *
1) Verify that the slave is stopped (critical).
2) ~/bin/mysql_sss.sh stop
3) ~/bin/mysql_sss.sh start

Restart Master:

* Downtime imminent. *
1) Stop Slave Thread on Slave server.
2) ~/bin/mysql_sss.sh stop
3) ~/bin/mysql_sss.sh start

Scale:

TBD - wiki article: https://wiki.exphosted.com/doku.php/db_optimization

Slave:

Key Commands:

~/bin/mysql_sss.sh stop|start
With the MySQL client: show slave status;
With the MySQL client: stop slave;
sync (flush filesystem buffers to disk)

Logs:

Papertrail includes App Logs and System Logs.

Restart:

Downtime imminent.
1) Verify that the slave is up to date: run 'show master status' on the master and 'show slave status' on the slave, and cross-check the log file and position.
2) With the MySQL client: stop slave;
3) With the MySQL client: show slave status; – verify that the slave threads are stopped.
4) ~/bin/mysql_sss.sh stop
5) ~/bin/mysql_sss.sh start
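The cross-check in step 1 can be scripted. The file/position values below are placeholders: in practice you would substitute File/Position from 'show master status' on the master and Relay_Master_Log_File/Exec_Master_Log_Pos from 'show slave status' on the slave.

```shell
# Hypothetical replication cross-check; placeholder values, substitute the
# real output of 'show master status' and 'show slave status'.
master_file="mysql-bin.000042"; master_pos=120   # File / Position on master
slave_file="mysql-bin.000042";  slave_pos=120    # Relay_Master_Log_File / Exec_Master_Log_Pos on slave

if [ "$master_file" = "$slave_file" ] && [ "$master_pos" -eq "$slave_pos" ]; then
  echo "slave caught up: safe to stop"
else
  echo "slave lagging: wait before stopping"
fi
```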

Restart Slave:

* Downtime imminent. *
1) Stop Slave Thread on Slave server.
2) ~/bin/mysql_sss.sh stop
3) ~/bin/mysql_sss.sh start

Scale:

TBD - https://wiki.exphosted.com/doku.php/db_optimization

BBB Role:

BBB Recovery:

Loss of active LiveEvents is imminent.
There is no known recovery mechanism for active LiveEvents or for any scheduled during the downtime.
Zabbix / New relic alerts.
BBB restart: < 5 min.
Server restart: < 5 min.
* Supervise *

Thresholds:

https://wiki.exphosted.com/doku.php/stress_test

Key Commands:

bbb-conf --check
bbb-conf --restart

Logs:

Papertrail includes App Logs and System Logs.

Restart:

Warning: This will end all active meetings.
bbb-conf --restart

Scale:

TBD – sharding or extending the MConf library seem possible options.

Recurring tasks:

2:00 am restart every week.
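As a crontab sketch; the day of the week is an assumption, since the document only specifies 2:00 am weekly:

```shell
# Hypothetical crontab entry: restart BBB at 2:00 am every Sunday.
# Warning: this ends all active meetings, so pick a slot clear of LiveEvents.
0 2 * * 0 /usr/bin/bbb-conf --restart
```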

Solr:

Solr Recovery:

Soft down. Hard down recommended.
* Supervise *
* Zabbix / New relic alerts. *
Tomcat restart: < 1 min.
Reindex: < 5 min.

On Slave Server

Key Commands:

~/bin/tomcat_sss.sh stop|start

Logs:

Papertrail includes App Logs and System Logs.

Restart:

* Downtime imminent. *
~/bin/tomcat_sss.sh stop
~/bin/tomcat_sss.sh start

Scale:

Discuss on wiki: https://wiki.exphosted.com/doku.php/solr_configuration

Recurring tasks:

Issue optimize statement nightly.
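The nightly optimize could be scheduled like this; the Solr URL, port, and update path are assumptions and should be adjusted to the actual endpoint:

```shell
# Hypothetical crontab entry: optimize the Solr index nightly at 3:00 am.
# URL/port/path are assumptions; adjust to the real Solr update handler.
0 3 * * * curl -s "http://localhost:8983/solr/update?optimize=true"
```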

GlusterFS - Streams, Webcam recordings:

GlusterFS Recovery:

* Supervise *
Zabbix / New relic alerts.
GlusterFS restart: < 1 min.

Key Commands:

On the BBB server:
service glusterd start|stop|restart
mount
chown

Logs:

Papertrail includes App Logs and System Logs.

Restart:

service glusterd restart

Scale:

GlusterFS's core is its distributed file system. Scaling the IOPS is a matter of adding more servers and routing the requests intelligently.
Wiki doc: TBD

GlusterFS - Uploaded:

On DB server

Key Commands: same as above.

Logs:

Papertrail includes App Logs and System Logs.

Restart: same as above.

Scale: same as above.

Memcached:

Shared between app servers. Hosted on App1.

Memcached Recovery:

God-controlled. God alerts. God restart: < 1.5 min.

Key Commands:

/deploy/systasks/god.sh start

Logs:

Papertrail includes App Logs and System Logs.

Restart:

* Critical: run as the regular user. *
/deploy/systasks/god.sh stop
/deploy/systasks/god.sh start

Scale:

Dalli upgrade.

Haproxy:

Haproxy Recovery:

* Supervise *
Zabbix / New relic alerts.
Restart: < 1 min.

Key Commands:

service haproxy reload
service haproxy restart
service haproxy start
echo "COMMAND" | socat stdio /var/run/haproxy/haproxy.sock

Example – to disable app server 2: echo "disable server learnexa/prodapp02" | socat stdio /var/run/haproxy/haproxy.sock

To enable app server 2: echo "enable server learnexa/prodapp02" | socat stdio /var/run/haproxy/haproxy.sock
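A small hypothetical wrapper keeps the backend naming in one place. It only builds the command string, which you then pipe to socat exactly as in the examples above:

```shell
# Hypothetical helper for the HAProxy runtime-API commands shown above.
# Usage: hap_cmd enable|disable prodapp02 | socat stdio /var/run/haproxy/haproxy.sock
hap_cmd() {
  echo "$1 server learnexa/$2"
}

hap_cmd disable prodapp02   # prints: disable server learnexa/prodapp02
```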

Logs:

Papertrail includes App Logs and System Logs.

Restart:

service haproxy reload → This will ensure that currently connected clients are not interrupted.

Scale:

DNS level, Active-Active instances.

Crap, my server crashed

Backup

All servers are backed up nightly. See the dedicated wiki article for the backup policy and step-by-step instructions.
Prefer an out-of-place restore. Can't stress this enough: don't restore in place unless:
1) You backed up the server just before it went down, or
2) You don't care about data loss (you should, by the way).

Opsource crashed

Backups are also delivered to the VMware server in SC. Details are included in the backup policy.
We are also working towards a DR site, which will be a better option.