====== Learnexa Infra Guidebook ======
You will need VPN and server access. \\

===== App Role =====
Configuration: LB -> Apache -> Passenger \\

==== Thresholds ====
13 Passenger instances each. \\
126 sessions. \\

=== How? ===
Total: 4 cores, 8G RAM per server. \\
Key tasks: video & doc conversion, Passenger. (Memcached and Apache have a very small footprint.) \\
Optimal state: 25% free for the OS + 25% for conversion tasks (burstable). \\
That leaves 4G of RAM for Passenger, which is good: \\
4096 / 220 (adjusted for bloat) ~= 18.6 processes. \\

==== Key Commands ====
Loadbalancer stats page -- number of sessions currently logged on to an app server. \\
passenger-status -- watch the global queue and the number of instances used. \\
passenger-memory-stats -- watch for bloating processes. \\
~/bin/apache_sss.sh stop|start \\
/deploy/systasks/god.sh stop|start \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
1) Ensure no users are on this server. See: loadbalancer stats page in Key Commands. \\
2) ~/bin/apache_sss.sh stop \\
3) ~/bin/apache_sss.sh start \\

==== Recovery ====
No downtime unless all app servers are down; MAINT will be triggered via the LB hook. \\
Zabbix / New Relic alerts. \\
Passenger / Apache restart: < 1 min. \\
Server restart: < 5 min. \\

==== Disable in pool ====
~/bin/toggle_server_state.sh enable|disable \\

==== Scale ====
1) Log on to the OpSource Cloud UI. \\
2) Start prodapp03; it will automatically be added to the LB once started. This should take about 15 minutes, the majority of which is brief app testing once it comes up. * \\
3) To scale down, gracefully shut down the app server after checking sessions via the LB stats page. \\
* This will not scale well under a large burst, but in the very near future (2-3 weeks) app servers will autoscale: a monitoring service will notify an endpoint when the total session count hits 75% of availability, triggering provisioning and boot of an image. \\
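The RAM math above can be sanity-checked with a quick shell calculation. The 220 MB per-process figure is the bloat-adjusted estimate from the Thresholds section, and the two 25% reservations leave half of the 8G for Passenger:

```shell
# Sanity-check the Passenger pool-size estimate from the Thresholds section.
TOTAL_RAM_MB=8192                              # 8G per app server
RESERVED_MB=$((TOTAL_RAM_MB / 4 * 2))          # 25% OS + 25% conversion bursts
AVAILABLE_MB=$((TOTAL_RAM_MB - RESERVED_MB))   # 4096 MB left for Passenger
PER_PROCESS_MB=220                             # bloat-adjusted process size
MAX_PROCESSES=$((AVAILABLE_MB / PER_PROCESS_MB))
echo "Room for at most $MAX_PROCESSES Passenger processes"
```

This is why the configured 13 instances per server sit comfortably below the ~18-process ceiling, leaving headroom for bloated processes.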
==== Recurring Tasks ====
Integration with the deploy script to spawn a detached process that deploys the new version on all dormant app servers, incl. DR. \\

===== DB Role =====
==== DB Recovery ====
Downtime imminent. \\
Take a backup. Zabbix / New Relic alerts. \\
MySQL restart: < 1 min. \\
Restart replication (proper, non-broken): < 3 min. \\
Restart replication (broken): < 15-20 min (DB size: ~60M). \\
Server restart: < 5 min (+ the above). \\

===== Master =====
==== Key Commands ====
~/bin/mysql_sss.sh stop|start \\
With the MySQL client: show master status; \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart Replication ====
* Downtime imminent. * \\
1) Verify that the slave is stopped. Critical. \\
2) ~/bin/mysql_sss.sh stop \\
3) ~/bin/mysql_sss.sh start \\

==== Restart Master ====
* Downtime imminent. * \\
1) Stop the slave thread on the slave server. \\
2) ~/bin/mysql_sss.sh stop \\
3) ~/bin/mysql_sss.sh start \\

==== Scale ====
TBD -- wiki article: https://wiki.exphosted.com/doku.php/db_optimization \\

===== Slave =====
==== Key Commands ====
~/bin/mysql_sss.sh stop|start \\
With the MySQL client: show slave status; \\
With the MySQL client: stop slave; \\
sync \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
Downtime imminent. \\
1) Verify that the slave is up to date: issue 'show master status' on the master server and 'show slave status' on the slave server, and cross-check the log positions. \\
2) With the MySQL client: stop slave; \\
3) With the MySQL client: show slave status; -- verify that the slave threads are stopped. \\
4) ~/bin/mysql_sss.sh stop \\
5) ~/bin/mysql_sss.sh start \\

==== Restart Slave ====
* Downtime imminent. * \\
1) Stop the slave thread on the slave server. \\
2) ~/bin/mysql_sss.sh stop \\
3) ~/bin/mysql_sss.sh start \\

==== Scale ====
TBD -- https://wiki.exphosted.com/doku.php/db_optimization \\

===== BBB Role =====
==== BBB Recovery ====
Active LiveEvents loss imminent.
No known recovery mechanism for those, or for any scheduled during the downtime. \\
Zabbix / New Relic alerts. \\
BBB restart: < 5 min. \\
Server restart: < 5 min. \\
* Supervise * \\

==== Thresholds ====
https://wiki.exphosted.com/doku.php/stress_test \\

==== Key Commands ====
bbb-conf --check \\
bbb-conf --restart \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
Warning: this will end all active meetings. \\
bbb-conf --restart \\

==== Scale ====
TBD -- sharding seems a possible option, or extending the MConf library. \\

==== Recurring Tasks ====
2:00 am restart every week. \\

===== Solr =====
==== Solr Recovery ====
Soft down. Hard down recommended. \\
* Supervise * \\
* Zabbix / New Relic alerts. * \\
Tomcat restart: < 1 min. \\
Reindex: < 5 min. \\
On the slave server. \\

==== Key Commands ====
~/bin/tomcat_sss.sh stop|start \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
* Downtime imminent. * \\
~/bin/tomcat_sss.sh stop \\
~/bin/tomcat_sss.sh start \\

==== Scale ====
Discuss on the wiki: https://wiki.exphosted.com/doku.php/solr_configuration \\

==== Recurring Tasks ====
Issue an optimize statement nightly. \\

===== GlusterFS -- Streams, Webcam Recordings =====
==== GlusterFS Recovery ====
* Supervise * \\
Zabbix / New Relic alerts. \\
GlusterFS restart: < 1 min. \\

==== Key Commands ====
On the BBB server: \\
service glusterd start|stop|restart \\
mount \\
chown \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
service glusterd restart \\

==== Scale ====
GlusterFS's core is its distributed file system. Scaling IOPS is a matter of adding more servers and routing requests intelligently. \\
Wiki doc: TBD \\

===== GlusterFS -- Uploaded =====
On the DB server. \\

==== Key Commands ====
Same as above. \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
Same as above. \\

==== Scale ====
Same as above. \\

===== Memcached =====
Shared between app servers.
Hosted on App1. \\

==== Memcached Recovery ====
"God" controlled. "God" alerts. \\
God restart: < 1.5 min. \\

==== Key Commands ====
/deploy/systasks/god.sh start \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
* Critical: run as the regular user. * \\
/deploy/systasks/god.sh stop \\
/deploy/systasks/god.sh start \\

==== Scale ====
Dalli upgrade. \\

===== HAProxy =====
==== HAProxy Recovery ====
* Supervise * \\
Zabbix / New Relic alerts. \\
Restart: < 1 min. \\

==== Key Commands ====
service haproxy reload \\
service haproxy restart \\
service haproxy start \\
echo "COMMAND" | socat stdio /var/run/haproxy/haproxy.sock \\
Example -- to disable app server 2: \\
echo "disable server learnexa/prodapp02" | socat stdio /var/run/haproxy/haproxy.sock \\
To enable app server 2: \\
echo "enable server learnexa/prodapp02" | socat stdio /var/run/haproxy/haproxy.sock \\

==== Logs ====
Papertrail includes app logs and system logs. \\

==== Restart ====
service haproxy reload -- this ensures that currently connected clients are not interrupted. \\

==== Scale ====
DNS level, active-active instances. \\

====== C*($, my server crashed ======
===== Backup =====
All servers are backed up nightly. See the dedicated wiki article for the backup policy and step-by-step instructions. \\
Do an out-of-place restore first. Can't stress this enough: don't go the in-place route unless: \\
1) You backed up the server just before it went down. \\
2) You don't care (you should, btw) about data loss. \\

===== Opsource crashed =====
Backups are also delivered to the VMware server in SC. Details are included in the backup policy. \\
We are also working toward a DR site, so that will be a better option. \\
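Before starting any restore, confirm that last night's backup actually exists and is the one you expect. A minimal sketch -- the latest_backup helper and the /backup/nightly path are hypothetical placeholders; the real location is documented in the backup-policy wiki article:

```shell
# Hypothetical helper: print the newest file in a backup directory.
# The real backup location is documented in the backup-policy wiki article.
latest_backup() {
    ls -t "$1" 2>/dev/null | head -n 1
}

BACKUP_DIR="${BACKUP_DIR:-/backup/nightly}"   # placeholder path
LATEST=$(latest_backup "$BACKUP_DIR")
if [ -n "$LATEST" ]; then
    echo "Most recent backup: $LATEST"
else
    echo "No backup found in $BACKUP_DIR -- stop and escalate." >&2
fi
```

If the newest file is older than the last nightly run, stop and escalate before touching the crashed server.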