===== Load Balancer Operation in Maintenance Context =====

  * This article has been modified. To view the older version, use the "Old revisions" button and select revision #8 - 2011/08/03 04:18 loadbalancer_maintenance_page – niren

==== Regarding the LB ====

HAProxy is the software load-balancing backbone of the Learnexa environment. \\ HTTP requests (and TCP for chat) hit HAProxy directly, and it proxies them to an available backend app server.

==== Current Configuration ====

Two application servers (each with multiple roles) sit behind HAProxy.

==== Normal Operation ====

Under normal operation, HAProxy performs health checks by probing the web servers with HTTP HEAD requests. A HEAD request returns only the HTTP status code for the given URL and does not load the entire page, making the probes both efficient and relevant. \\ Efficient -> HTTP HEAD (vs. GET) has minimal impact on the app stack. \\ Relevant -> the probe still goes through the Apache -> Passenger stack, verifying that both Apache AND Passenger are functioning. \\ HAProxy expects an HTTP status code of 200 to keep a server enabled (alive). After 2 consecutive failures, the load balancer disables the server from receiving external requests; it re-enables the server on receiving a single successful response.

==== During Code Deployment ====

runcap.sh with the deploy option selected. \\ When the deployment starts, iptables is configured on all servers to reject requests from the load balancer on port 80. With all servers down and no backends left to proxy to, HAProxy shows a maintenance page.

==== Post Code Deployment ====

runcap.sh with the deploy:web:enable option. \\ The iptables block rule is removed from each server. As soon as HAProxy senses the first server is up, it proxies requests to that server instead of showing the maintenance page. \\ \\ ** What follows is the future (soon-to-be) workflow. This requires extensive changes to the current deploy file (which are in the works).
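The health-check behavior described under Normal Operation (HEAD probes, two consecutive failures to disable, one success to re-enable) maps onto an HAProxy backend roughly like the sketch below. The backend name, server names, and addresses are placeholders, not our actual configuration:

```
backend learnexa_app
    # HEAD probe: returns only the status code, but still exercises
    # the full Apache -> Passenger stack
    option httpchk HEAD /
    # fall 2: mark down after 2 failed checks
    # rise 1: re-enable after 1 successful check
    server app1 10.0.0.11:80 check fall 2 rise 1
    server app2 10.0.0.12:80 check fall 2 rise 1
```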
\\ **

==== Rolling Upgrades Workflow ====

==== During Code Deployment (without DB changes) ====

runcap.sh with the deploy option selected. \\ When the deployment starts, iptables is configured to reject all requests from the load balancer on port 80. This triggers the LB to mark that server down, taking it out of rotation while all other servers continue serving requests.

==== Post Code Deployment ====

runcap.sh with the deploy:web:enable option. \\ The iptables block rule is removed from each server; HAProxy senses they are up and resumes normal operation.

==== Zero-Downtime Deploys Using Rolling Restarts ====

=== Crossbow Builds ===

Since Capistrano executes deploy commands in parallel on the app servers, the steps described in the During Code Deployment section above give users a suboptimal experience: they see the maintenance page. \\ For production deploys that do not involve any DB schema changes, we should be able to deploy new builds without any downtime. Phusion Passenger caches the Rails framework and application code in memory, so each server only needs a Passenger restart to pick up new code: \\ 1) Do not disable the web server before a code update. \\ 2) Modify the restart task so that it executes sequentially, i.e. each app server is disabled via iptables and its Passenger process restarted one by one, using find_servers(:roles => :app).each do |server|. This task includes making the first web request (to warm up the application) and verifying that the homepage loads with a 200 and contains a unique string found in the homepage HTML. \\ Deploys with DB schema changes will still require putting up a maintenance page unless: \\ 1) the DB schema changes are backwards compatible, or \\ 2) the existing DB infrastructure is extended to multi-master. \\ Given the current traffic, it is confirmed that putting up a maintenance page is the way to go.

=== BBB Deploys ===

BBB deploys need to be handled.
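The sequential restart task sketched in step 2 of the Crossbow Builds section above could look roughly like the following Ruby. This is a hedged sketch, not our actual deploy file: ''run_on'' and ''fetch_homepage'' are assumed helpers (one runs a shell command on a host, the other returns the homepage status and body), and the load-balancer IP and paths are placeholders.

```ruby
# Hypothetical sequential (rolling) restart, one app server at a time.
# Assumptions: run_on(host, cmd) executes a shell command on that host;
# fetch_homepage(host) returns [http_status, body_string].
LB_IP = '10.0.0.1' # placeholder load-balancer address

# Warm-up verification: the homepage must return 200 and contain a
# unique marker string emitted only by the freshly deployed build.
def homepage_healthy?(status, body, marker)
  status == 200 && body.include?(marker)
end

def rolling_restart(servers, marker)
  servers.each do |server|
    # Take this one server out of rotation at the load balancer.
    run_on(server, "iptables -I INPUT -p tcp --dport 80 -s #{LB_IP} -j REJECT")
    # Ask Passenger to restart the application on the next request.
    run_on(server, 'touch /var/www/app/current/tmp/restart.txt')
    # The first request warms up the application; verify before re-enabling.
    status, body = fetch_homepage(server)
    raise "warm-up failed on #{server}" unless homepage_healthy?(status, body, marker)
    # Remove the block; HAProxy re-enables the server after one good check.
    run_on(server, "iptables -D INPUT -p tcp --dport 80 -s #{LB_IP} -j REJECT")
  end
end
```

Because only one server is ever out of rotation, the remaining servers keep serving traffic and the maintenance page is never shown.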
Specifically, there needs to be an API lookup that returns currently running and scheduled meetings before a deploy is initiated. \\ An alternative mechanism would be to schedule deploy windows and prevent end users from scheduling a meeting around that time (+1 hour / -1 hour); however, this results in a poor user experience.

==== Rollback ====

Rollback in this situation still happens in parallel.
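The meeting lookup described in the BBB Deploys section above could be sketched against BigBlueButton's getMeetings API call, which signs each request with a SHA-1 checksum over the call name, query string, and shared secret. The server URL and secret below are placeholders, and the deploy-gating logic is an assumption, not an existing task:

```ruby
require 'digest/sha1'

# Placeholder BBB endpoint and shared secret (not real credentials).
BBB_URL    = 'https://bbb.example.com/bigbluebutton/api'
BBB_SECRET = 'changeme'

# BBB signs each API call as SHA1(callName + queryString + sharedSecret).
def bbb_checksum(call_name, query, secret)
  Digest::SHA1.hexdigest(call_name + query + secret)
end

# Signed URL for the getMeetings call (empty query string here).
def get_meetings_url
  query = ''
  "#{BBB_URL}/getMeetings?checksum=#{bbb_checksum('getMeetings', query, BBB_SECRET)}"
end
# A pre-deploy wrapper would fetch this URL, parse the returned meetings
# XML, and abort the deploy if any meeting is running or scheduled soon.
```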