scalable_delayed_job

Problem Statement

Right now, content conversion (i.e. document conversion to PDF, SWF and PNG, or video conversion to various formats) is done by queuing the conversion in delayed job. The delayed job instance runs on the application server, which is also responsible for serving web requests. When larger files are uploaded, the conversion uses up the bulk of the memory and CPU, so the application server is unable to process web requests.

Required solution

To devise a scalable solution that distributes the processing over different servers:

1) To make sure that long-running or memory-intensive jobs do not affect the web request/response process.

2) To make sure that the conversion process does not require a single server with very large memory and CPU capacity. Rather, the processing load should be distributed among multiple servers.

Solution

Refer to the diagram below

  • Unlike today, where the delayed job instance runs on the application server itself, we will have separate machines that host only the delayed job instance (call them worker servers).
  • The shared filesystem will be available to the application/web server as well as the worker servers.
  • Delayed job instances will not run on the application/web server. This means that the web server machine's only responsibility will be to serve web requests.
  • The clocks on all worker servers must be kept in sync.
  • Since the job queue is maintained in the database (which is central), only one worker will lock and process a given job, even though multiple worker instances are running. If another job is added to the queue in the meantime, another worker instance will pick it up.
  • Note that the worker servers need not run a web server (Apache, Mongrel, etc.). They only need the Rails application and its dependencies (i.e. Ruby and the Rails framework). Hence the only job of the worker servers will be to process the queued jobs.
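
The locking behaviour described in the list above can be sketched in plain Ruby. This is a minimal sketch, not delayed job's actual implementation: JobTable, claim and the lock timeout are hypothetical names, an in-memory array stands in for the central jobs table, and a Mutex stands in for the database's atomic update. The second half illustrates why the worker clocks should be in sync, under the assumption that a lock is treated as stale once it is older than a timeout measured on the claiming worker's own clock.

```ruby
# Hypothetical in-memory stand-in for the central jobs table.
# The Mutex plays the role of the database's atomic update.
class JobTable
  Job = Struct.new(:id, :locked_by, :locked_at)

  def initialize(ids, lock_timeout)
    @jobs = ids.map { |id| Job.new(id, nil, nil) }
    @lock_timeout = lock_timeout # seconds after which a lock counts as stale
    @mutex = Mutex.new
  end

  # Atomically claim the first job that is unlocked, or whose lock looks
  # expired according to the caller's own clock (`now`, in seconds).
  # Returns the job id, or nil when nothing can be claimed.
  def claim(worker, now)
    @mutex.synchronize do
      job = @jobs.find do |j|
        j.locked_by.nil? || now - j.locked_at > @lock_timeout
      end
      if job
        job.locked_by = worker
        job.locked_at = now
        job.id
      end
    end
  end
end

# Four workers race for ten jobs; every job is claimed by exactly one worker.
table = JobTable.new((1..10).to_a, 3600)
threads = 4.times.map do |n|
  Thread.new do
    mine = []
    while (id = table.claim("worker-#{n}", 0))
      mine << id
    end
    mine
  end
end
claimed_ids = threads.flat_map(&:value)

# Clock skew: a worker whose clock runs two hours ahead sees a fresh lock
# as stale and takes a job that another worker is still processing.
skewed = JobTable.new([1], 3600)
first  = skewed.claim("worker-a", 1000)        # => 1
second = skewed.claim("worker-b", 1000 + 60)   # => nil (lock is fresh)
third  = skewed.claim("worker-c", 1000 + 7200) # => 1 (lock looks stale)
```

In the first part no job id appears twice in claimed_ids, which is the property the central queue is meant to provide. In the second part the same job is handed out twice, which is the kind of duplicate processing that synced clocks help avoid.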

Comments

The following should be considered when scaling up this setup:

1. Most of the large files are likely to be videos, and video conversion is CPU-heavy. So we need machines with at least 8 or 12 CPU cores.

2. We need to add one more slave database, and scale the master if needed, so that replication runs smoothly.

3. The distributed file system must scale well enough to handle all the reads and writes, as about 5 machines will be hitting it at all times. We are considering Hadoop DFS as of now. It seems good, but needs to be fine-tuned as far as possible.
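
For the CPU sizing in point 1, a back-of-the-envelope estimate can be sketched as follows. The helper cores_needed and all input figures are hypothetical and purely illustrative, not measurements from this system. The sketch assumes each conversion keeps one core busy for a known number of CPU seconds and that the machines should not run above a target utilization.

```ruby
# Hypothetical helper: total cores required to keep up with the queue.
#   jobs_per_hour       - conversions arriving per hour (assumed)
#   cpu_seconds_per_job - CPU time one conversion consumes (assumed)
#   target_utilization  - fraction of each core we are willing to use
def cores_needed(jobs_per_hour, cpu_seconds_per_job, target_utilization)
  busy_core_seconds = jobs_per_hour * cpu_seconds_per_job
  (busy_core_seconds / (3600.0 * target_utilization)).ceil
end

# Illustrative numbers only: 20 videos per hour at 15 CPU minutes each,
# keeping cores at or below 70% busy.
estimate = cores_needed(20, 900, 0.7) # 18000 / 2520 is about 7.14, so 8
```

Real figures for job arrival rate and per-job CPU time would have to be measured before choosing between 8-core and 12-core machines.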

scalable_delayed_job.txt · Last modified: 2018/08/31 16:16 (external edit)