How a Site Outage Caused by Heavy Traffic Was Handled
The News Minute (TNM) is a digital news platform headquartered in Bengaluru, India, that reports on issues across India with a specific focus on the Deccan plateau. Its content includes news reporting, editorial pieces, and blogs. TNM is one of the few English-language online news publishers that focuses primarily on South Indian news for readers across the globe, and it is popular among its readers. Its growing popularity posed new challenges, and to address them TNM approached Zyxware Technologies.
As the media site grew more popular, it ran into problems caused by heavy traffic, and the team was ill-equipped to manage the increasingly frequent server outages. The site administrators noticed that whenever the number of simultaneous users spiked, the server crashed immediately and had to be restarted, resulting in a site outage. This happened even when a breaking news story attracted only a modest surge in traffic. Continued outages like this would have eroded the site's popularity if they had not been addressed immediately. The team was also looking to improve the site architecture so that it could scale up and grow its user base once the existing problems were resolved.
After reviewing the state of the news portal, Zyxware decided to carry out a thorough analysis of the site, with performance testing as the prime focus. The performance testing aimed to locate the bottlenecks in the system, use that data to improve scalability and performance, and establish a benchmark for site performance.
To avoid causing outages on the live site during performance testing, a staging server was created that replicated the production environment, and the problems were identified there.
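The source does not name the tool used for the performance tests. As an illustration only, the following is a minimal Python sketch of how a burst of simultaneous users could be simulated against a staging server. The function names, the result fields, and the URL in the usage note are hypothetical, and a dedicated load-testing tool would normally be preferable for a real benchmark.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Fetch a URL once; return (HTTP status or None on failure, seconds taken)."""
    start = time.perf_counter()
    try:
        with urlopen(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except Exception:
        # Connection errors, timeouts, and HTTP error responses all count as failures.
        status = None
    return status, time.perf_counter() - start


def run_load_test(url, concurrent_users, fetch=fetch_status):
    """Issue one request per simulated user in parallel and summarize the results."""
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        results = list(pool.map(fetch, [url] * concurrent_users))
    durations = sorted(seconds for _, seconds in results)
    failures = sum(1 for status, _ in results if status != 200)
    return {
        "requests": len(results),
        "failures": failures,
        "median_seconds": durations[len(durations) // 2],
        "max_seconds": durations[-1],
    }
```

A call such as `run_load_test("https://staging.example.com/", 100)` (a placeholder address) would send 100 parallel requests and report how many failed and how long they took. It should only ever be pointed at a staging server, never at production.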
Based on the analysis, Zyxware found that the site ran on a server with a high-end configuration whose capacity was not being fully utilized. An error in the Nginx configuration was also causing Nginx to write to its error logs.
On the Drupal side, the analysis revealed that although Drupal caching was enabled, it was not taking effect because of a session variable issue. Page-level, block-level, and view-level caching were disabled. The key cause of the server load was identified as a high volume of database inserts and writes resulting from improper code, site configuration, and access log creation. The cron system on the Drupal 7 site was also found to be configured incorrectly.
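One way to spot this kind of caching problem from the outside is to inspect the response headers served to an anonymous visitor. Drupal 7 adds an `X-Drupal-Cache: HIT` or `X-Drupal-Cache: MISS` header when its page cache handles a request, and it skips the page cache for requests that carry a session cookie (named `SESS…` or `SSESS…`). The Python sketch below classifies a response on that basis. It is not the diagnostic procedure used in the project, which the source does not describe; the function name and the returned labels are hypothetical.

```python
import re

# Drupal 7 session cookies are named SESS<hash>, or SSESS<hash> over HTTPS.
SESSION_COOKIE = re.compile(r"\bS?SESS[0-9a-zA-Z_-]+=")


def diagnose_page_cache(headers):
    """Classify an anonymous response from its headers (dict of name -> value)."""
    normalized = {name.lower(): value for name, value in headers.items()}
    # A session cookie issued to an anonymous visitor means later requests
    # from that visitor will bypass the Drupal page cache.
    if SESSION_COOKIE.search(normalized.get("set-cookie", "")):
        return "session-started"
    cache_state = normalized.get("x-drupal-cache", "").upper()
    if cache_state == "HIT":
        return "cache-hit"
    if cache_state == "MISS":
        return "cache-miss"
    # No header at all: the page cache did not handle this request.
    return "cache-inactive"
```

Requesting the same page twice without cookies and getting anything other than `cache-hit` on the second request would suggest that page caching is not taking effect.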
Overall, the analysis made it clear that the existing system was capable of handling the current load and stress but failed for lack of optimization. The configuration and implementation of the system needed fine-tuning to get the maximum output. The stress test on the staging environment showed that the existing system could handle up to 1500 active users or more once the server was fine-tuned.
Additionally, Zyxware recommended enabling Nginx caching and moving to a distributed environment to handle future traffic and reduce server costs. Using a CDN or an Amazon S3 bucket was also advised as an immediate measure to keep very large traffic from loading the server.
Platforms and Tools Used
- Ubuntu Linux
- Nginx server
- Drupal 7
Nginx cache clear - A new Drupal 7 module released as part of this project that allows the Nginx cache files to be controlled from within Drupal itself.
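The source does not describe how the module works internally. For illustration, the Python sketch below shows one way Nginx cache files can be located and removed on disk: Nginx names each cache file after the MD5 hash of its cache key and, when `levels` is set on the cache path, nests it in subdirectories taken from the end of that hash. The function names are hypothetical. The cache directory, the `levels` value, and the cache key format all depend on the server's own Nginx configuration and are assumptions here.

```python
import hashlib
from pathlib import Path


def nginx_cache_file(cache_dir, cache_key, levels=(1, 2)):
    """Return the path where Nginx stores the cached response for cache_key."""
    digest = hashlib.md5(cache_key.encode("utf-8")).hexdigest()
    path = Path(cache_dir)
    end = len(digest)
    # Each level takes the next group of characters from the end of the hash,
    # e.g. levels=1:2 turns "...5029c" into the subdirectories "c/29".
    for width in levels:
        path = path / digest[end - width:end]
        end -= width
    return path / digest


def purge_cached_page(cache_dir, cache_key, levels=(1, 2)):
    """Delete the cache file for cache_key; return True if a file was removed."""
    cache_file = nginx_cache_file(cache_dir, cache_key, levels)
    if cache_file.is_file():
        cache_file.unlink()
        return True
    return False
```

The key passed in must match exactly what the server's cache key directive produces for the page, and the process doing the deletion needs write access to the cache directory.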
Although the initial scope covered only analysis, we went ahead and implemented certain recommendations that would bring immediate improvements with minimal effort. These included resolving the issues with page caching and view caching, disabling and truncating the access logs, and enabling Nginx caching. The changes yielded positive results: the site became light and swift, and the grave issue of the site crashing at a mere increase in visitors was resolved.
The effect of these changes was visible on the site immediately. The site passed its first real-world stress test soon afterward, when 9543 users accessed it simultaneously right after a breaking news story was published.