Serve Unresponsive with very high Disk Throughput

Keywords: WordPress Multisite - Google Cloud Platform - Technical issue - Other
bnsupport ID: 88feb93f-8296-0759-9efb-6382bd7c1631
Description:
Basically as the title says. My site became unresponsive at almost exactly 1pm local time. I’m guessing some service must have kicked in or maybe I was under attack?

My disk throughput on the read was constantly maxed out and it made it impossible to connect to the server on any port. I just kept getting connection timeouts on http, ssh and ftp. Tried ssh on both putty and the gcp’s browser based terminals.

I had no choice but to completely shut down the VM and restart it, which messed up my IP provisioning and a bunch of wordpress siteurl definitions in the mysql database (this is another can of worms I’ve been trying to figure out for a while).

This isn’t the first time this has happened either. Are there logs or something I can look at to figure out root cause?

I went through what I could see in /var/log/ but none of the logs there indicated anything happening at around that specific time.

Here’s a screenshot of my compute engine monitoring

Hello @dahr.chan,

From what Apache’s logs show, there are a couple of plugins with high activity in the timeframes you specify:

[Fri Mar 12 12:29:33.653848 2021] [pagespeed:warn] [pid 27785:tid 140056098223872] [mod_pagespeed 1.13.35.2-0 @27785] RateController: drop deferred fetch of https://10.180.0.2/wp-content/plugins/event-tickets/common/src/resources/css/common-full.min.css?ver=4.12.15.1 on shutdown
[Fri Mar 12 12:29:33.653879 2021] [pagespeed:warn] [pid 27785:tid 140056098223872] [mod_pagespeed 1.13.35.2-0 @27785] RateController: drop deferred fetch of https://10.180.0.2/wp-content/plugins/the-events-calendar/vendor/bootstrap-datepicker/css/bootstrap-datepicker.standalone.min.css?ver=5.3.1.1 on shutdown
[Fri Mar 12 12:29:33.653909 2021] [pagespeed:warn] [pid 27785:tid 140056098223872] [mod_pagespeed 1.13.35.2-0 @27785] RateController: drop deferred fetch of https://10.180.0.2/wp-content/plugins/event-tickets/common/vendor/tooltipster/tooltipster.bundle.min.css?ver=4.12.15.1 on shutdown
[Fri Mar 12 12:29:33.653939 2021] [pagespeed:warn] [pid 27785:tid 140056098223872] [mod_pagespeed 1.13.35.2-0 @27785] RateController: drop deferred fetch of https://10.180.0.2/wp-content/plugins/the-events-calendar/src/resources/css/views-skeleton.min.css?ver=5.3.1.1 on shutdown
[Fri Mar 12 12:29:33.653967 2021] [pagespeed:warn] [pid 27785:tid 140056098223872] [mod_pagespeed 1.13.35.2-0 @27785] RateController: drop deferred fetch of https://10.180.0.2/wp-content/plugins/the-events-calendar/src/resources/css/views-full.min.css?ver=5.3.1.1 on shutdown

Do these two plugins have a configuration or use a cronjob to be run at those hours? I would recommend disabling them and checking whether the performance issues persist. Apart from that, you can also try increasing your instance memory, as a High Disk throughput often comes with high memory use and that should be what is making your machine freeze.

I haven’t seen anything pointing to a bots attack though you can take a look at our Deny Connections From Bots/Attackers Using Apache guide and check whether there is a high number of accesses from a single IP after your peaks.

Regards,
Francisco de Paz

I don’t think they run on cron. When I look at grep CRON /var/log/syslog I only see hourly instances of:

Mar 16 14:17:01 maker-cube-multisite-vm CRON[15014]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 16 14:24:01 maker-cube-multisite-vm CRON[15276]: (bitnami) CMD (cd /opt/bitnami/stats && ./agent.bin --run -D)

and my /etc/cron.hourly is empty

I use cloudflare for some protection and I can see on that day I had about 44k uncached requests which is double if not triple a normal day. Looking further back I did also have 190k requests on the first day of the month =/

When I compare that day to my Jetpack stats, I only actually had 111 page views amongst 40 vistors (I don’t have that much traffic on my site and don’t expect it be busy).

I recall getting pagespeed warnings a lot in the past too (that’s basically the majority of the Apache logs…) Going further into those logs, I can see that there are about 100 errors with the same timestamp and based on the images it’s loading appears to be 7 different event pages being fetched. Half of these requests alone are images which I can assume is just massive bandwidth.

I’m not entirely sure how related this is since this happens at ~12:29 but from the GCP charts you can see things start to fall apart closer to 1 pm.

The last errors on the log during this high peak and before I shut it down are:

[Fri Mar 12 12:50:09.042774 2021] [pagespeed:error] [pid 4930:tid 140056106616576] [mod_pagespeed 1.13.35.2-0 @4930] PageSpeed Serf fetch failure rate extremely high; only 0 of 671 recent fetches fully successful; is fetching working?
[Fri Mar 12 13:22:40.829719 2021] [pagespeed:error] [pid 4930:tid 140056106616576] [mod_pagespeed 1.13.35.2-0 @4930] PageSpeed Serf fetch failure rate extremely high; only 0 of 951 recent fetches fully successful; is fetching working?
[Fri Mar 12 13:55:27.952452 2021] [pagespeed:error] [pid 4930:tid 140056106616576] [mod_pagespeed 1.13.35.2-0 @4930] PageSpeed Serf fetch failure rate extremely high; only 0 of 736 recent fetches fully successful; is fetching working?
[Fri Mar 12 14:37:02.075591 2021] [pagespeed:error] [pid 4930:tid 140056106616576] [mod_pagespeed 1.13.35.2-0 @4930] PageSpeed Serf fetch failure rate extremely high; only 0 of 611 recent fetches fully successful; is fetching working?
[Fri Mar 12 15:08:49.787634 2021] [pagespeed:error] [pid 28315:tid 140056106616576] [mod_pagespeed 1.13.35.2-0 @28315] PageSpeed Serf fetch failure rate extremely high; only 0 of 843 recent fetches fully successful; is fetching working?
[Fri Mar 12 15:38:53.340430 2021] [pagespeed:error] [pid 4930:tid 140056106616576] [mod_pagespeed 1.13.35.2-0 @4930] PageSpeed Serf fetch failure rate extremely high; only 0 of 546 recent fetches fully successful; is fetching working?

haha… wow.

Could it be a bot crawling my site?

I’m looking at the access logs and checking the IPs. The last couple to request is Cloudflare itself.

Not just that, but it appears to be requesting /wp-cron.php?doing_wp_cron=xxxxxx

I went through my wp-cron to see if there was anything in particular it was trying to run, but I have no idea what the xxxx code at the end means.

An example of the request in the access logs is like:
108.162.215.8 - - [12/Mar/2021:12:55:21 +0000] “POST /wp-cron.php?doing_wp_cron=1615553720.8986599445343017578125 HTTP/1.1” 200 -

I’ve attached an image of the events in wp-cron.

Hello @dahr.chan,

There is a thread in the official forums talking about the action_scheduler_run_queue event that could be of help. I would also try disabling your active plugins to ascertain whether the issue is caused by a plugin.

Regards,
Francisco de Paz

This is a production server and I can’t disable those plugins, they run most of the show.

From what I can tell action_scheduler is a woocommerce event running every minute which in itself has a schedule of events that are running every few minutes. The majority of those events are aioseo_image_sitemap_scan.

What I’m trying to understand is whether this is typical or not. The developers claim this shouldn’t be a performance hog and is normal to have it run regularly. If it’s disabled it would break certain features.

I posted into the wordpress forums to figure out if it’s poor implementation of wp-cron itself, or the stuff inside it. It could also be some weird effect of cloudflare doing a POST to wp-cron (why would it need to?)

The other possibility is that all this is normal and I haven’t allocated enough resources to the server and it’s choking when a bot crawls it. I’m currently only running with a single vCPU and 1.7GB of memory.

I finally recently got GCP monitoring set up and I can see that my memory utilization on regular usage is constantly above 75%. I didn’t think my site was that overly complex with not too many users on it. Wondering if this amount of resource usage is normal and I should increase my allocation so it has room to breathe or if this is abnormally high and some process ruining ts performance.

I did a bit of poking around and ran the Top command in CLI. Found that mysql is using 43% of the memory all the time.
I tried digging further into what exactly it was doing but didn’t find much. There weren’t any active queries. I tried running

mysql> SHOW ENGINE INNODB STATUS\G

but I don’t know how to interpret the response. I can dump it here if it helps.

Not sure if it’s related but I’m getting a lot of pagespeed fetch warnings

[Wed Mar 17 18:28:54.422457 2021] [pagespeed:error] [pid 17709:tid 139621987768064] [mod_pagespeed 1.13.35.2-0 @1770PageSpeed Serf fetch failure rate extremely high; only 0 of 713 recent fetches fully successful; is fetching working
[Wed Mar 17 18:59:05.074623 2021] [pagespeed:error] [pid 3835:tid 139621979375360] [mod_pagespeed 1.13.35.2-0 @3835]geSpeed Serf fetch failure rate extremely high; only 0 of 1077 recent fetches fully successful; is fetching working?
[Wed Mar 17 19:25:08.728197 2021] [pagespeed:error] [pid 6143:tid 139622373832448] [mod_pagespeed 1.13.35.2-0 @6143]ow ReadFile operation on file /opt/bitnami/apache2/var/cache/mod_pagespeed/!clean!time!: 59.303ms; configure SlowFiltencyUs to change threshold\n
[Wed Mar 17 19:31:05.274549 2021] [pagespeed:error] [pid 3835:tid 139621979375360] [mod_pagespeed 1.13.35.2-0 @3835]geSpeed Serf fetch failure rate extremely high; only 0 of 745 recent fetches fully successful; is fetching working?

I managed to get into the pagespeed console and it’s not very pretty…

Some quick initial searches indicate misconfigured http->https redirection or missing includes

Still investigating…

I ended up expanding my memory allocation to 2.75GB on the instance. During that time I had to restart the VM. Prior to shutdown Mysqld was using 750mb of memory, after restart it was using around 450mb.

With regards to the pagespeed issue, after restart I didn’t see much change. On closer inspection ALL fetches are failing so there’s probably something wrong with it’s config.

Hello @dahr.chan,

Thanks for sharing with everyone your investigation! The drop in memory usage by MySQL could be coming from zombie processes and a restart helped with them. Regarding pagespeed, it is known to be unstable in some cases and a solution for this would be disabling pagespeed’s module and opting for a cache-related WP plugin.

Do not doubt to check back with us if you experience a significant increase in memory usage again!

Regards,
Francisco de Paz

After about 12 hours mysqld grew back to about 600mb and after several days it’s back up to 750mb. I’m guessing that’s just how much it needs with the way my site is set up.

I did disable pagespeed so I guess I don’t have to worry about any errors there now.

I ended up choosing WP-Optimize instead for local caching, but now have to deal with some php errors from that plugin.

Not sure if it’s because I started to clear out most of my errors or these just started to occur while I was troubleshooting but I noticed these errors started to roll in on my log.

[Sun Mar 21 10:04:05.740413 2021] [authz_core:error] [pid 20240:tid 139858756364032] [client 117.175.144.238:44310] AH01630: client denied by server configuration: /opt/bitnami/apps/phpmyadmin/htdocs/

[Wed Mar 17 08:56:18.534845 2021] [authz_core:error] [pid 4540:tid 139622508115712] [client 116.59.26.211:65151] AH01630: client denied by server configuration: /opt/bitnami/apps/phpmyadmin/htdocs/

Is this what an attack looks like? I’m assuming it’s an attack based on the fact they’re trying to get into phpmyadmin. It shows up about 100 times in the span of 1 second. The client IP is based out of the Asia Pacific network with no other information and I had a similar instance of reports from a few days earlier by a different IP also from the same network. Not sure if I can block it because it doesn’t appear to be from a specific country.

I guess I shouldn’t really be that worried since phpmyadmin is accessible by local connection only.

It does make me wonder what else could be vulnerable though. I already geoblocked a few countries through cloudflare and put a JS Challenge on anything outside my country. I didn’t think they’d be able to ping my server for phpmyadmin in the first place.

Hello @dahr.chan,

From what you are explaining, it seems indeed like a bot attack and blocking access from those IPs should suffice. As you are using Cloudflare, I recommend configuring access using it but we have a guide on how to Deny Connections From Bots/Attackers Using Apache if you need to or just want to browse it.

Regarding your concern, you don’t need to worry as every service is indeed only accessible by localhost, and external access is regulated by their configuration.

Regards,
Francisco de Paz

Yeah, it’s hard to tell if it’s a bot or not. The JS challenges should be preventing them from hitting it.

I have another support thread on the Cloudflare forums and they suggested the bots/attackers are hitting the IP address directly. They suggested that I configure my apache to whitelist requests from only Cloudflare server IPs and block everything else. Is this advisable?

I don’t really quite understand how traffic is distributed with cloudflare and if all of it goes through it or if I should expect regular traffic to somehow bypass it.

I can’t really blacklist IPs as per the guide you linked because it appears to be a different IP attacking every time.

Hello @dahr.chan,

It shouldn’t be a problem, as all the traffic is passing through Cloudflare. It is possible you may need to make changes to your SSL configuration though. If so, please check our Enable SSL Access Over HTTPS With Cloudflare and Troubleshoot SSL Issues.

If you encounter any issues configuring WordPress with Cloudflare, please open a new thread so that we can keep a 1 thread=1 issue organization to help the community find similar solutions easily.

Regards,
Francisco de Paz