How can I troubleshoot Umbraco/dotnet refusing connections? Does Umbraco rate limit by default?

TackleMcClean 🏅

08/14/2024, 3:55 PM

We have a large site with thousands of pages. It previously ran on linux dotnet on an Ubuntu VPS and now also runs on linux dotnet, but in Azure. Our separate Nextjs frontend makes requests to Umbraco, both to UHeadless graphql and to `/media` images for Nextjs next-image consumption. We seem to overload Umbraco/dotnet very often, resulting in Nextjs `fetch` requests that return:

TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11372:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async executor (webpack-internal:///./src/functions/graphql/getRemoteSchema.js:16:29)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 0)
    at async Promise.all (index 3)
    at async Promise.all (index 4)
    at async Promise.all (index 124) {
  cause: Error: connect ECONNREFUSED 192.168.65.254:8080
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '192.168.65.254',
    port: 8080
  }
}

This is actually from our local dev environment, so we can recreate it, but prod shows similar errors as well. Does anyone know how to troubleshoot something like this? I don't know whether dotnet or Umbraco rate limits requests by default. Our RAM and CPU consumption is fine, below 50% on both. There is no error output from the `dotnet run` process, and no errors in `umbraco/Logs` or the Log viewer either. We are at a loss with these intermittent site outages. Any suggestion is appreciated!

D_Inventor

08/14/2024, 4:57 PM

Can you also reproduce this behaviour without nextjs?

bielu

08/14/2024, 5:24 PM

Is it on Azure Web Apps? If so, you can only solve it by using a vnet. I experienced quite a lot of issues when using Nuxt SSR with a .NET API on Azure: there is a limit on connections through the public API, and the only workaround is to use vnets 😬

TackleMcClean 🏅

08/14/2024, 6:42 PM

Yes we're actually already using a vnet, so I don't think we've run into that issue. Plus we can trigger the error on local dev machines.

TackleMcClean 🏅

08/14/2024, 6:42 PM

Great question, I'm trying to set up a somewhat isolated scenario. Have to have some form of automation to make all the requests. Will try to verify this.

D_Inventor

08/14/2024, 6:45 PM

Do you know k6? I use it if I want to fire a lot of requests to a server. It's a javascript based cli tool: https://k6.io/

TackleMcClean 🏅

08/14/2024, 6:46 PM

Haven't tried that. I was going to try out if Postman can trigger lots of requests, and I've also used Locust, a python based load tester

bielu

08/14/2024, 6:51 PM

@TackleMcClean 🏅 are you sure you set up the app to force all traffic through the vnet? By default it will use the public network even if a private one is available.

TackleMcClean 🏅

08/14/2024, 6:52 PM

Hmm, then probably not. We've only made sure the frontend and backend share a vnet. I'll investigate this once I've exhausted the local machine avenues of investigation.

TackleMcClean 🏅

08/14/2024, 6:52 PM

Good tip!

bielu

08/14/2024, 6:53 PM

I spent a good amount of time debugging it just to learn that. I don't remember the setting now, but yeah, it was painful to find.

bielu

08/14/2024, 7:19 PM

It was WEBSITE_VNET_ROUTE_ALL=1, but I can see Azure has new, not well documented options for it 😂😂

TackleMcClean 🏅

08/14/2024, 7:36 PM

Trying Locust to fetch both my sitemap and a random image, I can trigger failed requests, and the result clearly seems to be a "connection refused". This is not a 1:1 scenario, and in this case my desktop is running at 100% CPU, so it's very synthetic, but it does tell us one thing: if the system is overloaded, you might see connection refused as a response. Still, the production environment rarely goes over 50% load, but maybe it's a matter of us DDoSing ourselves...
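
For anyone wanting to reproduce this kind of test without installing a load tester, a rough standard-library sketch follows. It is not the Locust setup used above. It fires concurrent requests and tallies how each one ended, so refused connections show up separately from timeouts and HTTP errors. The function names `classify` and `hammer` are made up for illustration.

```python
import collections
import concurrent.futures
import socket
import urllib.error
import urllib.request

# Ignore any system proxy settings so requests go straight to the target.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def classify(url: str, timeout: float = 5.0) -> str:
    """Fetch one URL and reduce the outcome to a short label."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return f"http_{resp.status}"
    except urllib.error.HTTPError as exc:  # must precede URLError (subclass)
        return f"http_{exc.code}"
    except urllib.error.URLError as exc:
        # Connect-time failures arrive wrapped; the real cause is in .reason.
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"error_{type(exc.reason).__name__}"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except Exception as exc:  # anything else, e.g. a reset mid-response
        return f"error_{type(exc).__name__}"


def hammer(urls, total=200, workers=50):
    """Fire `total` requests across `workers` threads and count the outcomes."""
    targets = [urls[i % len(urls)] for i in range(total)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return collections.Counter(pool.map(classify, targets))
```

Calling something like `hammer(["http://localhost:8080/sitemap"], total=500, workers=100)` returns a `Counter` of outcome labels. The `/sitemap` path is a placeholder, so substitute real URLs from your own site.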

TackleMcClean 🏅

08/14/2024, 7:36 PM

I'm just so confused as to why the dotnet process doesn't indicate that this is happening at all; it just fails quietly.

TackleMcClean 🏅

08/14/2024, 7:39 PM

https://cdn.discordapp.com/attachments/1273308975709814857/1273365386389360702/image.png?ex=66be596e&is=66bd07ee&hm=2070a76a9349efd5ccf98cac9edb3f9d82349a87f74ace8aa5c5532629818394&

TackleMcClean 🏅

08/14/2024, 8:10 PM

If CPU is running at 100%, should I expect Umbraco to drop connections? Or should I expect asp.net to drop connections? Or not at all?

Nik

08/14/2024, 11:13 PM

In that scenario, I would expect the OS or dotnet to be unable to accept the incoming connections, so they are just "lost". The OS has no resources to give to the incoming connection, so it just fails. I've seen this before when load testing sites and one or more resources on the host max out.

TackleMcClean 🏅

08/15/2024, 6:19 AM

Is this the expected approach though? I would expect it to simply hold these connections in a queue and then process them one by one. Would putting a load balancer like nginx in front solve the issue maybe?
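
On the queue question: at the TCP level such a queue does exist, in the form of the listen backlog. A minimal sketch, using a plain Python socket rather than Kestrel, shows the OS completing handshakes for a listener that never calls `accept()`. The function name and the backlog value are assumptions for illustration and say nothing about Kestrel's actual settings.

```python
import socket


def queued_without_accept(backlog: int = 8, clients: int = 3) -> list[int]:
    """Open `clients` TCP connections to a listener that never calls accept().

    Returns the connect_ex() result for each client. 0 means the handshake
    completed, i.e. the OS queued the connection on the server's behalf.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(backlog)         # pending connections wait in this queue
    port = server.getsockname()[1]

    results, conns = [], []
    try:
        for _ in range(clients):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client.settimeout(2.0)  # don't hang if the queue is full
            conns.append(client)
            results.append(client.connect_ex(("127.0.0.1", port)))
    finally:
        for client in conns:
            client.close()
        server.close()
    return results
```

The queue is finite, though. The `listen` documentation on both Linux and Windows notes that a client may receive a connection-refused error when a request arrives while the queue is full, so a server that cannot call `accept()` fast enough could plausibly produce refusals without logging anything itself.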

kdx-perbol

08/15/2024, 7:06 AM

`WEBSITE_VNET_ROUTE_ALL` routes all outgoing traffic through the virtual network. https://learn.microsoft.com/en-us/azure/app-service/configure-vnet-integration-routing

kdx-perbol

08/15/2024, 7:08 AM

How are you running Umbraco in development? In a container? With IISExpress/Kestrel?

kdx-perbol

08/15/2024, 7:08 AM

I've never seen this happen even on heavily loaded Umbraco sites, so I think there's something unusual about your setup somehow.

bielu

08/15/2024, 7:09 AM

@kdx-perbol it also affects internal app-to-app connections, and it is the only way to force internal traffic through the vnet.

TackleMcClean 🏅

08/15/2024, 7:09 AM

Through Windows PowerShell, I run `dotnet run` in my project folder. From what I know, this only invokes Kestrel.

kdx-perbol

08/15/2024, 7:10 AM

It does. And you get "connection refused" from the frontend app when doing this? And you're sure those requests reach Kestrel?

kdx-perbol

08/15/2024, 7:12 AM

Is "UHeadless graphql" on :8080? Is it only those requests that are affected? Or requests to Umbraco as well?

TackleMcClean 🏅

08/15/2024, 7:12 AM

Well, I guess that depends on how we define "reach". It is supposed to reach it, but from what I've read, a connection refused basically means the connection is never established. For example, here's Nextjs trying to connect to Umbraco, but the fetch fails with:

TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11372:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async executor (webpack-internal:///./src/functions/graphql/getRemoteSchema.js:19:29)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 0)
    at async Promise.all (index 3)
    at async Promise.all (index 4)
    at async Promise.all (index 93) {
  cause: Error: connect ECONNREFUSED 192.168.65.254:8080
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '192.168.65.254',
    port: 8080
  }
}

TackleMcClean 🏅

08/15/2024, 7:13 AM

Yes Uheadless graphql responds on the same port, it's its own route in dotnet/umbraco

TackleMcClean 🏅

08/15/2024, 7:14 AM

When overloading the site with Locust, I get connection refused, and my sample request URLs are only images and a sitemap (razor template), so in that case nextjs/nodejs is not in the equation either.

bielu

08/15/2024, 7:14 AM

192.168.65.254 is an internal address, I just noticed. Interesting, so your app is blocked on an internal connection 🤔. Also, if it is in Azure, why would it use 8080? A web app exposes only 80 or SSL 443?

TackleMcClean 🏅

08/15/2024, 7:15 AM

@bielu Sorry, I forgot to mention: 192.168.65.254 is basically `host.docker.internal`, so the error above comes from a docker container talking to my "real" localhost. That's why you see that IP address. From my browser I reach Umbraco at `localhost:8080`.

TackleMcClean 🏅

08/15/2024, 7:15 AM

This error above is from local development

TackleMcClean 🏅

08/15/2024, 7:16 AM

To give an example of errors happening in prod while we're talking: in 99% of cases since yesterday, we are seeing connection refused when Nextjs is trying to get images from Umbraco. It looks like this:

2024-08-15T07:10:23.2190203Z TypeError: fetch failed
2024-08-15T07:10:23.2191069Z     at node:internal/deps/undici/undici:13178:13
2024-08-15T07:10:23.2191141Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-08-15T07:10:23.2191194Z     at async imageOptimizer (/app/node_modules/next/dist/server/image-optimizer.js:233:29)
2024-08-15T07:10:23.2191247Z     at async /app/node_modules/next/dist/server/next-server.js:157:72
2024-08-15T07:10:23.2191293Z     at async /app/node_modules/next/dist/server/response-cache.js:69:36 {
2024-08-15T07:10:23.2191340Z   [cause]: AggregateError [ECONNREFUSED]: 
2024-08-15T07:10:23.2191386Z       at internalConnectMultiple (node:net:1118:18)
2024-08-15T07:10:23.2191462Z       at afterConnectMultiple (node:net:1685:7)
2024-08-15T07:10:23.2191510Z       at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-08-15T07:10:23.2191554Z     code: 'ECONNREFUSED',
2024-08-15T07:10:23.2191597Z     [errors]: [ [Error], [Error] ]
2024-08-15T07:10:23.2191636Z   }
2024-08-15T07:10:23.2191676Z }

TackleMcClean 🏅

08/15/2024, 7:17 AM

I'm in the process of trying to dig deeper into those errors, but since it's part of the black box that is Nextjs, I will have to build the frontend app and then slightly modify `/app/node_modules/next/dist/server/image-optimizer.js` line 233 to also log which URL it is trying to fetch, or at least catch the error, log it fully with more info, and then rethrow. Not sure yet.
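
The catch, log, rethrow idea is independent of Nextjs. A minimal sketch of the pattern in Python follows. It is not the actual `image-optimizer.js` code, and `fetch_logged` and the logger name are made up for illustration.

```python
import logging
import urllib.error
import urllib.request

log = logging.getLogger("upstream-fetch")


def fetch_logged(url: str, timeout: float = 10.0) -> bytes:
    """Fetch `url`; on failure, log the URL and root cause, then re-raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except OSError as exc:
        # URLError keeps the low-level error in .reason; other OSErrors
        # (for example a timeout while reading) are the cause themselves.
        cause = getattr(exc, "reason", exc)
        log.error("fetch failed url=%s cause=%r", url, cause)
        raise  # re-raise unchanged so callers behave exactly as before
```

The point of re-raising without modification is that the surrounding error handling keeps working as it did, while the log line now names the URL that a bare "fetch failed" hides.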

bielu

08/15/2024, 7:18 AM

I remember having similar issues with nuxt.js. Are you sure your local docker image has networking set up correctly to allow connections to other containers/the host?

kdx-perbol

08/15/2024, 7:18 AM

My spontaneous take is that this is a docker congestion issue... you are running on containers in Azure also, right?

bielu

08/15/2024, 7:18 AM

That's a good question @kdx-perbol!

kdx-perbol

08/15/2024, 7:19 AM

Is there a non-Umbraco app in the solution as well? And if so, does that also get failed requests?

bielu

08/15/2024, 7:21 AM

Another question: are nuxt and umbraco in the same container or in separate containers?

TackleMcClean 🏅

08/15/2024, 7:21 AM

Correct, the backend app is umbraco, running on Linux but still as a "raw" dotnet app, which seems to use docker in the background to run it on a debian container I think. And Nextjs is a docker app, so two azure web apps essentially. No other app in the mix, no failed requests anywhere else.

TackleMcClean 🏅

08/15/2024, 7:21 AM

I wonder, does docker have some sort of connection limit on it..

bielu

08/15/2024, 7:22 AM

I didn't experience a limit on connections between docker containers locally, but I did experience limits on Azure, so it has to be something in the configuration 🤔

kdx-perbol

08/15/2024, 7:25 AM

It says above that problems were reproducible in development. Are you sure that's not the same problem? Or do you mean for other docker solutions?

TackleMcClean 🏅

08/15/2024, 7:26 AM

It is possible I'm looking at two different problems, I realized today. However, if they are two different problems, they are still very similar, in that both cases return an ECONNREFUSED from Umbraco/backend. I will try this without docker in the dev environment, to rule out docker at least.

bielu

08/15/2024, 7:27 AM

@kdx-perbol I don't think it is the same problem. I noticed he mentioned he literally took down his app with 100% CPU usage. .NET will queue requests and try to handle them when resources are freed up, but yeah, at 100% it will just start refusing 😂 @TackleMcClean 🏅 is CPU usage on your web apps at 100% when you get dropouts?

TackleMcClean 🏅

08/15/2024, 7:30 AM

No, this is during the last hour, and we're seeing lots of errors currently https://cdn.discordapp.com/attachments/1273308975709814857/1273544449175191635/image.png?ex=66bf0031&is=66bdaeb1&hm=2c74be07efcb7cac258923b59f7d5ada93e2b25339e65357e877e4b16b18ff26&

bielu

08/15/2024, 7:31 AM

Can you then confirm that you have the WEBSITE_VNET_ROUTE_ALL setting set up on both your apps? If not, can you try?

TackleMcClean 🏅

08/15/2024, 7:33 AM

I don't have that right now, but I can try. Is there any risk this blocks outgoing traffic somehow?

bielu

08/15/2024, 7:34 AM

As far as I know, it shouldn't affect routing of external connections, apart from adding latency 🙂

bielu

08/15/2024, 7:35 AM

I would probably test it first in test env

TackleMcClean 🏅

08/15/2024, 7:36 AM

Yeah, I'm seeing if I can do this on the stage environment we have. I see this in the vnet settings; is this the same as WEBSITE_VNET_ROUTE_ALL? https://cdn.discordapp.com/attachments/1273308975709814857/1273545738453778512/image.png?ex=66bf0165&is=66bdafe5&hm=9148f31c35a54fdef138d11b25df6c0bd0460506c50063fc2d65e0b672d5e30f&

TackleMcClean 🏅

08/15/2024, 7:36 AM

I don't have the WEBSITE_VNET_ROUTE_ALL env variable though

TackleMcClean 🏅

08/15/2024, 7:38 AM

> The existing WEBSITE_VNET_ROUTE_ALL app setting can still be used, and you can enable all traffic routing with either setting.

So I think I'm already on that. But I can try on stage env to see if there's a difference.

TackleMcClean 🏅

08/15/2024, 7:52 AM

I can see that on the prod frontend app there are 1219 occurrences of error 500 in the last 24 hours, all going to the URL `/_next/image/`, which is the Nextjs image optimization endpoint, so that checks out at least.
