How can I troubleshoot Umbraco/dotnet refusing connections? Does Umbraco rate limit by default?
t
We have a large site with thousands of pages. It previously ran on Linux dotnet on an Ubuntu VPS and now also runs on Linux dotnet, but in Azure. Our separate Nextjs frontend is making requests to Umbraco, both to UHeadless graphql and to
/media
images for Nextjs next-image consumption. We seem to very often overload Umbraco/dotnet, resulting in Nextjs
fetch
requests that return:
TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11372:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async executor (webpack-internal:///./src/functions/graphql/getRemoteSchema.js:16:29)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 0)
    at async Promise.all (index 3)
    at async Promise.all (index 4)
    at async Promise.all (index 124) {
  cause: Error: connect ECONNREFUSED 192.168.65.254:8080
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '192.168.65.254',
    port: 8080
  }
}
This is actually from the local dev environment, so we can recreate it, but prod shows similar errors as well. Does anyone know how to troubleshoot something like this? I don't know whether dotnet or Umbraco rate limits requests by default. Our RAM and CPU consumption both stay below 50%. No error output from the
dotnet run
process. No errors in `umbraco/Logs`/Log viewer either. We are at a loss with intermittent site outage. Any suggestion is appreciated!
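The nested `Promise.all` frames in the trace above (up to index 124) suggest that well over a hundred requests may be in flight at once. If the frontend turns out to be part of the cause, capping concurrency is one thing to try. Below is a minimal sketch of the idea in Python, not the actual frontend code: `gather_limited` and `fake_request` are hypothetical names, and the limit of 10 is an arbitrary assumption.

```python
import asyncio


async def gather_limited(factories, limit):
    """Run zero-argument coroutine factories with at most `limit` in flight.

    Results come back in the same order as `factories`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:  # waits here while `limit` tasks are running
            return await factory()

    return await asyncio.gather(*(run(f) for f in factories))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_request(i):
        # Hypothetical stand-in for one GraphQL or /media request.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        in_flight -= 1
        return i

    # 125 requests, mirroring the "index 124" seen in the stack trace.
    factories = [lambda i=i: fake_request(i) for i in range(125)]
    results = await gather_limited(factories, limit=10)
    return results, peak
```

Running `asyncio.run(demo())` returns the results in order together with the peak number of simultaneous requests, which never exceeds the limit. The same pattern exists for Node in the form of small promise-pool helpers.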
d
Can you also reproduce this behaviour without nextjs?
b
Is it on Azure Web Apps? If so, you can only solve it by using a vnet. I experienced quite a lot of issues when using Nuxt SSR with a .NET API on Azure: there is a limit on connections through the public API, and the only workaround is to use vnets 😬
t
Yes we're actually already using a vnet, so I don't think we've run into that issue. Plus we can trigger the error on local dev machines.
Great question, I'm trying to set up a somewhat isolated scenario. Have to have some form of automation to make all the requests. Will try to verify this.
d
Do you know k6? I use it if I want to fire a lot of requests to a server. It's a javascript based cli tool: https://k6.io/
t
Haven't tried that. I was going to try out if Postman can trigger lots of requests, and I've also used Locust, a python based load tester
b
@TackleMcClean 🏅 are you sure you set up the app to force all traffic through the vnet? By default it will use the public network even if a private one is available.
t
Hmm, then probably not. We've only made sure the frontend and backend share a vnet. I will investigate this once I've exhausted the avenues of investigation on the local machine.
Good tip!
b
I spent a good amount of time debugging it just to learn that. I don't remember the setting now, but yeah, it was painful to find.
It was WEBSITE_VNET_ROUTE_ALL=1, but I can see Azure has new, not well documented options for it 😂😂
t
Using Locust to fetch both my sitemap and a random image, I can trigger failed requests, and the result clearly seems to be "connection refused". This is not a 1:1 scenario, and in this case my desktop is running at 100% CPU, so it's very synthetic, but it does tell us one thing: if the system is overloaded, you might see connection refused as a response. Still, the production environment rarely goes over 50% load, but maybe it's a matter of us DDoSing ourselves.
I'm just so confused as to why the dotnet process doesn't indicate this is happening at all; it just fails quietly.
If the CPU is running at 100%, should I expect Umbraco to drop connections? Or should I expect asp.net to drop connections? Or neither?
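For a check that takes both Nextjs and Locust out of the picture, the connection-level outcome can be tallied directly. This is a minimal sketch using only the Python standard library; `probe` and `probe_many` are hypothetical helper names, and the attempt and worker counts are arbitrary assumptions. It only opens TCP connections and sends no HTTP request, so it measures whether the listener accepts connections, not whether Umbraco answers.

```python
import socket
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def probe(host, port, timeout=2.0):
    """Attempt one TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"  # the condition Node reports as ECONNREFUSED
    except socket.timeout:
        return "timeout"  # no answer at all within `timeout` seconds
    except OSError:
        return "other"


def probe_many(host, port, attempts=200, workers=50, timeout=2.0):
    """Run `attempts` probes with up to `workers` in parallel and count outcomes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = pool.map(lambda _: probe(host, port, timeout), range(attempts))
        return Counter(outcomes)
```

Calling `probe_many("localhost", 8080)` while the load test is running (8080 being the local port from the traces above) would show whether refusals appear at the TCP level, independently of any HTTP client.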
n
In that scenario, I would expect the OS or dotnet to be unable to accept the incoming connections, so they are just "lost". The OS has no resources to give to the incoming connection, so it simply fails. I've seen this before when load testing sites and one or more resources on the host max out.
t
Is this the expected approach though? I would expect it to simply hold these connections in a queue and then process them one by one. Would putting a load balancer like nginx in front solve the issue maybe?
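On the queueing question: a listening socket does have a queue for pending connections (the listen backlog), but it is finite, and what happens once it is full depends on the OS. As far as I know, Linux by default drops the extra connection attempts silently, so the client sees a timeout, while Windows tends to refuse them. The sketch below makes this observable with a listener that never calls `accept()`; `fill_backlog` is a hypothetical name, and the backlog of 1 is deliberately tiny to make the effect visible. It says nothing about Kestrel's actual backlog size.

```python
import socket
from collections import Counter


def fill_backlog(backlog=1, attempts=8, timeout=0.3):
    """Open `attempts` connections to a listener that never calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(backlog)         # size hint for the pending-connection queue
    port = listener.getsockname()[1]

    outcomes = Counter()
    clients = []  # keep connected sockets open so the queue stays full
    try:
        for _ in range(attempts):
            client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            client.settimeout(timeout)
            try:
                client.connect(("127.0.0.1", port))
                outcomes["connected"] += 1
                clients.append(client)
            except ConnectionRefusedError:
                outcomes["refused"] += 1
                client.close()
            except socket.timeout:
                outcomes["timeout"] += 1
                client.close()
            except OSError:
                outcomes["other"] += 1
                client.close()
    finally:
        for client in clients:
            client.close()
        listener.close()
    return outcomes
```

The exact split between "connected", "refused" and "timeout" from `fill_backlog()` varies by platform, which is the point: the first few connections are queued, and the rest fail in whatever way the OS chooses.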
k
WEBSITE_VNET_ROUTE_ALL
routes all outgoing traffic through the virtual network. https://learn.microsoft.com/en-us/azure/app-service/configure-vnet-integration-routing
How are you running Umbraco in development? In a container? With IISExpress/Kestrel?
I've never seen this happen even on heavily loaded Umbraco sites, so I think there's something unusual about your setup somehow.
b
@kdx-perbol it also affects internal app-to-app connections, and it is the only way to force internal traffic through the vnet.
t
Through windows powershell, in windows, I run
dotnet run
in my project folder. From what I know this only invokes Kestrel.
k
It does. And you get "connection refused" from the frontend app when doing this? And you're sure those requests reach Kestrel?
Is "UHeadless graphql" on :8080? Is it only those requests that are affected? Or requests to Umbraco as well?
t
Well, I guess that depends on how we define "reach", it is supposed to reach it but from what I've read, a connection refused basically means the connection is never established. For example, here's nextjs trying to connect to Umbraco, but the fetch fails with:
TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11372:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async executor (webpack-internal:///./src/functions/graphql/getRemoteSchema.js:19:29)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 1)
    at async Promise.all (index 0)
    at async Promise.all (index 3)
    at async Promise.all (index 4)
    at async Promise.all (index 93) {
  cause: Error: connect ECONNREFUSED 192.168.65.254:8080
      at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16) {
    errno: -111,
    code: 'ECONNREFUSED',
    syscall: 'connect',
    address: '192.168.65.254',
    port: 8080
  }
}
Yes Uheadless graphql responds on the same port, it's its own route in dotnet/umbraco
When overloading the site with Locust, I get connection refused, and my sample request URLs are only images and a sitemap (razor template), so in that case nextjs/nodejs is not in the equation either.
b
I just noticed 192.168.65.254 is an internal address. Interesting, so your app is blocked on an internal connection 🤔 Also, if it is in Azure, why would it use 8080? A web app exposes only 80, or 443 for SSL.
t
@bielu Sorry I forgot to mention, 192.168.65.254 is basically
host.docker.internal
, so the error above comes from a docker container talking to my "real" localhost, that's why you see that IP address. From my browser I reach Umbraco from
localhost:8080
This error above is from local development
To give an example of errors happening in prod while we're talking: since yesterday, in 99% of cases we are seeing connection refused when nextjs tries to get images from umbraco. It looks like this:
2024-08-15T07:10:23.2190203Z TypeError: fetch failed
2024-08-15T07:10:23.2191069Z     at node:internal/deps/undici/undici:13178:13
2024-08-15T07:10:23.2191141Z     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2024-08-15T07:10:23.2191194Z     at async imageOptimizer (/app/node_modules/next/dist/server/image-optimizer.js:233:29)
2024-08-15T07:10:23.2191247Z     at async /app/node_modules/next/dist/server/next-server.js:157:72
2024-08-15T07:10:23.2191293Z     at async /app/node_modules/next/dist/server/response-cache.js:69:36 {
2024-08-15T07:10:23.2191340Z   [cause]: AggregateError [ECONNREFUSED]: 
2024-08-15T07:10:23.2191386Z       at internalConnectMultiple (node:net:1118:18)
2024-08-15T07:10:23.2191462Z       at afterConnectMultiple (node:net:1685:7)
2024-08-15T07:10:23.2191510Z       at TCPConnectWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
2024-08-15T07:10:23.2191554Z     code: 'ECONNREFUSED',
2024-08-15T07:10:23.2191597Z     [errors]: [ [Error], [Error] ]
2024-08-15T07:10:23.2191636Z   }
2024-08-15T07:10:23.2191676Z }
I'm in the process of trying to dig deeper on those errors, but since it's part of the black box that is nextjs I will have to build the frontend app and then slightly modify
/app/node_modules/next/dist/server/image-optimizer.js
line 233 to also log out what url it is trying to fetch, or at least catch the error, fully log it with more info, and then rethrow, not sure yet.
b
I remember having similar issues with nuxt.js, but are you sure your local docker image has networking set up correctly to allow connections to other containers and the host?
k
My spontaneous take is that this is a docker congestion issue... you are running on containers in Azure also, right?
b
that's good question @kdx-perbol!
k
Is there a non-Umbraco app in the solution as well? And if so, does that also get failed requests?
b
Another question: are Next.js and Umbraco in the same container or in separate containers?
t
Correct, the backend app is umbraco, running on Linux but still as a "raw" dotnet app, which seems to use docker in the background to run it on a debian container I think. And Nextjs is a docker app, so two azure web apps essentially. No other app in the mix, no failed requests anywhere else.
I wonder, does docker have some sort of connection limit on it..
b
I didn't experience a connection limit between docker containers locally, but I did experience limits on Azure, so it has to be something in the configuration 🤔
k
It says above that problems were reproducible in development. Are you sure that's not the same problem? Or do you mean for other docker solutions?
t
I realized today that I may be looking at two different problems. However, even if they are different, they are still very similar, in that both cases return an ECONNREFUSED from Umbraco/backend. I will try this without docker in the dev environment, to rule out docker at least.
b
@kdx-perbol I don't think it is the same problem. I noticed he mentioned he literally took down his app with 100% CPU usage. .NET will queue requests and try to handle them when resources are freed up, but yeah, at 100% it will just start refusing 😂 @TackleMcClean 🏅 is CPU usage on your web apps at 100% when you get dropouts?
b
Can you then confirm that you have the WEBSITE_VNET_ROUTE_ALL setting set up on both your apps? If not, can you try it?
t
I don't have that right now, but I can try. Is there any risk this blocks outgoing traffic somehow?
b
As far as I know, it shouldn't affect routing of external connections, apart from adding latency 🙂
I would probably test it first in test env
t
Yeah, I'm seeing if I can do this on the stage environment we have.
I see this in the vnet settings, is this the same as WEBSITE_VNET_ROUTE_ALL? https://cdn.discordapp.com/attachments/1273308975709814857/1273545738453778512/image.png?ex=66bf0165&is=66bdafe5&hm=9148f31c35a54fdef138d11b25df6c0bd0460506c50063fc2d65e0b672d5e30f&
I don't have the WEBSITE_VNET_ROUTE_ALL env variable though
> The existing WEBSITE_VNET_ROUTE_ALL app setting can still be used, and you can enable all traffic routing with either setting.
So I think I'm already on that. But I can try on the stage env to see if there's a difference.
I can see that on the prod frontend app there were 1219 HTTP 500 errors in the last 24 hours, all going to the url
/_next/image/
which is nextjs image optimization endpoint, so that checks out at least.