Performance of self hosted next-server degrades when (substantially) more pages are added.
Unanswered
Transvaal lion posted this in #help-forum
Transvaal lionOP
My server is running well and scoring 100s on most existing pages. But with my new page that generates 70,000 new pages the performance completely degrades. Some additional info:
- Speed index and first contentful paint are under 1.0s with my current 1090 pages total. At 15000 pages this degrades to a 1.5-2s Speed Index. At 70000 pages it degrades to 4.5-7s.
- My VPS has 4 cores and 8 GB RAM. I'm also running a Node.js backend on it, but overall memory and CPU usage don't exceed ~60%.
- I can tell via top that next-server goes from 20-30% CPU usage to 130%+, which tells me it's doing single-threaded work plus some async work. This happens when I run an Unlighthouse test on my site, and I'm only doing this for the 1090 existing pages.
- The 70000 pages are never loaded and also don't come into scope via a <Link>; they're only accessible directly.
I cannot find any documentation or earlier issues saying anything about more pages causing more passive CPU usage. It'd make sense if my site had high traffic and these new pages had to be revalidated, but that's not the case.
I've attempted to tweak my next.config.mjs, but this didn't change much, if anything:
/** @type {import('next').NextConfig} */
const nextConfig = {
  cacheMaxMemorySize: 500 * 1024 * 1024,
  swcMinify: true,
  optimizeFonts: true,
  async headers() {
    return [
      {
        source: '/:path*', // This pattern matches all pages
        headers: [
          {
            key: 'Cache-Control',
            // Publicly cacheable, but must revalidate after 60 seconds
            value: 'public, max-age=60, must-revalidate',
          },
        ],
      },
    ];
  },
};
export default nextConfig;
99 Replies
You might need to upgrade your server to higher specs
@Anay-208 You might need to upgrade your server to higher specs
Transvaal lionOP
I considered that but CPU usage and memory usage aren't near being capped. ~40% CPU total
~35% RAM total
Some SSD usage from the backend server fetching things
It's a 2.5ghz base Intel CPU.
My question would be then, what's the bottleneck? And what drives CPU usage up drastically from pages not even being used.
Transvaal lionOP
Bump, if any1 has info
Transvaal lionOP
Bump
Transvaal lionOP
bump
Transvaal lionOP
bump, tldr: why do CPU usage and next-server response time increase massively when you go from 1k to 25k pages, for example?
Common Murre
Why does ur server even generate 70_000 site?
Transvaal lionOP
It's not really relevant to this discussion, but they're just the tradeable items in World of Warcraft, which I'm tracking and showing market info for
@Common Murre Why does ur server even generate 70_000 site?
Transvaal lionOP
which seems like a fine next js usecase for me
Common Murre
Can't you render the pages dynamically at runtime?
@Common Murre Can't you render the pages dynamically at runtime?
Transvaal lionOP
Not sure if I follow what your question entails. They're ISR pages with a revalidate on them. Yes, I want them to not be client-side rendered, and the amount of data required from the backend makes it take too long for SSR at runtime. Plus it would quickly become an undesirable amount of data fetching from the backend
My reason to make the site with next.js was to leverage ISR in the first place, but are you suggesting the large amount of ISR pages in general is the reason for slowing down unrelated pages and high CPU usage?
I think it's a valid point for consideration if it does alleviate the performance issues. But i'm not sure if SSR will give acceptable results. Is it a known fact though that more ISR pages makes the rest of the site slower at a certain point?
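For readers unfamiliar with the option being discussed here: in the Pages Router, an ISR page does not have to be pre-built. The sketch below shows one way to pre-build only a small subset of item pages and let the rest be rendered on first request via fallback: 'blocking'. The file path, the fetchItem helper, the list of popular IDs, and the revalidate interval are all hypothetical; the thread does not show how the original pages are set up, and whether this reduces the CPU usage described above is untested.

```javascript
// Hypothetical sketch of pages/items/[id].js (Pages Router).
// In a real page file both functions are exported, e.g. `export async function getStaticPaths()`.

const REVALIDATE_SECONDS = 3600; // assumed interval, not taken from the thread

// Hypothetical stand-in for the backend call that returns market data for one item.
async function fetchItem(id) {
  return { id, name: `Item ${id}` };
}

// Pre-build only a few pages at build time; every other id is rendered
// on its first request and then cached like any other ISR page.
async function getStaticPaths() {
  const popularIds = ['1', '2', '3']; // hypothetical subset
  return {
    paths: popularIds.map((id) => ({ params: { id } })),
    fallback: 'blocking',
  };
}

// Runs at build time for pre-built paths, and on demand for all others.
async function getStaticProps({ params }) {
  const item = await fetchItem(params.id);
  if (!item) {
    return { notFound: true };
  }
  return { props: { item }, revalidate: REVALIDATE_SECONDS };
}
```

With this shape the build only generates the listed paths, which is the same lever as limiting the number of ISR pages at build time.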
Honestly, might be better to ask in the github discussions, nobody here has tested nextjs to the extent you are doing.
Github is gonna lead to you getting an answer after a while, but since your question is intresting, some contributor might pick it up
Transvaal lionOP
Alright appreciate
@Transvaal lion Alright appreciate
if any of them respond, copy the discussions link, send it here and mark it as the ans
Transvaal lionOP
Will do
@Arinji if any of them respond, copy the discussions link, send it here and mark it as the ans
Transvaal lionOP
Think this includes all necessities? https://github.com/vercel/next.js/discussions/63204
@Transvaal lion Think this includes all necessities? <https://github.com/vercel/next.js/discussions/63204>
wow yes that shld be more than enough, add the page number in the title
you gotta clickbait the contributers into opening the discussion
Transvaal lionOP
smart man
Transvaal lionOP
hahaha, I mean im just super curious for the technical reason also and how to alleviate this
yea me to xD, its an intresting thing
Transvaal lionOP
surely next.js can actually handle sites of this size
Transvaal lionOP
bump
Stony gall
my quick guess is the page revalidation, have you tried the on-demand revalidation?
@Stony gall my quick guess is the page revalidation, have you tried the on-demand revalidation?
Transvaal lionOP
ye it was the same, i mean - the main crux of this is - why do pages that aren't even browsed to or inscope for a prefetch or anything affect other pages
what i didnt test is moving the pages to the app router
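For context, on-demand revalidation in the Pages Router means regenerating a specific page from an API route by calling res.revalidate(path), instead of relying on a time-based revalidate. A minimal sketch follows, assuming a hypothetical /items/[id] route and a hypothetical REVALIDATE_SECRET environment variable; it illustrates the mechanism only and, as noted above, switching to it did not change the behaviour in this case.

```javascript
// Hypothetical sketch of pages/api/revalidate.js.
// In a real project the function is the default export: `export default handler`.
async function handler(req, res) {
  // Shared secret so only the backend can trigger regeneration (assumed env var name).
  const secret = process.env.REVALIDATE_SECRET;
  if (!secret || req.query.secret !== secret) {
    return res.status(401).json({ message: 'Invalid token' });
  }

  try {
    // Regenerate exactly one page; the /items/[id] route is an assumption.
    await res.revalidate(`/items/${req.query.id}`);
    return res.json({ revalidated: true });
  } catch (err) {
    // If regeneration fails, Next.js keeps serving the last successfully generated page.
    return res.status(500).send('Error revalidating');
  }
}
```

The backend would call this route whenever an item's market data changes, so untouched pages are never regenerated.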
Transvaal lionOP
bump
Transvaal lionOP
bump
Transvaal lionOP
bump
Toyger
you can try logging it with OpenTelemetry https://nextjs.org/docs/app/building-your-application/optimizing/open-telemetry maybe it will show the current bottleneck
American Crow
upvoted on GH also interested in this
@Toyger you can try log it with opentelemtry https://nextjs.org/docs/app/building-your-application/optimizing/open-telemetry maybe it will show current bottleneck
Transvaal lionOP
Thank you, I'll attempt it when I'm next able and post the results here/to GH
@Toyger you can try log it with opentelemtry https://nextjs.org/docs/app/building-your-application/optimizing/open-telemetry maybe it will show current bottleneck
Transvaal lionOP
struggling to set it up tbh, unless I see it wrong you have to manually instrument parts?
@Transvaal lion struggling to set it up tbh, unless I see it wrong you have to manually instrument parts?
Toyger
they have ready collector for this https://nextjs.org/docs/app/building-your-application/optimizing/open-telemetry#testing-your-instrumentation
and documentation states that it can be used with @vercel/otel
Using OpenTelemetry Collector
When you are deploying with OpenTelemetry Collector, you can use @vercel/otel. It will work both on Vercel and when self-hosted.
Transvaal lionOP
Yea, I set up the docker but not getting anything which made me think I had to add spans myself
Transvaal lionOP
Alright I got it
What am I looking for specifically?
@Transvaal lion What am I looking for specifically?
Toyger
theoretically it should have breakdown on which part take more time, so you'll find your bottleneck and try to investigate it
Transvaal lionOP
Yea but which one of the three is interesting to use with that, not sure if you have any experience with it?
Toyger
I worked with jaeger on custom backends, didn't tried it with nextjs yet, but based on their docs you should have enough info for investigating.
Transvaal lionOP
And yeah I can see them accumulate in Jaeger, running Unlighthouse now to stress the site a tiny bit
its occuring now too the inconsistency
But Jaeger doesn't seem to show the same kind of inconsistencies
Gharial
dumb question, but what are the disk latency and IO wait? also as you said its a VPS, are test from other times (morning, afternoon, night) different?
Transvaal lionOP
Good question, I tested neither of those but I'd probably have to shut my site down to give it a fair test.
Purely anecdotally, I did swap from npm to yarn pnp to get around ridiculous times it took to get the full node_modules built
So it may struggle with a lot of small files
Gharial
i don't want to steer this in the wrong direction, but some virtualizers and providers set a threshold for max reads and writes on storage. there could also be a timing issue with accessing the storage behind the "host node" itself, which will scale with a larger number of files. but i think this would mostly only apply to files that are not cached in memory (read from disk, block, or other network attached storage)
this is something i learned , when i was a provider. so take it with some salt 😄
Transvaal lionOP
Just technically I wonder how this would occur right. So there's a large amount of pages, that are not accessed. Why does the CPU usage on paths increase by the mere existence of them?
Anyways I added onto the GH discussion that I setup otel if there's any specific info anyone would want. Do you have recommendations for testing the disk latency and IO?
Gharial
There are benchmarks out there, but testing IO to factor it out on a VPS is somewhat unreliable. It would have to be done over days and, in most cases, in coordination with the provider itself.
Transvaal lionOP
yea your hypothesis is that it's not CPU bottlenecked but possibly IO bottlenecked because its on a VPS where multiple tenants will be on the same hardware (SSD)
and if I happen to be in a shard where a lot of ppl are doing IO heavy operations it could affect my performance
it's obviously one of the more obscure issues that can result from a non dedicated VPS, but I would wonder how next.js could potentially trigger IO bottlenecks while mongodb performs really well which in the end also are just files on your disk
Gharial
tbh i dont know how nextjs handles the pattern matching in cache control, but if my theory is right, it tries to "search" all paths on disk, until the pattern is met, which would also make sense, that it only applies, when you add more "items" pages
Transvaal lionOP
no shot right
surely it just caches it in some kind of dictionary
since every path is an unique key
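The two hypotheses in this exchange can be made concrete with a toy comparison. The sketch below is not how Next.js resolves routes (the thread never establishes that); it only illustrates why a lookup that scans every known path gets slower as pages are added, while a dictionary-style lookup does not. The /items/N paths are made up.

```javascript
// Build N hypothetical static routes: /items/0, /items/1, ...
function buildRoutes(count) {
  return Array.from({ length: count }, (_, i) => `/items/${i}`);
}

// Linear strategy: compare the requested path against every known route in turn.
function findLinear(routes, path) {
  let comparisons = 0;
  for (const route of routes) {
    comparisons += 1;
    if (route === path) return { found: true, comparisons };
  }
  return { found: false, comparisons };
}

// Dictionary strategy: a single hash lookup, independent of the route count.
// `comparisons: 1` is a conceptual count, not a measurement.
function findIndexed(routeSet, path) {
  return { found: routeSet.has(path), comparisons: 1 };
}

const routes = buildRoutes(70000);
const routeSet = new Set(routes);

// The last route is the worst case for the linear scan.
console.log(findLinear(routes, '/items/69999'));
console.log(findIndexed(routeSet, '/items/69999'));
```

If a profile of next-server showed time growing with the number of pages inside a path-matching or cache-lookup function, that would point toward the first pattern; without a profile this remains speculation.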
Transvaal lionOP
bump | GH discussion: https://github.com/vercel/next.js/discussions/63204
Transvaal lionOP
bump
Transvaal lionOP
bump
Transvaal lionOP
i'm sad, cause my websites performance is bad. so now I rhyme in the hopes that someone will make it fine.
Transvaal lionOP
bump
Transvaal lionOP
bump
Transvaal lionOP
bump
Transvaal lionOP
bump, i really don't know what to do except keep bumping this. I just want clarification if this is how it is. Am I supposed to self shard my server if that's what Vercel normally does on their side etc
Transvaal lionOP
bump
@Transvaal lion bump, i really don't know what to do except keep bumping this. I just want clarification if this is how it is. Am I supposed to self shard my server if that's what Vercel normally does on their side etc
Vercel kinda containerizes your app and scales it up and down based on usage, also any dynamic stuff happens inside serverless functions. Honestly have no idea what exactly you need to be able to keep up with 70k pages but ig upping the I/O and CPU specs would be a start and containerization might need to be put in place.
Transvaal lionOP
Yeah, but the main point of this question is: why does the very existence of new ISR pages, which are not in scope or browsed to, completely deteriorate the performance of the site?
Sure, maybe cache performance goes down slightly. But we're talking just browsing to 1000 existing pages one by one. Getting slower on average by multiple seconds, just because more ISR pages exist that are not refreshed or browsed to in any way.
I'm not 100% sure, but the router might be bogging things down? Maybe it's struggling to navigate through all the cache. I am also not completely sure, but maybe try opening an issue on GitHub and asking about possible causes and maybe an explanation there
Transvaal lionOP
I did
Transvaal lionOP
bump
Hi, so the issue is specific to Vercel ?
if yes that's their support you need to reach out
if not, you need to do some profiling to figure which method is taking so long
they have added some docs about memory profiling recently https://nextjs.org/docs/app/building-your-application/optimizing/memory-usage
otherwise it's a normal Node.js app so you can probably profile locally to see what happens
you are focusing on Lighthouse but that's much too late in the process
you need OpenTelemetry as suggested above or any kind of Node.js profiling
I am surprised that perf of a single page is affected by using ISR, but at a certain scale it's not surprising to find new bottlenecks
70k pages is not that common, usually your website gets rebuilt long before you actually render a significant number of pages
@Eric Burel Hi, so the issue is specific to Vercel ?
Transvaal lionOP
>Hi, so the issue is specific to Vercel ?
No it's self hosted.
>they have added some docs about memory profiling recently https://nextjs.org/docs/app/building-your-application/optimizing/memory-usage
And cool I didn't know about the new feature in 14.2 I'll try that out
>you are focusing on Lighthouse but that's much too late in the process
Well, it's just the easiest indicator; top also indicates that something weird is happening. I understand it's a Node app in the back, which is why 140% CPU usage is about the limit it can consume: some async processes plus full main thread usage. But it's just crazy how the CPU goes from being unnoticeable to 140% just because other pages are added.
@Eric Burel 70k pages is not that common, usually your websites gets rebuilt long before you actually render a significant amount of pages
Transvaal lionOP
Yea I understand, I slimmed it down to about 40k right now on my live production, but as you can see from the details in my GH issue, the problem seems to scale almost linearly with the number of pages.
@Eric Burel you need OpenTelemetry as suggested above or any kind of Node.js profiling
Transvaal lionOP
I implemented OpenTelemetry but failed to gain useful insights from it, which is probably due to my lack of understanding of how to properly utilize it.
@Transvaal lion I implemented OpenTelemetry but failed to gain useful insights from it, which is probably due to my lack of understanding of how to properly utilize it.
OpenTelemetry can be a bit too big indeed, it's made to monitor distributed systems
perhaps you'd want to use Node.js profiler, the problem with top and all is that you see performance but it's not made to locate issues/ do root cause analysis
the heap profiler could help, or any CPU profiling solutions for Node.js that would let you get a snapshot
I don't use that much but these features exist
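As a lighter first step before a full CPU profile, one can check whether the Node.js main thread is actually saturated by measuring event loop lag: how late a repeating timer fires compared to its schedule. The sketch below is an illustration only, not something proposed in the thread. It assumes Node 16+ (where performance is a global) and would have to run inside the server process to say anything about next-server. It shows whether the main thread is blocked, not which function is blocking it.

```javascript
// Summarize lag samples (in milliseconds) into count, mean and max.
function summarizeLag(samples) {
  if (samples.length === 0) return { count: 0, meanMs: 0, maxMs: 0 };
  const total = samples.reduce((sum, value) => sum + value, 0);
  return {
    count: samples.length,
    meanMs: total / samples.length,
    maxMs: Math.max(...samples),
  };
}

// Fire a timer every `intervalMs` and record how much later than scheduled each tick runs.
// A busy main thread delays the ticks, so the recorded lag grows.
function sampleEventLoopLag(intervalMs, sampleCount, onDone) {
  const samples = [];
  let last = performance.now();
  const timer = setInterval(() => {
    const now = performance.now();
    samples.push(Math.max(0, now - last - intervalMs));
    last = now;
    if (samples.length >= sampleCount) {
      clearInterval(timer);
      onDone(summarizeLag(samples));
    }
  }, intervalMs);
  return timer;
}

// Example: take 10 samples at 20 ms intervals and print the summary.
sampleEventLoopLag(20, 10, (summary) => console.log('event loop lag', summary));
```

Comparing the summary with 1k pages and with 25k+ pages under the same Unlighthouse run would show whether the extra CPU time is spent blocking the main thread, which is where a CPU profile would then be worth taking.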
Transvaal lionOP
Yeah, honestly I was hoping someone from Vercel would read this, understand what's going on, and give concrete guidelines on what steps to take. The only thing I can think of myself is that the cache overhead gets out of hand. I'll try the memory profiling from 14.2 when I have time and report the findings back on the GH issue
Transvaal lionOP
Still lacking info on this topic. This also seems similar to:
https://github.com/vercel/next.js/discussions/58822
https://github.com/vercel/next.js/discussions/58676
Transvaal lionOP
Oh wow, I just experimented with:
output: 'standalone',
My build got DISGUSTINGLY slow, like actually 30x slower. I had to cancel my CI pipeline and limit my ISR pages from 26k to 1.6k, but now that it's built there's almost no CPU usage by the Next.js server anymore. The process name in top also changed, and the speed index / random slowdowns completely disappeared