Not able to scrape text (node fetch nor axios) with serverless function
Answered
Pond loach posted this in #help-forum
Pond loachOP
When trying to make a request from the serverless function, I get this error, but it works fine on localhost:
- info Loaded env from /var/task/.env
Received a POST request to api/context/link
Error: Runtime exited with error: signal: segmentation fault
Runtime.ExitError
lmk if you want to see the function or anything! Thank you!!
Answered by Z4NR34L
42 Replies
Pond loachOP
anyone have any idea 🙏
Hi, could you provide us with this route's code? It would be hard to help without that in that case.
or a repository link if it's public 😄
Pond loachOP
// Imports inferred from the identifiers used below
import axios from "axios";
import * as cheerio from "cheerio";
import { JSDOM } from "jsdom";
import { Readability } from "@mozilla/readability";

async function scrapeLink(
  link: string
): Promise<{ siteTitle: string; siteContent: string }> {
  try {
    console.log("scraping link:", link);
    // Fetch the page with a browser-like User-Agent
    const response = await axios.get(link, {
      headers: {
        "User-Agent":
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
      },
    });
    console.log("response:", response);
    // Build a DOM and let Readability extract the article
    const dom = new JSDOM(response.data, { url: link });
    let reader = new Readability(dom.window.document);
    let site = reader.parse();
    if (!site) {
      throw new Error("Unable to parse article");
    }
    const siteTitle = site.title;
    let oldSiteContent = site.content;
    // Reduce the extracted article HTML to plain text
    const $ = cheerio.load(oldSiteContent);
    oldSiteContent = $("body").text();
    console.log("oldSiteContent:", oldSiteContent);
    const siteContent = oldSiteContent;
    return { siteTitle, siteContent };
  } catch (error) {
    console.error("Error scraping the link:", error);
    throw error; // Re-throw the error to be handled by the caller
  }
}
@Z4NR34L
you are a legend sir
are you using edge runtime or node? 😄
Pond loachOP
i don't have edge defined at the top of the route
im using export const dynamic = "force-dynamic";
and I'll need the whole route handler code
Pond loachOP
ok
OK, I see a few things that possibly won't work in the Vercel environment. For such jobs I use the most basic fetch and node-html-parser. Here is an example from one of my endpoints:
import { NextRequest, NextResponse } from "next/server";
import { parse } from "node-html-parser";

export async function POST(request: NextRequest) {
  // Fetch the raw HTML and parse it without a full DOM implementation
  const res = await fetch("https://example.com/smth");
  const bodyHtml = parse(await res.text());
  return NextResponse.json({
    productName: bodyHtml.querySelector(".c-product-info__name")?.text,
    offers: bodyHtml
      .querySelectorAll(".c-offers-list__cont:nth-child(2) > section")
      .map((element) => ({
        // Strip the Czech "Do obchodu " ("To the store") prefix from the label
        store: element
          .querySelector(".c-offer__shop-logo-cont")
          ?.attributes["aria-label"]?.replace("Do obchodu ", ""),
        price: element.querySelector(".c-offer__price")?.text,
      })),
  });
}
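For the original scrapeLink use case (a page title plus plain text), the same fetch-then-extract shape might look roughly like the sketch below. To stay dependency-free it swaps the HTML parser for a few regular expressions, which is a cruder technique than node-html-parser and not what the answer uses. `extractTitleAndText` is a hypothetical name, and HTML entities are left undecoded.

```typescript
// Hypothetical helper: a regex-based stand-in for a real HTML parser.
// It is good enough for a sketch, but it will mis-handle unusual markup.
function extractTitleAndText(html: string): { siteTitle: string; siteContent: string } {
  // First <title> element, if any
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  const siteTitle = titleMatch?.[1]?.trim() ?? "";

  // Prefer the <body> contents; fall back to the whole document
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  const body = bodyMatch?.[1] ?? html;

  const siteContent = body
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ") // drop non-visible blocks
    .replace(/<[^>]+>/g, " ") // drop the remaining tags
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();

  return { siteTitle, siteContent };
}

// Same signature as the scrapeLink function posted earlier in the thread,
// but built on the global fetch only (no axios, jsdom, Readability or cheerio).
async function scrapeLink(link: string): Promise<{ siteTitle: string; siteContent: string }> {
  const res = await fetch(link);
  if (!res.ok) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  return extractTitleAndText(await res.text());
}
```

In a real handler, the parse and querySelector calls from the endpoint above would take the place of the regular expressions.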
Answer
And a 2nd thing - POST requests are not cached anyway, so you don't need to use force-dynamic 😄
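As a rough illustration of that point, a POST route handler can simply leave out the dynamic export. The sketch below assumes an App Router file such as app/api/context/link/route.ts (the path is inferred from the log in the question) and a JSON body shaped like { "link": "https://..." }. `parseLink` and `jsonResponse` are hypothetical helpers, and the response payload is only a placeholder for real parsing.

```typescript
// Hypothetical helper: returns a normalized http(s) URL, or null if the body is unusable.
function parseLink(body: unknown): string | null {
  if (typeof body !== "object" || body === null) return null;
  const link = (body as { link?: unknown }).link;
  if (typeof link !== "string") return null;
  try {
    const url = new URL(link);
    return url.protocol === "http:" || url.protocol === "https:" ? url.href : null;
  } catch {
    return null; // new URL() throws on malformed input
  }
}

// Hypothetical helper: builds a JSON response from standard Web APIs.
function jsonResponse(data: unknown, status = 200): Response {
  return new Response(JSON.stringify(data), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// Note: no `export const dynamic = "force-dynamic"` here.
export async function POST(request: Request): Promise<Response> {
  const link = parseLink(await request.json().catch(() => null));
  if (!link) {
    return jsonResponse({ error: "Expected a JSON body with an http(s) link" }, 400);
  }
  const res = await fetch(link);
  const html = await res.text();
  // Placeholder payload; a real handler would parse `html` here
  return jsonResponse({ link, status: res.status, length: html.length });
}
```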
Pond loachOP
ohhh ty
trying it rn
I'm not using maxDuration as those functions are pretty fast 😄
and I'm not looping anything - TBH looping in serverless endpoints is pure pain 😄
Pond loachOP
ok working on localhost
my git commit is stuck, so will test on vercel in a sec
sure, let me know 😄
normal sites work
i forgot user agent !
ok i think i got it
one last test
Remember that the websites you fetch may use various anti-bot solutions, and some of them can't be scraped from Vercel/AWS/GCP/Azure infrastructure 😄
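One way to make such blocks visible, rather than silently parsing an error page, is to send a browser-like User-Agent and inspect the status code. This is only a sketch: the User-Agent string is the one from the axios call earlier in the thread, `classifyStatus` and `fetchHtml` are hypothetical names, and treating 403 and 429 as "likely blocked" is a heuristic assumption, not something any anti-bot product guarantees.

```typescript
// Browser-like headers; the User-Agent is copied from the scrapeLink code in this thread.
const BROWSER_HEADERS: Record<string, string> = {
  "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
  Accept: "text/html,application/xhtml+xml",
};

type FetchOutcome = "ok" | "likely-blocked" | "error";

// Heuristic (an assumption): 403 and 429 often mean bot protection or rate limiting.
function classifyStatus(status: number): FetchOutcome {
  if (status >= 200 && status < 300) return "ok";
  if (status === 403 || status === 429) return "likely-blocked";
  return "error";
}

// Fetches a page and reports whether the response looks usable.
async function fetchHtml(url: string): Promise<{ outcome: FetchOutcome; html: string | null }> {
  const res = await fetch(url, { headers: BROWSER_HEADERS });
  const outcome = classifyStatus(res.status);
  // Only read the body when the request succeeded
  return { outcome, html: outcome === "ok" ? await res.text() : null };
}
```

A caller can then return a clear error for "likely-blocked" instead of trying to parse a challenge page.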
Pond loachOP
also btw do u work for vercel or something
ofc!
I wish haha, I'm just a Software Engineer with many projects on Vercel
Pond loachOP
nice
I mostly work on Business Intelligence-class systems based on Next.js + Vercel
Pond loachOP
ok it works
oh very cool
🫡
i also found you on twitter
If you don't mind, I'll leave a link to my blog here, where you can find more about Next or Vercel haha
https://www.zanreal.net/blog
Pond loachOP
ok sweet i'll check it out
Happy coding! 😄
Pond loachOP
clean ass site damn
the rebranding is still in progress
Pond loachOP
looks very cool