Is there a way to make Puppeteer scrape at a specific interval?

I’m really new to using Puppeteer, and especially to creating APIs. I’ve created an API that scrapes a table from webpages containing information I need and builds a JSON response from it, so I can display the data on the front-end. The issue I’m having now is that on Heroku, for some reason, it stops working after about a minute. I make sure the page is closed once it finishes scraping, i.e. the tab gets closed.

Here’s what I have:

const express = require('express');
const puppeteer = require('puppeteer');
const cors = require('cors');
const NodeCache = require('node-cache');
const dotenv = require('dotenv');

dotenv.config();

const app = express();
const PORT = process.env.PORT || 4000;
const CACHE_TTL = 300; // Cache for 5 minutes

const cache = new NodeCache({ stdTTL: CACHE_TTL });

app.use(cors());

app.get('/health', (req, res) => {
  res.status(200).json({ status: 'OK' });
});

app.get('/api/stations', async (req, res) => {
  const stationName = req.query.station || 'CENTRO';
  const cacheKey = `station_${stationName}`;

  try {
    // Check cache first
    const cachedData = cache.get(cacheKey);
    if (cachedData) {
      return res.json(cachedData);
    }

    const url = `http://aire.nl.gob.mx:81/SIMA2017reportes/ReporteDiariosimaIcars.php?estacion1=${stationName}`;

    const browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--single-process'
      ],
      executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || null,
      headless: true,
    });
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    await page.waitForFunction(() => {
      const tbody = document.querySelector("#tablaIMK_wrapper tbody");
      return (
        tbody &&
        tbody.innerText.trim().length > 0 &&
        !tbody.innerText.includes("No datos")
      );
    }, { timeout: 60000 });

    const jsonData = await page.evaluate(() => {
      const rows = Array.from(document.querySelectorAll("#tablaIMK_wrapper tbody tr"));
      return rows.map((row) => {
        const cells = row.querySelectorAll("td");
        return {
          parametro: cells[0]?.innerText.trim() || '',
          valor: cells[1]?.innerText.trim() || '',
          descriptor: cells[2]?.innerText.trim() || '',
        };
      });
    });

    await browser.close();

    if (jsonData.length === 0) {
      return res.status(404).json({ message: 'No data.' });
    }

    const responseData = { station: stationName, data: jsonData };
    
    // Store in cache
    cache.set(cacheKey, responseData);

    res.json(responseData);
  } catch (error) {
    console.error('Error scraping data:', error);
    res.status(500).json({ error: 'Error scraping data' });
  }
});

process.on('uncaughtException', (error) => {
  console.error('Uncaught Exception:', error);
  process.exit(1);
});

process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled Rejection at:', promise, 'reason:', reason);
});

app.listen(PORT, () => {
  console.log(`Server running on http://localhost:${PORT}`);
});

I’ve tried using puppeteer-cluster, but that didn’t work at all; the app wouldn’t even start. I know some of you may suggest storing the JSON in a database and reading it from there, but I really want to have up-to-date information the moment the page updates.

The issue you’re facing, where your Heroku app stops working after a minute, is likely related to Puppeteer’s resource usage combined with Heroku’s resource limits. Here are a few potential causes and solutions to explore:

Possible Causes:

  1. Memory or CPU Limit Exceeded: Puppeteer, especially when web scraping, can be resource-intensive. Heroku’s free or hobby dynos have limited memory and CPU; if your scraping tasks are too demanding, Heroku may stop your app to free resources. A likely contributor in the posted code is shown right after this list.
  2. Timeout: Heroku imposes a 30-second timeout limit on the request-response cycle. If your Puppeteer process takes longer than 30 seconds to scrape the page, Heroku will terminate the request, resulting in a timeout error.
  3. Heroku Dyno Sleeping: If you’re using Heroku’s free tier, dynos “sleep” after 30 minutes of inactivity, which might explain why your app stops after a while if it’s not being accessed frequently.
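
On the first cause specifically, note that in the posted handler browser.close() is only reached on success: if page.goto, waitForFunction, or evaluate throws, the catch block sends a 500 but the Chrome process is left running, and a few leaked browsers are enough to exhaust a small dyno. A minimal sketch of the usual fix, closing the browser in a finally block (the route logic is elided for brevity):

let browser;
try {
  browser = await puppeteer.launch({ args: ['--no-sandbox'], headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  // ...waitForFunction, page.evaluate, and building responseData as in the original handler...
  res.json(responseData);
} catch (error) {
  console.error('Error scraping data:', error);
  res.status(500).json({ error: 'Error scraping data' });
} finally {
  // Runs on success and failure alike, so Chrome never outlives the request.
  if (browser) await browser.close();
}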

Solutions:

  1. Reduce Puppeteer’s Resource Usage:
  • Use a minimal set of browser features: When launching Puppeteer, you can further optimize performance by disabling unnecessary features:
const browser = await puppeteer.launch({
  args: [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
    '--disable-gpu', // Disable GPU to save resources
    '--single-process', 
  ],
  headless: true,
});
  2. Use Puppeteer’s page.setRequestInterception to block unnecessary resources:
  • You can block images, CSS, and other resources that aren’t needed for scraping to speed up the process:
await page.setRequestInterception(true);
page.on('request', (request) => {
  if (['image', 'stylesheet', 'font'].includes(request.resourceType())) {
    request.abort();
  } else {
    request.continue();
  }
});
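Note that interception has to be enabled and the request handler registered before page.goto is called; otherwise the page’s initial requests won’t pass through your handler.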
  3. Work Around the Router Timeout: Heroku enforces a hard 30-second timeout for each HTTP request, and it cannot be raised. If your web scraping takes longer than that, the router will return a timeout error. The fix is to move scraping out of the request-response cycle, e.g. a queue with Heroku worker dynos (Redis + Bull/Agenda), Heroku Scheduler, or a simple in-process interval, as sketched below:
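This also addresses your original question about scraping on a fixed schedule. A minimal sketch, assuming the scraping logic from the route handler is extracted into a hypothetical scrapeStation(stationName) helper that launches Puppeteer, parses the table, and returns the rows:

const SCRAPE_INTERVAL_MS = 5 * 60 * 1000; // re-scrape every 5 minutes
const STATIONS = ['CENTRO']; // stations to keep warm in the cache

async function refreshStations() {
  for (const station of STATIONS) {
    try {
      const data = await scrapeStation(station); // hypothetical helper extracted from the route
      cache.set(`station_${station}`, { station, data });
    } catch (err) {
      console.error(`Failed to refresh ${station}:`, err);
    }
  }
}

// Refresh once at boot, then on a fixed interval; requests only ever read the cache.
refreshStations();
setInterval(refreshStations, SCRAPE_INTERVAL_MS);

With this in place, the /api/stations handler just returns cache.get(cacheKey) (or a 404 if the station hasn’t been scraped yet), so no request ever waits on Chrome and the 30-second router timeout stops being a factor.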
  4. Optimize Puppeteer Operations: Ensure that the webpage is loaded as quickly as possible:
  • Since you already wait for the table explicitly, a lighter waitUntil setting such as domcontentloaded may let you start scraping sooner than networkidle2, which waits for most network activity to settle.
  • Use waitForSelector on specific elements so you only wait for the content you actually need, for example:
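await page.goto(url, { waitUntil: 'domcontentloaded' });
// Wait only until the table body has at least one row, rather than for network idle.
await page.waitForSelector('#tablaIMK_wrapper tbody tr', { timeout: 60000 });

Note this is a weaker condition than your current waitForFunction (it doesn’t exclude a “No datos” placeholder row), so keep that extra check if the page renders a placeholder first.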
  5. Switch to a Paid Plan: If you’re on Heroku’s free or hobby plan, consider upgrading to a paid plan, which provides more consistent resources and doesn’t put dynos to sleep. This might help avoid issues with resource limits.
  6. Limit the Number of Requests: Ensure you’re not scraping too many pages too frequently, as this can cause throttling or slow down your app. Adding rate limits or intervals between scraping requests may help, for example:
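A sketch using the express-rate-limit middleware (one option among several; adjust windowMs and max to taste):

const rateLimit = require('express-rate-limit');

// Allow each client at most 10 requests per minute on the scraping route.
app.use('/api/stations', rateLimit({ windowMs: 60 * 1000, max: 10 }));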
  7. Tune the Cache TTL: You’re already caching results with NodeCache for 5 minutes, which keeps scraping frequency down; adjust the TTL to trade freshness against load. node-cache also lets you set a time-to-live per key, for example:
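// Override the default TTL for this key: cache station data for 60 seconds.
cache.set(cacheKey, responseData, 60);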

Let me know if these suggestions help or if you encounter more specific issues!