Batching Records during Data Egress

Written by Elisa Dinsmore

When working with large datasets, it's often necessary to retrieve records in pages to manage load and handle data efficiently. The Flatfile API provides a straightforward way to fetch records across multiple pages using the pageSize and pageNumber parameters. This approach lets you access large datasets without overwhelming your application or the API.

Understanding Pagination Parameters

  • pageSize: Defines the number of records to return in a single request. Setting an optimal pageSize helps balance the trade-off between the number of requests and the load each request places on the network and the application. The default pageSize for the GET Records endpoint is 10,000 records, though we support the retrieval of up to 50,000 records at a time.

  • pageNumber: Specifies the page number of the records to retrieve. This parameter is used to navigate through pages in the dataset.
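To illustrate these two parameters in isolation, here's a minimal sketch of fetching a single page of records with the @flatfile/api client. It assumes your FLATFILE_API_KEY is available in the environment, and the sheet ID shown is a placeholder, not a real value:

import api from "@flatfile/api";

async function fetchOnePage() {
  // Fetch the second page of 1,000 records from a sheet.
  // "us_sh_your_sheet_id" is a placeholder; substitute a real sheet ID.
  const response = await api.records.get("us_sh_your_sheet_id", {
    pageSize: 1000,
    pageNumber: 2,
  });

  console.log(`Fetched ${response.data.records?.length ?? 0} records`);
}

fetchOnePage();

Together, the two parameters act as a window over the dataset: pageSize sets the window's width, and pageNumber slides it forward one step at a time.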

Example Scenario: Retrieving Workbook Records

In the example below, we're working within an event listener that triggers on a workbook's submit action. The goal is to retrieve all records from each sheet of the workbook, handling the data page by page to manage potentially large datasets efficiently.

Step-by-Step Process

  1. Acknowledge the Job: Start by acknowledging the job to move it to the "executing" status. This signals that work has begun as soon as the submit action is triggered.

  2. List Sheets and Initialize Record Storage: Obtain a list of all sheets within the workbook and prepare a storage structure for the records you're about to retrieve.

  3. Iterate Through Each Sheet:

    1. Retrieve the total count of records for each sheet to calculate the total pages required (totalPages).

    2. Set up a loop to iterate through each page of records based on totalPages.

  4. Fetch Records Page by Page:

    1. For each page, use the API to fetch a page of records using the sheet ID, pageSize, and pageNumber.

    2. Log the current page number to monitor the process and append the fetched records to your storage structure.

  5. Update Job Progress: Optionally, calculate and update the job's progress after each page fetch. This helps in monitoring the overall progress of the job.

  6. Submit Data: Once all records are retrieved and stored, submit the data to the desired endpoint. In this example, data is sent to a webhook.

  7. Complete or Fail the Job: Depending on the outcome, mark the job as complete or fail it with an appropriate message.

Code Block Example

Here's an example code block demonstrating the key steps. Note that the webhookReceiver URL below is a placeholder; substitute your own endpoint:

import api from "@flatfile/api";
import axios from "axios";
import { FlatfileEvent, FlatfileListener } from "@flatfile/listener";

// The webhook.site URL that will receive the submitted records (placeholder)
const webhookReceiver = "https://webhook.site/your-unique-url";

export default function (listener: FlatfileListener) {
  listener
    .filter({ job: "workbook:submitAction" })
    .on("job:ready", async (event: FlatfileEvent) => {
      // Retrieving the information we need out of the event context
      const { jobId, workbookId } = event.context;
      try {
        // Acknowledging the job to move it to "executing" status
        await api.jobs.ack(jobId, {
          info: "Starting job to submit action to webhook.site",
          progress: 5,
        });

        // Collecting all sheet and record data from the Workbook
        const { data: sheets } = await api.sheets.list({ workbookId });
        const records: { [name: string]: any } = {};

        // Iterating through each sheet to obtain its records in a paginated manner
        for (const [index, sheet] of sheets.entries()) {
          // Retrieving the total record count for the sheet
          const totalRecords = (await api.sheets.getRecordCounts(sheet.id))
            .data.counts.total;

          // Setting the number of records per page
          const pageSize = 1000;

          // Calculating the total number of pages to fetch for this sheet
          const totalPages = Math.ceil(totalRecords / pageSize);

          // Initializing an array to store this sheet's records
          records[`Sheet[${index}]`] = [];

          // Fetching and storing records page by page
          for (let pageNumber = 1; pageNumber <= totalPages; pageNumber++) {
            const pageData = await api.records.get(sheet.id, {
              pageSize,
              pageNumber,
            });

            // Logging each processed page so the run can be monitored
            console.log(`Page Number: ${pageNumber}`);

            // Appending the current page's records to the sheet's record array
            records[`Sheet[${index}]`].push(...(pageData.data.records ?? []));

            // Calculating progress as the percentage of pages fetched so far,
            // which distributes progress evenly across pages
            const progress = Math.round((pageNumber / totalPages) * 100);

            // Updating the job with the current progress
            await api.jobs.ack(jobId, { progress });
          }
        }

        // Sending the data to our sample destination (the webhook.site URL
        // defined in the "webhookReceiver" variable above)
        await axios.post(
          webhookReceiver,
          { records },
          { headers: { "Content-Type": "application/json" } }
        );

        // Completing the job once all records are sent to the webhookReceiver
        await api.jobs.complete(jobId, {
          outcome: {
            message: `Data was successfully submitted to Webhook.site. Go check it out at ${webhookReceiver}.`,
            acknowledge: true,
          },
        });
      } catch (error) {
        await api.jobs.fail(jobId, {
          outcome: {
            message: "Record submission failed; please try again.",
          },
        });
      }
    });
}

Retrieving records in a paginated fashion is crucial for handling large datasets efficiently. By following the steps outlined above, you can implement a robust solution for fetching and managing data with the Flatfile API. Be sure to adjust the pageSize to your application's needs and expected dataset sizes to optimize performance and user experience.
