Working with PDFs in the Browser Using Modern Web APIs

Client-Side PDF Processing: A Modern Approach

The idea that PDF manipulation requires a server has become outdated. Modern JavaScript libraries combined with browser APIs like Canvas, Web Workers, and ArrayBuffers make it possible to render, merge, split, annotate, and extract text from PDFs entirely on the client side. This guide explores the two key libraries that make this possible: pdf.js for rendering and pdf-lib for manipulation.

Why Client-Side PDF Processing?

Traditional PDF processing involves uploading files to a server, processing them with tools like Ghostscript, ImageMagick, or server-side libraries, and returning the result. This approach has several drawbacks:

Privacy concerns: Uploading sensitive documents (contracts, medical records, financial statements) to a third-party server creates data exposure risks. With client-side processing, documents never leave the user's device.

Server costs: PDF processing is CPU-intensive. Handling thousands of concurrent PDF operations requires significant server resources. Client-side processing distributes this workload across users' browsers.

Latency: Uploading a 50MB PDF, processing it, and downloading the result takes seconds to minutes. Client-side processing removes the network round trip, so the only wait is the processing itself.

Offline capability: Once the application has been loaded and cached, client-side tools work without an internet connection, making them suitable for field workers, airline passengers, and restricted network environments.

pdf.js: Mozilla's PDF Rendering Engine

pdf.js is the open-source PDF rendering engine developed by Mozilla for Firefox's built-in PDF viewer. It can render most PDF documents to an HTML Canvas element and extract their text content.

How pdf.js works:

Parsing: The library reads the PDF's binary structure — cross-reference tables, object streams, and page trees — using typed arrays (Uint8Array).
Font handling: PDF files embed fonts in various formats (Type 1, TrueType, CFF). pdf.js converts these to web-compatible formats for rendering.
Rendering: The page content stream (a sequence of drawing commands) is interpreted and rendered onto a Canvas 2D context — drawing text, paths, images, and gradients.
Text extraction: For searchable text, pdf.js extracts text content with position information, enabling text selection, search, and accessibility.
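
As a small illustration of the byte-level work in the parsing step, the sketch below checks whether a Uint8Array starts with the "%PDF-" signature and reads the version that follows. The name readPdfVersion is made up for this example, and the code assumes the signature sits at offset 0. It is not how pdf.js is implemented internally; it only shows the kind of typed-array inspection involved.

```javascript
// Hypothetical helper: inspect the first bytes of a file for the PDF signature.
// A PDF file starts with "%PDF-" followed by a version such as "1.7".
function readPdfVersion(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length + 3) return null;
  for (let i = 0; i < signature.length; i++) {
    if (bytes[i] !== signature[i]) return null;
  }
  // The version is the three ASCII characters after the signature, e.g. "1.7".
  const version = String.fromCharCode(bytes[5], bytes[6], bytes[7]);
  return /^\d\.\d$/.test(version) ? version : null;
}
```

With a File from an input element, this could be called as readPdfVersion(new Uint8Array(await file.arrayBuffer())) to reject non-PDF uploads before handing them to a library.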

Basic usage:

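import * as pdfjsLib from 'pdfjs-dist';

// Point pdf.js at its worker script (the path depends on how you serve it)
pdfjsLib.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.js';

// Load a PDF from a file input (`file` is a File, e.g. input.files[0])
const fileBuffer = await file.arrayBuffer();
const pdf = await pdfjsLib.getDocument({ data: fileBuffer }).promise;

// Render page 1 to canvas (page numbers are 1-based)
const page = await pdf.getPage(1);
const viewport = page.getViewport({ scale: 1.5 });
const canvas = document.createElement('canvas');
canvas.width = Math.floor(viewport.width);
canvas.height = Math.floor(viewport.height);
const ctx = canvas.getContext('2d');
await page.render({ canvasContext: ctx, viewport }).promise;

// Attach the canvas so the rendered page is visible
document.body.appendChild(canvas);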

pdf-lib: Creating and Modifying PDFs

While pdf.js renders PDFs visually, pdf-lib creates and modifies PDF documents at the structural level. It can merge documents, add pages, embed images, fill form fields, and copy pages between documents.

Merging two PDFs:

import { PDFDocument } from 'pdf-lib';

const pdf1Bytes = await file1.arrayBuffer();
const pdf2Bytes = await file2.arrayBuffer();

const pdf1 = await PDFDocument.load(pdf1Bytes);
const pdf2 = await PDFDocument.load(pdf2Bytes);

const merged = await PDFDocument.create();
const pages1 = await merged.copyPages(pdf1, pdf1.getPageIndices());
pages1.forEach(page => merged.addPage(page));
const pages2 = await merged.copyPages(pdf2, pdf2.getPageIndices());
pages2.forEach(page => merged.addPage(page));

const mergedBytes = await merged.save();
// Trigger download...
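
One way to fill in the download step, sketched on the assumption that the code runs in a browser page: wrap the saved bytes in a Blob and click a temporary link. The helper names pdfBytesToBlob and downloadBlob are illustrative and not part of pdf-lib.

```javascript
// Wrap PDF bytes (e.g. the Uint8Array returned by merged.save()) in a Blob.
function pdfBytesToBlob(bytes) {
  return new Blob([bytes], { type: 'application/pdf' });
}

// Browser-only: offer the Blob as a file download via a temporary link.
function downloadBlob(blob, filename) {
  const url = URL.createObjectURL(blob);
  const link = document.createElement('a');
  link.href = url;
  link.download = filename;
  document.body.appendChild(link);
  link.click();
  link.remove();
  // Release the object URL on a later tick, after the click has been handled.
  setTimeout(() => URL.revokeObjectURL(url), 0);
}
```

For the merge example, this would be called as downloadBlob(pdfBytesToBlob(mergedBytes), 'merged.pdf'), where the file name is a placeholder.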

Key capabilities of pdf-lib:

Create new PDF documents from scratch
Merge multiple PDFs into one
Copy specific pages between documents
Embed images (JPEG, PNG) into pages
Fill and read PDF form fields
Add metadata (title, author, keywords)
Set page sizes and rotations
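
Copying specific pages is also the basis for splitting a document. Because copyPages takes zero-based page indices, a user-facing range such as "1-3,5" has to be converted first. The parser below is a hypothetical helper (parsePageRange is not part of pdf-lib), sketched on the assumption that users enter 1-based page numbers separated by commas and hyphens.

```javascript
// Hypothetical helper: turn "1-3,5" into zero-based indices [0, 1, 2, 4].
// Malformed parts are skipped and ranges are clamped to the document length.
function parsePageRange(spec, pageCount) {
  const indices = [];
  for (const part of spec.split(',')) {
    const match = part.trim().match(/^(\d+)(?:-(\d+))?$/);
    if (!match) continue;
    const start = Math.max(Number(match[1]), 1);
    const end = Math.min(
      match[2] === undefined ? Number(match[1]) : Number(match[2]),
      pageCount
    );
    for (let page = start; page <= end; page++) {
      indices.push(page - 1); // convert 1-based page number to 0-based index
    }
  }
  return indices;
}
```

The resulting array can be passed to copyPages in place of getPageIndices() to copy only the selected pages into a new document.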

Performance Considerations

Client-side PDF processing runs in the browser's JavaScript engine, which introduces performance constraints:

Memory: A 100-page PDF might consume 200-500MB of memory when fully loaded. For large documents, process pages sequentially rather than loading everything at once.
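
Sequential processing can be sketched as a loop that loads one page, hands it to a callback, and releases it before moving on. The function name forEachPage is an assumption for this example; it expects a loaded pdf.js document (with numPages and getPage) and calls the page's cleanup() method when each page is done.

```javascript
// Visit pages one at a time so only one page's data is held at once.
// `pdf` is assumed to be a loaded pdf.js document (numPages, getPage).
async function forEachPage(pdf, handlePage) {
  for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
    const page = await pdf.getPage(pageNum);
    try {
      await handlePage(page, pageNum);
    } finally {
      // Ask pdf.js to release resources cached for this page.
      page.cleanup();
    }
  }
}
```

The callback might render the page to a canvas or extract its text; because each iteration awaits the previous one, pages are never processed in parallel.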

Web Workers: pdf.js runs its parsing and content-stream interpretation in a Web Worker, which keeps the most CPU-intensive work off the main thread; only the final drawing onto the Canvas happens on the main thread. This helps prevent the UI from freezing during heavy operations:

pdfjsLib.GlobalWorkerOptions.workerSrc = '/pdf.worker.min.js';

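Canvas size limits: Browsers impose limits on canvas dimensions. The exact limits vary by browser and device; 16384 pixels per side is a common ceiling, and limits on total area can be lower, especially on mobile. High-DPI rendering of large pages may exceed these limits, so reduce the render scale when the computed canvas size gets too large.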
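
A minimal sketch of scaling appropriately: cap the requested scale so the longest canvas side stays under a chosen maximum. The helper name clampScale and the default of 16384 are assumptions for illustration; the real limit depends on the browser.

```javascript
// Hypothetical helper: reduce a requested scale so the canvas stays within
// a maximum side length. `maxSide` is an assumed limit, not a browser constant.
function clampScale(pageWidth, pageHeight, requestedScale, maxSide = 16384) {
  const longestSide = Math.max(pageWidth, pageHeight);
  const maxScale = maxSide / longestSide;
  return Math.min(requestedScale, maxScale);
}
```

The page dimensions can be taken from page.getViewport({ scale: 1 }), and the returned value used as the scale for the viewport that is actually rendered.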

File size handling: Use the File API and ArrayBuffers for efficient binary data handling. Avoid converting entire files to Base64 strings, which inflates the data by roughly 33% and keeps a second copy of it in memory.
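
The 33% figure follows from how Base64 works: every 3 input bytes become 4 output characters. A short calculation, with base64Length as an illustrative helper name:

```javascript
// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

const fiftyMB = 50 * 1024 * 1024;
// Fractional size increase for a 50MB file, roughly one third.
const overhead = base64Length(fiftyMB) / fiftyMB - 1;
```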

Text Extraction: Making PDFs Searchable

Extracting text from PDFs is critical for search, accessibility, and content processing:

const page = await pdf.getPage(pageNum);
const textContent = await page.getTextContent();
const text = textContent.items
  .map(item => item.str)
  .join(' ');

However, text extraction has limitations:

Scanned PDFs: If the PDF is a scanned image (no embedded text layer), text extraction returns nothing. You need OCR (Optical Character Recognition) for scanned documents.
Complex layouts: Multi-column layouts, tables, and rotated text may extract in unexpected order.
Custom fonts: Some PDFs use custom font encodings that may produce garbled text output.
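
For simple single-column pages, reading order can often be improved by using the position that pdf.js attaches to each text item. The sketch below groups items into lines by their vertical position and sorts each line left to right. It assumes each item has a str and a transform array whose entries 4 and 5 are the x and y of the text origin; itemsToLines and its tolerance parameter are hypothetical, and the approach does not handle multi-column or rotated text.

```javascript
// Rebuild lines from pdf.js text items using their positions.
// transform[4] is x and transform[5] is y in PDF units (y grows upward).
function itemsToLines(items, yTolerance = 2) {
  // filter() returns a new array, so sorting does not mutate the input.
  const byY = items
    .filter(item => item.str.trim() !== '')
    .sort((a, b) => b.transform[5] - a.transform[5]);

  const lines = [];
  for (const item of byY) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.transform[5]) <= yTolerance) {
      last.items.push(item); // same line as the previous item
    } else {
      lines.push({ y: item.transform[5], items: [item] });
    }
  }

  // Within each line, order items left to right and join their strings.
  return lines.map(line =>
    line.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map(item => item.str)
      .join(' ')
  );
}
```

Calling itemsToLines(textContent.items) returns an array of line strings; an empty array for a page that visibly contains text is a hint that the page is a scanned image and needs OCR.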

Security Considerations

Even with client-side processing, security matters:

Content Security Policy (CSP): pdf.js uses Web Workers and dynamically creates elements. Ensure the worker-src and script-src directives of your CSP allow the origin that serves the pdf.js worker script.
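
As a rough illustration, the snippet below builds a policy string for the case where the application and the pdf.js worker script are served from the same origin. That same-origin setup is an assumption, and a real policy will likely need more directives than the three shown.

```javascript
// Example policy, assuming the app and pdf.worker.min.js share one origin.
const cspDirectives = {
  'default-src': ["'self'"],
  'script-src': ["'self'"],
  'worker-src': ["'self'"],
};

// Serialize to a Content-Security-Policy header value.
const cspHeader = Object.entries(cspDirectives)
  .map(([name, sources]) => `${name} ${sources.join(' ')}`)
  .join('; ');
```

The resulting string would be sent by the server as the value of the Content-Security-Policy response header.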

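Malicious PDFs: PDF is a complex format that has historically been used to exploit vulnerabilities. Because pdf.js is written in JavaScript, it runs inside the browser's own sandbox, which limits what a malicious file can reach, but parser bugs are still possible. Keep the library up to date and process untrusted PDFs in an isolated context, such as a sandboxed iframe or a separate origin.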

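Encrypted PDFs: pdf.js can open password-protected PDFs when you supply the correct password through the password option of getDocument. The decryption happens in-browser using the PDF specification's encryption algorithms (RC4 or AES). pdf-lib, by contrast, does not decrypt encrypted documents: PDFDocument.load rejects them by default, and its ignoreEncryption option only skips that check without decrypting the content, so modifying such files may not produce usable output.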

The Future of Client-Side PDF

WebAssembly (Wasm) is opening new possibilities for client-side PDF processing. Libraries originally written in C/C++ (like Poppler and MuPDF) are being compiled to Wasm, bringing near-native performance to the browser. As a result, heavier operations that have traditionally been run on a server, such as OCR, advanced text extraction, and form flattening, are becoming practical in the browser.

Summary

Modern JavaScript libraries make it possible to render, create, merge, split, and analyze PDF documents entirely in the browser. pdf.js handles rendering and text extraction, while pdf-lib handles structural manipulation. This approach provides superior privacy (files never leave the device), eliminates server costs, and works offline. As WebAssembly matures, even more advanced PDF operations will become client-side first.