I have been working on a project for the last few days that deals with rendering PDFs in-browser. Initially, I was going to parse the PDF and extract the text content, but then I ran into pdf.js, a library developed by Mozilla for rendering PDFs in-browser via JavaScript. The project has a requirement that users should be able to select text within the PDF, and this is possible using pdf.js. Unfortunately, the example code only shows you how to render a PDF, not how to enable text-selection, and I wasn't able to find any API for enabling text-selection either.
I finally ended up on the #pdfjs IRC channel, and the friendly folks there gave me some direction. The logic for enabling text-selection was buried inside the code for Mozilla's PDF viewer and was heavily intertwined with the rest of the viewer code. I spent a few days playing around with the viewer and tracing through the code. I was stumped many times, since the code is complex and I know jack about parsing PDFs, but eventually I was able to home in on the part of the code that actually takes care of enabling text-selection.
pdf.js' approach to enabling text-selection is actually quite clever. The library overlays divs on top of the rendered PDF, and each div contains text that matches the PDF text it floats over. So when you select text, you are actually selecting the text inside the overlaid divs. This was fine and dandy, but I was still stuck as far as getting it to work in my project. What I needed was a minimal example that I could adapt for my uses.
After a day or two of tracing code, experimenting, debugging, and staring at the screen in frustration, I was eventually able to come up with one! To accomplish this, I extracted the code relevant to creating the overlays out of the viewer into its own independent file, and removed a lot of code that depended on the viewer itself. Keep in mind that this example doesn't have functionality like text finding or matching; that code is also heavily intertwined with the viewer code. All this example does is render a PDF with text-selection enabled. However, I think this is a good start!
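To make the overlay idea described above a bit more concrete, here is a minimal sketch of the kind of styling involved. This is not pdf.js's actual implementation: `overlayStyleFor`, `addOverlayDiv`, and the shape of the `item` argument (`str`, plus `left`, `top`, and `fontSize` in CSS pixels) are hypothetical names made up for illustration. The important bits are that each div is absolutely positioned over the matching text on the canvas and that its text is transparent, so you see the rendered canvas while the browser selects the invisible text.

```javascript
// Hypothetical helper (not part of pdf.js): builds the inline styles for one
// overlay div from an assumed { str, left, top, fontSize } item.
function overlayStyleFor(item) {
    return {
        position: "absolute",           // placed relative to the text layer container
        left: item.left + "px",
        top: item.top + "px",
        fontSize: item.fontSize + "px",
        color: "transparent",           // the text is selectable but not visible
        whiteSpace: "pre"               // keep the spacing of the original text run
    };
}

// Hypothetical helper: in a browser, creates the div, styles it, and adds it
// to the container that sits on top of the canvas.
function addOverlayDiv(container, item) {
    var div = document.createElement("div");
    div.textContent = item.str;
    var style = overlayStyleFor(item);
    for (var prop in style) {
        if (Object.prototype.hasOwnProperty.call(style, prop)) {
            div.style[prop] = style[prop];
        }
    }
    container.appendChild(div);
    return div;
}
```

For the absolute positioning to line up, the container itself needs to be positioned (for example `position: absolute` or `position: relative`) and sized to match the canvas.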
If you are interested, you can check out the code on GitHub and a working example on this fiddle.
The pertinent code is as follows (keep in mind that you still require additional resources; all of that information is available on GitHub):
window.onload = function () {
    var pdfBase64 = "..."; //base64 representing the PDF
    var scale = 1.5; //Set this to whatever you want. This is basically the "zoom" factor for the PDF.

    /**
     * Converts a base64 string into a Uint8Array
     */
    function base64ToUint8Array(base64) {
        var raw = atob(base64); //This is a native function that decodes a base64-encoded string.
        var uint8Array = new Uint8Array(raw.length);
        for (var i = 0; i < raw.length; i++) {
            uint8Array[i] = raw.charCodeAt(i);
        }
        return uint8Array;
    }

    function loadPdf(pdfData) {
        //Not using web workers. Not disabling them results in an error. This line is
        //missing in the example code for rendering a PDF.
        PDFJS.disableWorker = true;
        var pdf = PDFJS.getDocument(pdfData);
        pdf.then(renderPdf);
    }

    function renderPdf(pdf) {
        pdf.getPage(1).then(renderPage);
    }

    function renderPage(page) {
        var viewport = page.getViewport(scale);
        var $canvas = jQuery("<canvas></canvas>");

        //Set the canvas height and width to the height and width of the viewport
        var canvas = $canvas.get(0);
        var context = canvas.getContext("2d");
        canvas.height = viewport.height;
        canvas.width = viewport.width;

        //Append the canvas to the pdf container div
        var $pdfContainer = jQuery("#pdfContainer");
        $pdfContainer.css("height", canvas.height + "px").css("width", canvas.width + "px");
        $pdfContainer.append($canvas);

        //Create the text layer div and position it directly over the canvas. This has to happen
        //before the HiDPI block below, which refers to $textLayerDiv.
        var canvasOffset = $canvas.offset();
        var $textLayerDiv = jQuery("<div />")
            .addClass("textLayer")
            .css("height", viewport.height + "px")
            .css("width", viewport.width + "px")
            .offset({
                top: canvasOffset.top,
                left: canvasOffset.left
            });
        $pdfContainer.append($textLayerDiv);

        //The following few lines of code set up scaling on the context if we are on a HiDPI display
        var outputScale = getOutputScale();
        if (outputScale.scaled) {
            var cssScale = 'scale(' + (1 / outputScale.sx) + ', ' +
                (1 / outputScale.sy) + ')';
            CustomStyle.setProp('transform', canvas, cssScale);
            CustomStyle.setProp('transformOrigin', canvas, '0% 0%');
            CustomStyle.setProp('transform', $textLayerDiv.get(0), cssScale);
            CustomStyle.setProp('transformOrigin', $textLayerDiv.get(0), '0% 0%');
        }
        context._scaleX = outputScale.sx;
        context._scaleY = outputScale.sy;
        if (outputScale.scaled) {
            context.scale(outputScale.sx, outputScale.sy);
        }

        page.getTextContent().then(function (textContent) {
            //The second argument (0) is an index identifying the page. It is set to page.number - 1.
            var textLayer = new TextLayerBuilder($textLayerDiv.get(0), 0);
            textLayer.setTextContent(textContent);

            var renderContext = {
                canvasContext: context,
                viewport: viewport,
                textLayer: textLayer
            };
            page.render(renderContext);
        });
    }

    var pdfData = base64ToUint8Array(pdfBase64);
    loadPdf(pdfData);
};
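The listing calls `getOutputScale()`, which is not defined in the snippet itself; it belongs to the additional resources mentioned earlier. If you just want a rough idea of what such a helper does, the following is a sketch under my own assumptions rather than the viewer's exact code: it reads `devicePixelRatio` and reports whether any scaling is needed. The optional `win` parameter is a hypothetical addition so that the function can be exercised outside a browser.

```javascript
// A sketch of a getOutputScale-style helper (an assumption, not the viewer's exact code).
// "win" is a hypothetical optional parameter; in a browser it defaults to the global window.
function getOutputScale(win) {
    win = win || (typeof window !== "undefined" ? window : {});
    // devicePixelRatio is 1 on regular displays and greater than 1 on HiDPI displays.
    var pixelRatio = win.devicePixelRatio || 1;
    return {
        sx: pixelRatio,
        sy: pixelRatio,
        scaled: pixelRatio !== 1
    };
}
```

With a stand-in object such as `{ devicePixelRatio: 2 }`, the result has `sx` and `sy` set to 2 and `scaled` set to true, which is the case where the listing applies the CSS transform and scales the canvas context.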
Very helpful. Thank you.
wow i have been trying to figure out how to do this for a while. thank you!!!!!
Any idea how to add annotations in pdf.js?
Hi,
Can you tell me how you would extract the text content? I need to parse a PDF in JavaScript without rendering it.
Thanks,
Nikhil
Hi Vivin,
Great work on this. I can get this all working with v0.8.223, but as that version has issues with IE, I’m trying to get it working with the latest (v1.0.1118).
Have you since been able to get this to work with newer versions of PDFjs? It extracts the text properly, but never ends up populating the text layer.
Hi, your fiddle example doesn’t work in Safari (5.1.7) on Windows. The text is displayed like ‘XXXXXXXXXXX..etc’. Do you have any workaround?
where is the pdf file
How to read page from pdf file and implement text selection ?
Thanks, it saved my life.
Hi Vivin:
This is very helpful. However, not sure if you know, the working example on fiddle does not render on Chrome or Safari on OSX at the time of writing.
Where is the function getOutputScale() defined?
It’s not working in IE.