Contributing to iText

Posted on Nov 30, 2015

Programming with PDFs

Recently I’ve been working on a coding project in my spare time where I stumbled upon a bug in an open source library I was using. As I’m a software developer who works on a few open source projects already I figured I should investigate and try to fix the problem myself and contribute back.

As some people can be wary of contributing to an established open source project I figured this post detailing what I went through may encourage others to take part too.

The project I was working on was written in Java and used a library called iText to parse a PDF file. The PDF file I was using had been made by someone else and wasn’t going to be subject to change by myself.

PDF details

As seen above the file was created using Adobe InDesign using the Adobe PDF Library to create the file for the PDF 1.7 specification. Using Adobe software makes me think that whatever the output it’s likely to conform to the PDF spec so any issues I was having would be with my own code or perhaps the PDF parsing library.

Background on the PDF format

The PDF spec, since version 1.3, includes support for page labels which allow you to label the pages with things other than just numbers. Page labels are defined in the PDF file as a “Number Tree” where you can define page ranges, numbering style (decimal digits, roman numerals, letters), text prefixes and adjust what number to start the range from. See section 12.4.2 in the PDF 1.7 specification if you want to know more, though be warned that the PDF spec can be quite heavy reading.

The PDF I was parsing was originally created to be published in book form and included an introductory preface, the book content itself and, nestled part way through another sub-book which also included its own introductory preface.

Pages diagram

I’ve attempted an illustration above of the general book layout (ranges not to scale) showing that the first four pages (preface) were roman numerals (i - iv), then the book started with pages labelled 1 - 414 (PDF pages 5 - 418) before the sub-book started with page prefixes of G with roman numerals again Gi - Gii (PDF pages 419 - 420). The sub book then goes to digits with the G prefix G1 - G22 (421 - 442) and then finally going back to the main book with just digits again picking up where it left off with 415 - 740 (PDF pages 443 - 768).

This page label information was included in the PDF copy of the book using the page labels number tree as talked about earlier.

Page labels

Above is an actual screenshot of the page labels dictionary from the PDF expanded in the iText RUPS tool which lets you inspect and explore PDF files from their definition. The \S values define the numbering style (\r being lower case roman numerals, \D being regular digits) and the \P values define the prefix (though ignore the first two characters displayed in RUPS, just the G is the actual prefix) and the \ST value defines which number to start from.

Issue

The project I’ve been working on needed a mapping of pages to page labels and thankfully for me iText already has a method PdfPageLabels.getPageLabels() which returns an array of strings for each PDF page.

Unfortunately for me when I used the function with my PDF it seemed the prefix (G) from the sub-book continued to be there once the sub-book finished. This meant PDF pages 442 and 443 were labelled G22 and G415 instead of G22 and 415.

All PDF viewers I tried, Adobe Acrobat Reader, Sumatra, PDF.js and Foxit showed the page labels correctly however and reading the PDF 1.7 spec it seemed like /P values defining the prefix would only define the prefix for the range /P is in.

The fix

Thankfully for me the iText function was static so it was easy for me to copy the whole function to my own code and test changes there instead of modifying the library directly.

After checking the PDF spec against the code it all seemed fine except for cases where the prefix wasn’t defined. All that needed to change was when the prefix isn’t defined for a given range, unset the prefix variable so that whatever value it may have been set to before doesn’t persist into the next range.

As the issue was fairly obscure I thought I should create a unit test to make sure that my fix worked and so that any future developers changing that area of the code could safely make changes without worrying about regression. As I didn’t own the distribution rights to the PDF I was using I wrote some more Java using iText to generate a test PDF for me.

To do the fix I followed the contributing guidelines that iText have defined and you can see my commit here: https://github.com/itext/itextpdf/commit/1d9eaf3b63ea9321154fcae9c67a8780fd9f0087

After a brief bit of fun with git branches I finally got my pull request accepted by the iText team and now everyone using iText 5.5.8 and above will enjoy accurate page labels.