Monday, March 4, 2024

CapyPDF 0.9.0 released

I have just released CapyPDF 0.9.0. It can be obtained either via Github or PyPI.

There is no major big feature for this release. The most notable is probably the ability to create structured (or "tagged") PDF files. The code supports using both the builtin tags as well as defining your own. Feel free to try it, just know that the API is guaranteed to change.

As a visual example, here is the full Python source code for one of the unit tests.

When run it creates a tagged PDF file.  Adobe Acrobat reports that the document has the following logical structure.

As you can (hopefully) tell, structure and content are the same in both of them.

Friday, February 23, 2024

Creating tagged PDFs with CapyPDF now sort of possible

There are many open source PDF generators available. Unfortunately they all have some limitations when it comes to generating tagged PDFs

  • Cairo does not support tagged PDFs at all
  • LaTeX can create tagged PDFs, but obviously only out of LaTeX documents
  • Scribus does not support tagged PDF
  • LibreOffice does support tagged PDF generation, but its code is not available as a standalone library, it can only be used via LibreOffice
  • HexaPDF does not support tagged PDFs (though they seem to be on the roadmap), but it is implemented in Ruby so most projects can't use it
  • Firefox can print pages to PDF, but the result is not tagged, even though the PDF document structure model is almost exactly the same as in HTML

There does not seem to be a library that provides for all of this with a "plain C" API that can be used to easily generated tagged PDFs using almost any programming language.

There still isn't, but at least now CapyPDF can generate simple tagged PDF documents. A sample document can be downloaded via this link. Here is a picture showing the document structure in Acrobat Pro.

It should also work in things like screen readers and other accessibility tools, but I have not tested it.

None of this is exposed in the C API, because this has a fairly large API surface and I have not yet come up with a good way to represent it.


Sunday, January 21, 2024

CapyPDF 0.8.0 released

Version 0.8.0 of the CapyPDF library has been released. The main new feature is support for form XObjects and printer's mark annotations.

Printer's marks are things like color bars, crop marks and registration marks (also known as "bullseye marks") that high end printers need for quality control. Here is a very simple document.

An experienced print operator would use the black lines to set up paper trimming. Traditionally these marks were drawn in the page's graphics stream. This is problematic because nowadays printers prefer to use their own custom marks instead of ones created by the document author. PDF solves this issue by moving these graphics operations to separate draw contexts (specifically "form XObjects", which are not actually forms, though they are XObjects) that can then be "glued" on top of the page. These annotations are shown in PDF viewer applications but they are not printed. I have no experience with high end RIP software, but presumably the operator can choose to either print the document's annotations or replace them with their custom QA marks.

As usual, to find out what features CapyPDF has and how to use them, look up either the public C header or the Python code used in unit tests.

Tuesday, January 2, 2024

C++ module tooling emulator playground

Developing tooling for C++ modules is challenging to say the least. Module implementation maturity in compilers varies, they all work slightly (well massively) differently, there are bugs and you also need a code base that uses modules. Because of these and other reasons there are maybe five people in the entire world who even think about this issue. This is bad, because it is supposed to be future foundational technology. It would benefit from more eyes.

Something ought to be done about this. So I did.

I created a fake "module only" C++ compiler, a fake linker, a fake module scanner and fake project generator. The compiler does not produce any code. It only does three things:

  1. Reads export statements from sources and writes corresponding fake module files
  2. Reads import statements from source files.
  3. Validates that all module files a source file imports exist on disk at the time of invocation
This did not take much effort. In fact the whole project is approximately 300 lines of Python and can be obtained from this repository.

With this it is possible for anyone interested to try to come up with a module building scheme that does not require generating O(N²) command line arguments on the fly. The current scanner implementation is taken almost directly from Meson and it works one build target at a time as opposed to a single source file at a time. With this approach the overhead of scanning is one process invocation per build target. A per-source approach would take two processes per source scanning: one for invoking the compiler to generate the standard JSON-format dependency file and another for converting that to the dyndep format that Ninja understands.

The setup assumes that the compiler supports a "write module files to this directory" command line argument. It is mandatory to avoid generating compiler arguments dynamically.

Or maybe it isn't and there is a way to make it work in some other way. At least now the tooling is available so anyone reading this can try to solve this problem.

Tuesday, December 26, 2023

Tagged PDF funsies

HTML was originally designed as a file format that merely contains the logical structure of a document. End users could format it in a way that was most suitable for them. For example people with reading disabilities could make the text bigger or even use a screen reader. As time went on web site developers wanted pixel perfect control over the layout on end user machines whether this made sense or not. This lead to inventing a side channel to control layout. Since HTML was not originally designed for visual design, this lead to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution so problems persist to this day.

PDF was originally designed as a file format for pixel perfect layout of graphics on every conceivable device. In this way people could be sure that their design was not randomly mangled along the way. As time went on people wanted to make PDF documents more broadly usable, for example to be able to copypaste text out of them and to expose the logical structure of the document to end users to the benefit of e.g. people with disabilities. This lead to inventing a side channel to describe structure but since PDF was not originally designed for semantic content, this lead to an impedance mismatch which caused a lot of work and headscratching to make it work. There is no "proper" solution so problems persist to this day.

Both of these formats also use JavaScript, but let's not go there.

In the case of PDF, the logical format is called "tagged PDF" and is implemented by writing magic tags and commands in the PDF stream to specify "marked content". A PDF generator also has to write many different dictionaries and arrays all of which have criss-cross-references to each other to make it work. Or that's my theory at least, since I have been unable to prove that CapyPDF's tagged PDF generation actually works. At best what can be said that no PDF processor I have used it with has reported errors.

Going through these lesser used parts of the file format teaches you quite quickly that the PDF specification varies wildly in quality. For example let's look at the aforementioned references. PDF has both dictionaries and arrays as native data types. Thus if you have to map arbitrary text keys to values you'd use a dictionary whereas mapping consecutive integers from zero upwards you'd use an array. Seems simple, right?

One of the data mappings needed for tagged PDF has gone beyond and reinvented this fairly simple structure. It has keys counting from zero upwards. Not only does the specification say that consecutive integers are needed, it even says that the PDF producer must write to a separate dictionary a key called ParentTreeNextKey. It goes on to say that when a new entry is added to this array (nee dictionary) then it must use the given key for the value. A more logical name for this would be ArraySize but that is not even the worst of it.

Said array is actually a key-value store where every other entry is the key (or "index") as an integer and every other entry is the corresponding value. Yes. this means that the contents of the array look like this: [ 0 "value0" 1 "value1" 2 "value2" ... ]. The actual values happen to also be index arrays, but they contain only values. In case you don't believe me, here is a screenshot from the official PDF spec.

Presumably the rationale is that you could leave some elements out from the array. A simpler approach would have been to store an empty array instead, but one should not meddle with the affairs of adobeans, for they are subtle and quick to anger.

Fortunately at least declaring a PDF as tagged is simple. There is a specific key in one of the metadata dictionaries and when that is set to true, the file is considered tagged.  Pdfinfo agrees with this assessment.

Good, good. Just to be sure, let's validate that it behaves the same on Acrobat Reader.

I wonder if I still have some leftover glögi?

Wednesday, December 20, 2023

Even more breakage in the C++ module world

In an earlier blog post we looked into some of the problems that the current C++ module implementation (specifically using Clang and CMake) has.  Yesterday it got a reply detailing a further, fairly serious problem.

In order to understand the issue, let's first look at the steps used in the current implementation:

  1. CMake runs, parses its input files and generates a Ninja build file
  2. As it is not possible (with the current implementation) to know the full command line up front, CMake adds a command line argument like @module_arg_file. Clang treats these kinds of arguments as "read the contents of the specified file and pretend that its contents was given as command line arguments".
  3. When the actual build starts, it first invokes CMake to run the module scanner, parse its output and generate dyndep info as well as the command line arguments for every single compilation task in the project.
  4. When this is done, compilation proceeds as usual.
The "works" in the sense that you can compile projects using it. But the problem is that there is also a file called compilation_commands.json, which is used by many C and C++ tools. It has all the same compilation commands as the Ninja file, just in an easy to parse JSON format. This means that if the corresponding module argument files have not been generated, all those compilation commands are broken.

Thus if you want to run any tooling that uses the compilation database, you first have to do a full build. Otherwise the compilation arguments are not guaranteed to be correct and the whole thing crashes like a house of cards. To make things even nastier, one of the main users of the compilation database is Clang's code completion functionality. If you check out a new project, you can not use code completion in your editor until you have successfully built the entire project. It's also not guaranteed to produce correct results if you have done any edits on said source files (because the list of modules imported/exported may have changed).

And if that was not bad enough, this is not a theoretical issue, bugs caused by this are already happening.

I repeat the suggestion made in the original post that this is not a good foundation to place the future of C++ modules on. We should move to a better one, such as the one outlined in the post.

Monday, December 18, 2023

AI silliness: getting to a no

One of the most annoying thing about LLM chatbots is that they all talk like american used car salespersons or people who work in sales departments of software companies. That is, the answer to every question is always yes. Somehow everything must be positive. This got me thinking: what would you  need to do to get a "no" answer from a chatbot.

Let's start simple.

Well that would have been too easy, I guess. Can we at least get it to give "no" as an answer?

Okay, let's refine this further.

Not only does the answer have a lot more words than one, it is actually incorrect. The correct answer is not false, it is "no". We'll ignore it for now and try to condense the answer.

No, I don't think that is the word I'm searching for and the answer is even longer than the previous one even though I asked for a shorter one. But no matter, let's be even more specific.

That is almost a single word answer but it is still not a "no". A change of tactics is needed. Rather than trying to come up with a working question on our own, we can ask the LLM to generate it for us.

Our poor AI overlord does not seem to have a good grasp on what is and is not a paradox. Or maybe it is trying to distract us from the fact that it can't give a negative reply. Let's give it a much easier question to answer instead.

At this point I gave up.