This is a precursor to moving the call into a
worker thread to let us use `OffscreenCanvas`. The
current position wouldn't work since we make
transformations to the canvas object after the
getContext call, which isn't allowed for
OffscreenCanvas. Also it isn't allowed to clone or
`transferControlToOffscreen` the canvas after the
`getContext` call.
This allows us to simply invoke `PDFWorker.create` unconditionally from the `getDocument` function, without having to manually check if a global `workerPort` is available first.
Currently *some* of the links[1] on page three of the `issue19835.pdf` test-case aren't clickable, since the destination (of the LinkAnnotation) becomes empty.
The reason is that these destinations include the character `\x1b`, which is interpreted as the start of a Unicode escape sequence specifying the language of the string; please refer to section [7.9.2.2 Text String Type](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G6.1957385) in the PDF specification.
Hence it seems that we need a way to optionally disable that behaviour, to avoid a "badly" formatted string from becoming empty (or truncated), at least for cases where we are:
- Parsing named destinations[2] and URLs.
- Handling "strings" that are actually /Name-instances.
- Building a lookup Object/Map based on some PDF data-structure.
*NOTE:* The issue that prompted this patch is obviously related to destinations, however I've gone through the `src/core/` folder and updated various other `stringToPDFString` call-sites that (directly or indirectly) fit the categories listed above.
---
[1] Try clicking on anything on the line containing "Item 7A. Quantitative and Qualitative Disclosures About Market Risk 27".
[2] Unfortunately just skipping `stringToPDFString` in this case would cause other issues, such as the named destination becoming "unusable" in the viewer; see e.g. issues 14847 and 14864.
In the referenced PDF document the keys, in the /Dests dictionary, need to account for PDFDocEncoding.
To improve destination handling in general we'll now unconditionally fallback to always checking all destinations.
With this patch, all the paths components are collected in the worker until a path
operation is met (i.e., stroke, fill, ...).
Then in the canvas a Path2D is created and will replace the path data transfered from the worker,
this way when rescaling, the Path2D can be reused.
In term of performances, using Path2D is very slightly improving speed when scaling the canvas.
The new API-functionality will allow a PDF document to be downloaded in the viewer e.g. while the PasswordPrompt is open, or in cases when document initialization failed.
Normally the raw data of the PDF document would be accessed via the `PDFDocumentProxy.prototype.getData` method, however in these cases the `PDFDocumentProxy`-instance isn't available.
Currently we modify the EXIF-block in place, which may end up "breaking" the JPEG-data of the original PDF document since e.g. saving it from the viewer no longer contains the real EXIF-block.
Hence the EXIF-block replacement is moved into the `JpegStream` class, such that we can copy the data before doing the replacement.
Currently this rule is disabled in a number of spots across the code-base, and unless absolutely necessary we probably shouldn't disable linting, so let's just update the code to fix all the outstanding cases.
- Most of the these are only used in the `src/display/api.js` file, and this leads to slightly shorter code.
- A number of unit-tests need a `BaseCanvasFactory`-instance, however that one is available through the `PDFDocumentProxy`-instance nowadays.
- For other unit-tests the remaining necessary default Factory-definitions can be moved into the `test/unit/test_utils.js` file.
These old exceptions have a fair amount of overlap given how/where they are being used, which is likely because they were introduced at different points in time, hence we can shorten and simplify the code by replacing them with a more general `ResponseException` instead.
Besides an error message, the new `ResponseException` instances also include:
- A numeric `status` field containing the server response status, similar to the old `UnexpectedResponseException`.
- A boolean `missing` field, to allow easily detecting the situations where `MissingPDFException` was previously thrown.
The `/Root/AcroForm/Fields` array contains a "ridiculous" number of LinkAnnotations, which obviously makes no sense since those are not form fields.
To improve performance we'll thus ignore those when collecting the field objects.
Some tests rely on the presence of a server that serves PDF files.
When tests are run from a web browser, the test files and PDF files are
served by the same server (WebServer), but in Node.js that server is not
around.
Currently, the tests that depend on it start a minimal Node.js server
that re-implements part of the functionality from WebServer.
To avoid code duplication when tests depend on more complex behaviors,
this patch replaces createTemporaryNodeServer with the existing
WebServer, wrapped in a new test utility that has the same interface in
Node.js and non-Node.js environments (=TestPdfsServer).
This patch has been tested by running the refactored tests in the
following three configurations:
1. From the browser:
- http://localhost:8888/test/unit/unit_test.html?spec=api
- http://localhost:8888/test/unit/unit_test.html?spec=fetch_stream
2. Run specific tests directly with jasmine without legacy bundling:
`JASMINE_CONFIG_PATH=test/unit/clitests.json ./node_modules/.bin/jasmine --filter='^api|^fetch_stream'`
3. `gulp unittestcli`
test/unit/api_spec.js is the only JS file in the tree with trailing
whitespace. Because `trim_trailing_whitespace = true` in .editorconfig,
any editor supporting EditorConfig would trim whitespace when the file
is changed, which results in test failures.
This commit fixes the issue by trimming the trailing whitespace and
adjusting the test expectations.
The problem with the referenced PDF document has nothing to do with invalid dates, as the issue seems to suggest, but rather with the fact that it has neither an XRef table nor a trailer dictionary.
Given that crucial parts of the internal document structure is missing, you might argue that it's not really a PDF document.
In an attempt to support this kind of corruption, we'll simply iterate through all (previously found) XRef entries and pick one that *might* be a valid /Root dictionary.
There's obviously no guarantee that this works, and it might not be fast in larger PDF documents, but at least it cannot be any worse than *immediately* throwing `InvalidPDFException` as we previously did here.
*Please note:* I'm totally fine with this patch being rejected, since it's somewhat questionable if we should actually attempt to support "PDF documents" with this level of corruption.
This unifies the various factory-options, since it's consistent with `CMapReaderFactory`/`StandardFontDataFactory`, and ensures that any needed parameters will always be consistently provided when creating `CanvasFactory`/`FilterFactory`-instances.
As shown in the modified example this may simplify some custom implementations, since we now provide the ability to access the `CanvasFactory`-instance used with a particular `getDocument`-invocation.
The idea is to insert a span in the text layer with an aria-role set to img
and use the bounding box provided by the attribute field in the tag dict in
order to have non-null dimensions for the image to make it "visible".
This is possible thanks to features, i.e. private fields and in particular static initialization blocks, that didn't exist back when we started using classes in the code-base.
According to the PDF specification these destinations should have a zoom parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870
Hence we try to work-around bad PDF generators by making the zoom parameter optional when validating explicit destinations in both the worker and the viewer.
According to the PDF specification these destinations should have a coordinate parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870
Hence we try to work-around bad PDF generators by making the coordinate parameter optional when validating explicit destinations in both the worker and the viewer.
Add unit test to check compatability with such cmaps
In the PDF in issue 18099. the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF-16 BE, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF-16 like this
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning latin text in the PDF into chinese when it was copied
I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this
This unit test fails occasionally (albeit much less than before thanks
to PR #17663), so we change the parsing time check's divisor to prevent
it from happening again. If the last page's rendering time is less than
or equal to 50% of the first page's rendering time that should be enough
proof that no worker thread re-parsing occurred while also providing a
wide enough range to avoid intermittents.
Note that the assertion is now equal to the one we already have in the
"caches image resources at the document/page level, with main-thread
copying of complex images (issue 11518)" unit test which seems to work
reliably so far.
- Move the definition of the `loadingParams` Object, to simplify the code.
- Add a unit-test, since none existed and the viewer depends on this functionality.
For images that failed to decode once we want to avoid a pointless round-trip to the main-thread, which could otherwise happen for globally cached images.
- These changes will allow a simpler way of implementing PR 17770.
- The /Lang attribute is fetched lazily, with the first `getTextContent` invocation. Given the existing worker-thread caching, this will thus only need to be done *once* per PDF document (and most PDFs don't included this data).
- This makes the /Lang attribute *directly available* in the `textLayer`, which has the following advantages:
- We don't need to block, and thus delay, overall viewer initialization on fetching it (nor pass it around throughout the viewer).
- Third-party users of the `textLayer` will automatically benefit from this, once we start actually using the /Lang attribute in PR 17770.
*Please note:* This also, importantly, means that the `text` reference-tests will then cover this code (which wouldn't otherwise have been the case).
- Check that the `filename` is actually a string, before parsing it further.
- Use proper "shadowing" in the `filename` getter.
- Add a bit more validation of the data in `pickPlatformItem`.
- Last, but not least, return both the original `filename` and the (path stripped) variant needed in the display-layer and viewer.