This was previously attempted in PR 13371, but had to be reverted because of issues related to SystemJS (which has since been removed).
Also, while unrelated, shortens an existing conditional assignment.
Given that there are multiple issues with `ImageDecoder` in Chromium browsers, affecting both BMP and JPEG images, for now we (by default) disable that functionality there to avoid problems.
This also means that we can remove the previously added, and separate, `isChrome` API-option.
This allows end-users to forcibly disable `ImageDecoder` usage, even if the browser appears to support it (similar to the pre-existing option for `OffscreenCanvas`).
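A rough sketch of how this could look from the API side, assuming the new option is exposed as a `getDocument` parameter named e.g. `isImageDecoderSupported` (the exact name is an assumption here), mirroring the existing `isOffscreenCanvasSupported` parameter:

```js
import { getDocument } from "pdfjs-dist";

// Assumed parameter name; forcibly disable `ImageDecoder` usage even if the
// browser appears to support it.
const loadingTask = getDocument({
  url: "document.pdf",
  isImageDecoderSupported: false,
});
const pdfDocument = await loadingTask.promise;
```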
After the binary CMap format had been added there were also some ideas about *maybe* providing other formats, see [here](https://github.com/mozilla/pdf.js/pull/8064#issuecomment-279730182), however that was over seven years ago and we still only use binary CMaps.
Hence it now seems reasonable to simplify the relevant code by removing `CMapCompressionType` and instead just use a boolean to indicate the type of the built-in CMaps.
Given that we've not shipped, nor used, anything except binary CMaps for years let's just cache them unconditionally (since that's a tiny bit less code).
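As a rough sketch of what this means for a custom CMap reader-factory (the property name is an assumption here), the fetched result shrinks from an enum value to a boolean:

```js
// Illustrative custom reader-factory; previously the result would have been
// `{ cMapData, compressionType: CMapCompressionType.BINARY }`.
class MyCMapReaderFactory {
  constructor({ baseUrl }) {
    this.baseUrl = baseUrl;
  }

  async fetch({ name }) {
    const url = `${this.baseUrl}${name}.bcmap`;
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Unable to load binary CMap at: ${url}`);
    }
    const cMapData = new Uint8Array(await response.arrayBuffer());
    // A plain boolean now suffices, since only binary CMaps are ever used.
    return { cMapData, isCompressed: true };
  }
}
```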
The PDF document is clearly corrupt, since it has /FontFile2 entries that are Dictionaries, which obviously isn't correct.
While there's obviously no guarantee that things will look perfect this way, actually rendering the text at all should be an improvement in general.
Add unit test to check compatibility with such cmaps.
In the PDF in issue 18099, the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of Unicode code points is:
`<start_char_code> <end_char_code> <start_unicode_codepoint>`
As the Unicode code points are supposed to be given in UTF-16 BE, the PDF's line should probably have read:
`<00> <7F> <0000>`
Instead, it omitted two leading zeros from the UTF-16 value, like this:
`<00> <7F> <00>`
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning Latin text in the PDF into Chinese when it was copied.
I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does, and other PDF readers read it correctly, PDF.js should probably support this.
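A minimal, purely illustrative sketch of the interpretation this implies (not the actual PDF.js patch): pad a too-short destination entry to whole UTF-16 code units before mapping the range.

```js
// "00" becomes "0000", so `<00> <7F> <00>` maps 0x00..0x7F to U+0000..U+007F
// instead of U+0000, U+0100, U+0200, ...
function padDestination(hex) {
  return hex.padStart(Math.ceil(hex.length / 4) * 4, "0");
}

console.log(padDestination("00")); // "0000"
console.log(padDestination("0041")); // "0041" (already a full code unit)
```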
Note that the referenced file is trivially corrupt, since it contains *two* PDF documents placed in the same file which doesn't make sense (and isn't how a PDF document should be updated).
However it's still a good idea to ensure that `loadFont` is able to handle errors when resolving References, since that allows us to invoke the existing fallback font handling.
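A hypothetical sketch of the idea (names and structure are illustrative, not the actual `loadFont` code): catching the lookup error means the existing fallback path can run instead of the whole operator-list parsing failing.

```js
// Illustrative only: resolve the font Reference defensively, so a broken
// Reference triggers the fallback font instead of an unhandled error.
let fontDict = null;
try {
  fontDict = xref.fetchIfRef(fontRef);
} catch (ex) {
  warn(`loadFont - failed to resolve the font Reference: "${ex}".`);
}
if (!fontDict) {
  return useFallbackFont(); // Hypothetical helper for the existing fallback.
}
```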
For images that failed to decode once, we want to avoid a pointless round-trip to the main-thread, which could otherwise happen for globally cached images.
- These changes will allow a simpler way of implementing PR 17770.
- The /Lang attribute is fetched lazily, with the first `getTextContent` invocation. Given the existing worker-thread caching, this will thus only need to be done *once* per PDF document (and most PDFs don't include this data).
- This makes the /Lang attribute *directly available* in the `textLayer`, which has the following advantages (see the sketch after this list):
- We don't need to block, and thus delay, overall viewer initialization on fetching it (nor pass it around throughout the viewer).
- Third-party users of the `textLayer` will automatically benefit from this, once we start actually using the /Lang attribute in PR 17770.
*Please note:* This also, importantly, means that the `text` reference-tests will then cover this code (which wouldn't otherwise have been the case).
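A rough sketch of what this could enable on the consumer side, assuming the /Lang value ends up exposed on the `getTextContent` result (the `lang` property name and the `textLayerDiv` element are assumptions here):

```js
const page = await pdfDocument.getPage(1);
const textContent = await page.getTextContent();

// If the document declared a /Lang attribute, propagate it to the text-layer
// container so the browser can apply language-appropriate rendering.
if (textContent.lang) {
  textLayerDiv.setAttribute("lang", textContent.lang);
}
```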
In PR 17428 this functionality was limited to "larger" images, to not affect performance negatively. However it turns out that it's also beneficial to consider more "complex" images, regardless of their size, that contain /SMask or /Mask data; see issue 11518.
This replaces our custom `PromiseCapability`-class with the new native `Promise.withResolvers()` functionality, which does *almost* the same thing[1]; please see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers
The only difference is that `PromiseCapability` also had a `settled`-getter, which was however not widely used and the call-sites can either be removed or re-factored to avoid it. In particular:
- In `src/display/api.js` we can tweak the `PDFObjects`-class to use a "special" initial data-value and just compare against that, in order to replace the `settled`-state (see the sketch below).
- In `web/app.js` we change the only case to manually track the `settled`-state, which should hopefully be OK given how this is being used.
- In `web/pdf_outline_viewer.js` we can remove the `settled`-checks, since the code should work just fine without it. The only thing that could potentially happen is that we try to `resolve` a Promise multiple times, which is however *not* a problem since the value of a Promise cannot be changed once fulfilled or rejected.
- In `web/pdf_viewer.js` we can remove the `settled`-checks, since the code should work fine without them:
- For the `_onePageRenderedCapability` case the `settled`-check is used in an `EventBus`-listener which is *removed* on its first (valid) invocation.
- For the `_pagesCapability` case the `settled`-check is used in a print-related helper that works just fine with "only" the other checks.
- In `test/unit/api_spec.js` we can change the few relevant cases to manually track the `settled`-state, since this is both simple and *test-only* code.
---
[1] In browsers/environments that lack native support, note [the compatibility data](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise/withResolvers#browser_compatibility), it'll be polyfilled via the `core-js` library (but only in `legacy` builds).
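For reference, a small sketch of both pieces: native `Promise.withResolvers()` itself, and the kind of "special" initial data-value comparison that can stand in for the old `settled`-getter in a `PDFObjects`-like class (illustrative, not the actual implementation):

```js
// Native replacement for the old `PromiseCapability`-class.
const { promise, resolve, reject } = Promise.withResolvers();

// Illustrative stand-in for the removed `settled`-getter: compare the stored
// data against a unique sentinel that no real value can ever equal.
const INITIAL_DATA = Symbol("INITIAL_DATA");

class PDFObjectsSketch {
  #objs = Object.create(null);

  #ensureObj(objId) {
    return (this.#objs[objId] ||= {
      ...Promise.withResolvers(),
      data: INITIAL_DATA,
    });
  }

  has(objId) {
    const obj = this.#objs[objId];
    return !!obj && obj.data !== INITIAL_DATA; // Replaces the `settled`-check.
  }

  resolve(objId, data = null) {
    const obj = this.#ensureObj(objId);
    obj.data = data;
    obj.resolve(data);
  }
}
```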
This should be a *tiny* bit more efficient, since it avoids parsing substrings that we don't care about.
*Please note:* I cannot find an ESLint rule to enforce this automatically.
It isn't really a fix for the mentioned bug, but it slightly improves things.
By reducing the memory use, the time spent in the GC is reduced as well.
The algorithm to compute the bounding box is the same as before, but it has just
been rewritten to be more efficient.
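As a generic illustration of this kind of rewrite (not the actual patch): computing the min/max in a single pass, rather than building temporary arrays and spreading them into `Math.min`/`Math.max`, gives the same bounding box while avoiding the short-lived allocations that the GC would otherwise have to collect.

```js
// Before (illustrative): allocates two temporary arrays on every call.
function bboxSlow(points /* Array of {x, y} */) {
  const xs = points.map(p => p.x);
  const ys = points.map(p => p.y);
  return [Math.min(...xs), Math.min(...ys), Math.max(...xs), Math.max(...ys)];
}

// After (illustrative): same result, computed in place without extra arrays.
function bboxFast(points) {
  let minX = Infinity,
    minY = Infinity,
    maxX = -Infinity,
    maxY = -Infinity;
  for (const { x, y } of points) {
    if (x < minX) minX = x;
    if (x > maxX) maxX = x;
    if (y < minY) minY = y;
    if (y > maxY) maxY = y;
  }
  return [minX, minY, maxX, maxY];
}
```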
This manually ignores some cases where the resulting auto-formatting would not, as far as I'm concerned, constitute a readability improvement or where we'd just end up with more overall indentation.
Please see https://eslint.org/docs/latest/rules/arrow-body-style
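For context, the rule (with its default `as-needed` option) rewrites arrow functions whose body is a single `return` statement into the concise form, and individual call-sites can still opt out with an inline disable comment:

```js
// Flagged by `arrow-body-style`: the block body and `return` are unnecessary.
const getWidth = page => {
  return page.view[2] - page.view[0];
};

// The auto-fixed, concise form.
const getWidthConcise = page => page.view[2] - page.view[0];

// Opting out where the concise form wouldn't improve readability.
// eslint-disable-next-line arrow-body-style
const getViewport = page => {
  return page.getViewport({ scale: 1.0 });
};
```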