Beanz/pdf.js

Author	SHA1	Message	Date
Jonas Jenwald	2c0cc48d1b	Replace the `forEach` method in `Dict` with "proper" iteration support	2024-11-17 12:45:32 +01:00
Jonas Jenwald	805f962181	Reduce duplication when collecting optional content groups After PR 18825 we can easily "compute" the optional content groups, and can thus avoid tracking them manually.	2024-10-15 13:20:30 +02:00
Alexander Grahn	441efe456e	Optional Content (OC) radiobutton (RB) groups implemented. Resolves #18823. The code parses the /RBGroups entry in the OC configuration dict and adds the property `rbGroups` to instances of the OptionalContentGroup class. rbGroups takes an array of Sets, where each Set instance represents an RB group the OptionalContentGroup instance is a member of. Such a Set instance contains all OCG ids within the corresponding RB group. RB groups an OCG is associated with are processed when its visibility is set to true, as required by the PDF spec.	2024-10-15 11:34:45 +02:00
Richard Smith (smir)	a67b9aec6c	Send fetch requests for all page dict lookups in parallel - When adding page dict candidates to the lookup tree, also initiate fetching them from xref, so if they are not yet loaded at all, the XHR will be sent - Only at the top level - assume that if there is a /Pages tree, it is sensibly structured and the number of requests won't be too bad - We can then await on the cached Promise without making the requests pipeline - This has a significant performance improvement for load-on-demand (i.e. with auto-fetch turned off) when a PDF has a large number of pages in the top level /Pages collection, and those pages are spread through a file, so every candidate needs to be fetched separately - PDFs with many pages where each page is a big image and all the pages are at the top level are quite a common output for digitisation programmes - I would have liked to do something like "if it's the top level collection and page count = number of kids, then just fetch that page without traversing the tree" but unfortunately I agree with comments on #8088 that there is no good general solution to allow for /Pages nodes with empty /Kids arrays	2024-08-21 11:08:14 +01:00
Jonas Jenwald	d24a61c648	Allow /XYZ destinations without zoom parameter (issue 18408) According to the PDF specification these destinations should have a zoom parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870 Hence we try to work-around bad PDF generators by making the zoom parameter optional when validating explicit destinations in both the worker and the viewer.	2024-07-18 13:29:32 +02:00
Jonas Jenwald	403d023617	Allow e.g. /FitH destinations without additional parameter (bug 1907000) According to the PDF specification these destinations should have a coordinate parameter, which may however be `null`, but it shouldn't be omitted; please see https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#G11.2095870 Hence we try to work-around bad PDF generators by making the coordinate parameter optional when validating explicit destinations in both the worker and the viewer.	2024-07-11 10:36:44 +02:00
Jonas Jenwald	6d523c316c	[api-minor] Include the document /Lang attribute in the textContent-data - These changes will allow a simpler way of implementing PR 17770. - The /Lang attribute is fetched lazily, with the first `getTextContent` invocation. Given the existing worker-thread caching, this will thus only need to be done once per PDF document (and most PDFs don't include this data). - This makes the /Lang attribute directly available in the `textLayer`, which has the following advantages: - We don't need to block, and thus delay, overall viewer initialization on fetching it (nor pass it around throughout the viewer). - Third-party users of the `textLayer` will automatically benefit from this, once we start actually using the /Lang attribute in PR 17770. Please note: This also, importantly, means that the `text` reference-tests will then cover this code (which wouldn't otherwise have been the case).	2024-05-14 12:44:41 +02:00
Jonas Jenwald	52f7ff155d	Validate even more dictionary properties This checks primarily Arrays, but also some other properties, that we'll end up sending (sometimes indirectly) to the main-thread.	2024-05-03 22:37:14 +02:00
Jonas Jenwald	2b69fb76ac	[api-minor] Improve the `FileSpec` implementation - Check that the `filename` is actually a string, before parsing it further. - Use proper "shadowing" in the `filename` getter. - Add a bit more validation of the data in `pickPlatformItem`. - Last, but not least, return both the original `filename` and the (path stripped) variant needed in the display-layer and viewer.	2024-05-01 18:02:05 +02:00
Jonas Jenwald	7206d0a237	Validate explicit destinations on the worker-thread to prevent `DataCloneError` (issue 17981) Note: This borrows a helper function from the viewer, however the code cannot be directly shared since the worker-thread has access to various primitives.	2024-04-22 22:51:35 +02:00
Calixte Denizet	136c1faa7f	Display outlines even if one has no title Fixes #17856.	2024-03-29 21:30:24 +01:00
Jonas Jenwald	0d039937f9	Add better support for /Launch actions with /FileSpec dictionaries (issue 17846)	2024-03-26 20:15:48 +01:00
Jonas Jenwald	3c78ff5fb0	[api-minor] Implement basic support for OptionalContent `Usage` dicts (issue 5764, bug 1826783) The following are some highlights of this patch: - In the Worker we only extract a subset of the potential contents of the `Usage` dictionary, to avoid having to implement/test a bunch of code that'd be completely unused in the viewer. - In order to still allow the user to manually override the default visible layers in the viewer, the viewable/printable state is purposely not enforced during initialization in the `OptionalContentConfig` constructor. - Printing will now always use the default visible layers, rather than using the same state as the viewer (as was the case previously). This ensures that the printing-output will correctly take the `Usage` dictionary into account, and in practice toggling of visible layers rarely seems to be necessary except in the viewer itself (if at all).[1] --- [1] In the unlikely case that it'd ever be deemed necessary to support fine-grained control of optional content visibility during printing, some new (additional) UI would likely be needed to support that case.	2024-03-12 13:18:15 +01:00
Jonas Jenwald	f9a384d711	Enable the `arrow-body-style` ESLint rule This manually ignores some cases where the resulting auto-formatting would not, as far as I'm concerned, constitute a readability improvement or where we'd just end up with more overall indentation. Please see https://eslint.org/docs/latest/rules/arrow-body-style	2024-01-21 16:20:55 +01:00
Calixte Denizet	0c38c6e103	Improve performance of optional content parsing	2023-10-25 17:50:53 +02:00
Jonas Jenwald	bf9c33e60f	Add support for "GoToE" actions with destinations (issue 17056) This shouldn't be very common in practice, since "GoToE" actions themselves seem quite uncommon; see PR 15537.	2023-10-04 11:14:23 +02:00
Calixte Denizet	a8573d4e1b	[Editor] Add the ability to create/update the structure tree when saving a pdf containing newly added annotations (bug 1845087) When there is no tree, the tags for the new annotations are just put under the root element. When there is a tree, we insert the new tags at the right place using the value of structTreeParentId (added in PR #16916).	2023-09-16 18:34:58 +02:00
Jonas Jenwald	b5b061cdb6	Slightly re-factor the parameter handling in `Catalog.parseDestDictionary` While it makes sense to check that the `destDict` parameter is indeed a Dictionary, since that data comes from the PDF document itself, the `resultObj` parameter is an internal PDF.js implementation detail that should always be correct (or tests will fail).	2023-09-08 13:27:31 +02:00
Jonas Jenwald	df9cce39c0	Slightly reduce asynchronicity when parsing Annotations Over time the amount of "document level" data potentially needed during parsing of Annotations has increased a fair bit, which means that we currently need to ensure that a bunch of data is available for each individual Annotation. Given that this data is "constant" for a PDF document we can instead create (and cache) it lazily, only when needed, before starting to parse the Annotations on a page. This way the parsing of individual Annotations should become slightly less asynchronous, which really cannot hurt. An additional benefit of these changes is that we can reduce the number of parameters that need to be explicitly passed around in the annotation-code, which helps overall readability in my opinion. One potential drawback of these changes is that the `AnnotationFactory.create` method no longer handles "everything" on its own, however given how few call-sites there are I don't think that's too much of a problem.	2023-09-08 13:27:27 +02:00
Jonas Jenwald	64e8557fb5	[api-minor] Deprecate the `PDFDocumentProxy.getJavaScript` method This method is very old, however with the exception of the auto-print hack (when scripting is disabled) in the viewer it's never actually been used. Most likely the idea with `PDFDocumentProxy.getJavaScript` was that it'd be useful if scripting support was added, however it turned out that it was a bit too simplistic and instead a number of new methods were added for the scripting use-cases.	2023-08-01 09:02:05 +02:00
Jonas Jenwald	1b4a7c5965	Introduce more optional chaining in the `src/core/` folder After PR 12563 we're now free to use optional chaining in the worker-thread as well. (This patch also fixes one previously "missed" case in the `web/` folder.) For the MOZCENTRAL build-target this patch reduces the total bundle-size by `1.6` kilobytes.	2023-05-15 12:38:28 +02:00
Calixte Denizet	cfb908c999	Add a cache to avoid loading a local font several times On my computer, it takes a few tenths of a second to load a local font. Since a font can be used several times in a document, the cache will improve performance.	2023-05-10 20:01:21 +02:00
Jonas Jenwald	d950b91c4e	Introduce some logical assignment in the `src/core/` folder	2023-04-29 13:49:37 +02:00
Jonas Jenwald	5f64621d46	Use `String.prototype.replaceAll()` where appropriate This fairly new method allows replacing multiple occurrences within a string without having to use regular expressions. Please refer to: - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replaceAll#browser_compatibility	2023-03-22 15:31:10 +01:00
Jonas Jenwald	23930a249e	[api-minor] Let `Catalog.getAllPageDicts` return an empty dictionary when loading the first /Page fails (issue 15590) In order to support opening certain corrupt PDF documents, particularly hand-edited ones, this patch adds support for letting the `Catalog.getAllPageDicts` method fallback to returning an empty dictionary to replace (only) the first /Page of the document. Given that the viewer cannot initialize/load without access to the first page, this will thus allow e.g. document-level scripting to run as expected. Note that by effectively replacing a corrupt or missing first /Page in this way[1], we'll now render nothing but a blank page for certain cases of broken/corrupt PDF documents which may look weird. Please note: This functionality is controlled via the existing `stopAtErrors` option, that can be passed to `getDocument`, since it's easy to imagine use-cases where this sort of fallback behaviour isn't desirable. --- [1] Currently we still require that a /Pages-dictionary is found though, however it may be possible to relax even that assumption if that becomes absolutely necessary in future corrupt documents.	2022-11-03 12:51:48 +01:00
Jonas Jenwald	d470010293	Re-factor the PDF version parsing in the worker-thread Part of this is very old code, and back when support for parsing the catalog-version was added things became less clear (in my opinion). Hence this patch tries to improve things, by e.g. validating the header- and catalog-version separately.	2022-10-15 12:06:39 +02:00
Jonas Jenwald	ce66fefbff	[api-minor] Add partial support for the "GoToE" action (issue 8844) Please note: The referenced issue is the only mention that I can find, in either GitHub or Bugzilla, of "GoToE" actions. Hence why I've purposely settled for a very simple, and partial, "GoToE" implementation to avoid complicating things initially.[1] In particular, this patch only supports "GoToE" actions that reference the /EmbeddedFiles-dict in the PDF document. See https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2048909 --- [1] Usually I always prefer having real-world test-cases to work with, whenever I'm implementing new features.	2022-10-06 10:33:07 +02:00
Jonas Jenwald	60f6272ed9	Use more `for...of` loops in the code-base Most, if not all, of this code is old enough to predate the general availability of `for...of` iteration.	2022-10-03 13:08:38 +02:00
Jonas Jenwald	cc4baa2fe9	[api-minor] Add basic support for the `SetOCGState` action (issue 15372) Note that this patch implements the `SetOCGState`-handling in `PDFLinkService`, rather than as a new method in `OptionalContentConfig`[1], since this action is nothing but a series of `setVisibility`-calls and that it seems quite uncommon in real-world PDF documents. The new functionality also required some tweaks in the `PDFLayerViewer`, to ensure that the `layersView` in the sidebar is updated correctly when the optional-content visibility changes from "outside" of `PDFLayerViewer`. --- [1] We can obviously move this code into `OptionalContentConfig` instead, if deemed necessary, but for an initial implementation I figured that doing it this way might be acceptable.	2022-09-01 17:34:24 +02:00
Jonas Jenwald	216b86a082	[api-minor] Support Named-actions in the outline (issue 15367) Apparently this is implemented in e.g. Adobe Reader, and the specification does support it, however it cannot be commonly used in real-world PDF documents since it took over ten years for this feature to be requested.	2022-08-30 18:47:45 +02:00
Calixte Denizet	5f0c95e70e	[JS] Embedded JS scripts can have some null chars	2022-07-15 16:05:25 +02:00
Jonas Jenwald	9ac4536693	Enable the `unicorn/prefer-at` ESLint plugin rule (PR 15008 follow-up) Please find additional information here: - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/at - https://github.com/sindresorhus/eslint-plugin-unicorn/blob/main/docs/rules/prefer-at.md	2022-06-09 21:21:19 +02:00
Jonas Jenwald	df5a4fd0a7	Support encoded dest-strings in /GoTo destination dictionaries (issue 14864) Interestingly enough this appears to be the very first case of encoded dest-strings, in /GoTo destination dictionaries, that we've actually come across. What's really fascinating is that it's less than a week after issue 14847, given that these issues are somewhat similar.	2022-05-02 10:14:32 +02:00
Jonas Jenwald	71370d012b	Support destinations in NameTrees with encoded keys (issue 14847) Initially I considered updating the `NameOrNumberTree`-implementation to handle encoded keys, however that quickly became somewhat messy (especially in the `NameOrNumberTree.get`-method) since only NameTrees use string-keys. Hence the easiest solution, as far as I'm concerned, was thus to just update the `Catalog.destinations`-getter instead. Please note that in the referenced PDF document the `Catalog.destination`-method will thus fall back to fetching all destinations, which should be fine since this is the very first case of encoded keys that we've seen. Also changes the `NameOrNumberTree.getAll`-method to prevent a possible run-time error, although we've so far not seen such a case, for any non-Array Kids-entries found in a NameTree/NumberTree. Finally, to improve overall consistency and to hopefully prevent future bugs, the patch also updates a couple of other `NameTree` call-sites to correctly handle encoded keys. (Note that the `Catalog.attachments`-getter was already doing this.)	2022-04-27 11:19:55 +02:00
Jonas Jenwald	5bc7339c1b	Add support for the /Catalog Base-URI when resolving URLs (issue 14802) As far as I can tell, this is actually the very first time that we've seen a PDF document with a Base-URI specified in the /Catalog; please refer to the specification: https://web.archive.org/web/20220309040754if_/https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#G11.2097122 To simplify the overall implementation, this new parameter is accessed via the existing `BasePdfManager.docBaseUrl`-getter and will thus override any user-specified `docBaseUrl` API-parameter.	2022-04-19 17:14:52 +02:00
Jonas Jenwald	a919959d83	Slightly simplify the `Catalog._readMarkInfo` method We don't need to first check if the Dictionary contains the key, since trying to get a non-existent key simply returns `undefined` and we're already ensuring that the value is a boolean. Furthermore, we shouldn't need to worry about the `Object.prototype` containing enumerable properties since the checks (in `src/core/worker.js`) done for `Array.prototype` indirectly also cover `Object`s. (Keep in mind that an `Array` is just a special kind of `Object` in JavaScript.)	2022-04-05 16:37:51 +02:00
Jonas Jenwald	addb4cb12b	Use `String.prototype.repeat()` in a couple of spots Rather than using a temporary Array to manually create repeated strings, we can use `String.prototype.repeat()` instead. The reason that we didn't use this from the start is most likely because some browsers, notably IE, didn't support this; note https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/repeat#browser_compatibility	2022-03-30 15:42:40 +02:00
Jonas Jenwald	c0736647f9	Add general iteration support in the `RefSet` and `RefSetCache` classes This patch removes the existing `forEach` methods, in favor of making the classes properly iterable instead. Given that the classes are using a `Set` respectively a `Map` internally, implementing this is very easy/efficient and allows us to simplify some existing code.	2022-03-18 14:27:34 +01:00
Jonas Jenwald	939e6f0c4c	Fix a couple of small typos in JSDoc `typedef` comments While this doesn't affect the official API documentation, these cases should nonetheless be fixed.	2022-03-04 12:11:52 +01:00
Jonas Jenwald	99cd24ce3e	Remove the `isString` helper function The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isString`-calls.	2022-02-26 16:33:41 +01:00
Jonas Jenwald	3704283f5b	Remove the `isBool` helper function The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls.	2022-02-23 13:31:03 +01:00
Jonas Jenwald	82f1ee1755	Re-factor the `Catalog.viewerPreferences` method This removes the `ViewerPreferencesValidators` structure, and thus (slightly) simplifies the code overall. With these changes we only have to iterate through, and validate, the actually available Dictionary entries.	2022-02-23 13:25:56 +01:00
Jonas Jenwald	05edd91bdb	Remove the `isNum` helper function The call-sites are replaced by direct `typeof`-checks instead, which removes unnecessary function calls. Note that in the `src/`-folder we already had more `typeof`-cases than `isNum`-calls. These changes were mostly done using regular expression search-and-replace, with two exceptions: - In `Font._charToGlyph` we no longer unconditionally update the `width`, since that seems completely unnecessary. - In `PDFDocument.documentInfo`, when parsing custom entries, we now do the `typeof`-check once.	2022-02-22 11:55:34 +01:00
Jonas Jenwald	b282814e38	Prefer `instanceof Name` rather than calling `isName()` with one argument Unless you actually need to check that something is both a `Name` and also of the correct type, using `instanceof Name` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check. This patch uses ESLint to enforce this, since we obviously still want to keep the `isName` helper function for where it makes sense.	2022-02-21 12:45:00 +01:00
Jonas Jenwald	4df82ad31e	Prefer `instanceof Dict` rather than calling `isDict()` with one argument Unless you actually need to check that something is both a `Dict` and also of the correct type, using `instanceof Dict` directly should be a tiny bit more efficient since it avoids one function call and an unnecessary `undefined` check. This patch uses ESLint to enforce this, since we obviously still want to keep the `isDict` helper function for where it makes sense.	2022-02-21 12:44:56 +01:00
Jonas Jenwald	2cb2f633ac	Remove the `isRef` helper function This helper function is not really needed, since it's just a wrapper around a simple `instanceof` check, and it only adds unnecessary indirection in the code.	2022-02-19 15:33:42 +01:00
Jonas Jenwald	1a31855977	Remove the `isStream` helper function At this point all the various Stream-classes extend an abstract base-class, hence this helper function is no longer necessary and only adds unnecessary indirection in the code.	2022-02-17 13:51:36 +01:00
Jonas Jenwald	8836593b9e	Add a (global) cache to the `getCharUnicodeCategory` function Given that the regular expression has already become more complex (after the initial patch adding it), it seems to me that it probably cannot hurt to add a global cache to reduce unnecessary re-parsing. Obviously the `Glyph`-instances are being cached per font, however in most documents multiple fonts are being used and in practice there's very often a fair amount of overlap between the /ToUnicode-data in different fonts[1]. Consider for example loading and rendering the entire `tracemonkey.pdf` document (from the test-suite), which isn't a particularly large document. In that case the `getCharUnicodeCategory` function is being called a total of `601` times, however there's only `106` unique unicode-chars being checked. Please note: In practice I suppose that this won't have a huge effect on overall performance, however given the relative simplicity of this patch I figured that it'd not hurt to submit it for review. --- [1] Consider e.g. how there's usually different fonts used for regular, bold, respectively italic text.	2022-01-25 09:59:34 +01:00
Jonas Jenwald	b0e774d9c5	Convert `Catalog.getAllPageDicts` to an `async` method The patch in PR 14335 essentially re-introduced the old code from before PR 3848, however looking at this code a bit closer it should be possible to simplify it by making the method asynchronous. While this method is currently only used as a fallback in corrupt documents, the way that `MissingDataException`s are handled is less than ideal. Note that if a `MissingDataException` is thrown, we're forced to re-parse the entire /Pages tree[1]. With this method now being asynchronous, we're able to handle fetching of References in a much easier/nicer way than before without having to throw `MissingDataException`s and re-parse anything. These changes also let us simplify the call-site slightly, by calling the method directly instead of using the `PDFManager`-instance (since again it will no longer throw `MissingDataException`s). Furthermore, this patch contains the following other changes: - Reduce unnecessary duplication in the various `catch` handlers throughout the method, by simply moving the `XRefEntryException` handling into the `addPageError` helper function instead. - Move the "circular references"-check to occur slightly earlier, since there's obviously no point in asynchronously fetching data just to then throw an Error immediately afterwards. --- [1] Imagine e.g. a thousand page document, where there's a `MissingDataException` thrown when fetching/parsing page 900.	2021-12-31 22:03:10 +01:00
Jonas Jenwald	1491459dea	Improve caching for the `Catalog.getPageIndex` method (PR 13319 follow-up) This method is now being used a lot more, compared to when it was added, since it's now used together with scripting as part of the `PDFDocument.fieldObjects` parsing (called during viewer initialization). For /Page Dictionaries that we've already parsed, the `pageIndex` corresponding to a particular Reference is already known and we're thus able to skip all parsing in the `Catalog.getPageIndex` method for those cases.	2021-12-29 20:29:14 +01:00
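
Commit c0736647f9 above replaces `forEach` methods with proper iteration support, noting that this is easy when a class wraps a `Set` or `Map` internally. The following is only a minimal sketch of that general pattern, not the actual PDF.js code: the class name `SimpleRefSet` and its `put`/`has` methods are hypothetical stand-ins chosen for illustration.

```javascript
// Hypothetical Set-backed collection (an assumption for illustration,
// not the real `RefSet` implementation from PDF.js).
class SimpleRefSet {
  #set = new Set();

  put(ref) {
    this.#set.add(ref);
  }

  has(ref) {
    return this.#set.has(ref);
  }

  // Delegating to the internal Set's iterator makes instances usable
  // with `for...of`, spread syntax, and `Array.from`, so no separate
  // `forEach` method is needed.
  [Symbol.iterator]() {
    return this.#set.values();
  }
}

const refs = new SimpleRefSet();
refs.put("1R");
refs.put("2R");
refs.put("1R"); // Duplicate entry; the internal Set ignores it.

// Spread syntax relies on the `Symbol.iterator` method defined above.
const collected = [...refs];
```

Because the iterator is simply forwarded from the underlying `Set`, call-sites can use ordinary `for (const ref of refs)` loops and break out early, which a callback-based `forEach` does not allow.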
