Struggling to index PDF files with Examine
d
I am running a website with a search index that uses some advanced search features. My next objective is to make PDF files searchable. I've copied the code from the Examine PDF plugin: https://github.com/umbraco/UmbracoExamine.PDF

Though it basically works, my index fails to stay healthy. There are about 200 PDFs, up to 30 MB each, and indexing them with their content takes more than 2 hours for some reason. The testing environment shuts down after 20 minutes of inactivity, so the application is stopped midway through indexing, which seems to corrupt the index files. Besides that, 2 hours to index all content is simply not acceptable.

Relevant context:
- Umbraco 10.8.3
- Media is saved in Azure Blob Storage (the bottleneck is likely here)
- The testing environment is an on-prem Windows server with IIS
- The index supports facets and spatial search

Does anyone have experience with this use case and can give me advice on how to improve this?

The corruption is likely caused by how we support facets and spatial search: since Examine doesn't natively support these features (yet), we hook into the events of the Examine index to keep a separate raw Lucene index in sync. It's this separate index that frequently gets corrupted.
j
The PDF package comes with a pdfTextService you can use to extract the text, and you can then put that text into whatever index field you want. I have an example of exactly that in this blog post: https://dev.to/jemayn/index-pdfs-on-their-pages-in-umbraco-30l1

Whether the native Lucene index is the issue, I can't say 🙂
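Roughly, the idea looks like the sketch below. It is deliberately independent of Umbraco so it can be read on its own: the `extractText` delegate is only a stand-in for the package's pdfTextService (its real signature may differ), and the `pdfContent` field name and the `AddPdfText` helper are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PdfIndexFieldSketch
{
    // Hypothetical field name; use whatever your search queries expect.
    public const string PdfTextField = "pdfContent";

    // Returns a copy of an item's index values with the extracted PDF text added.
    // extractText is an assumed stand-in for the package's pdfTextService.
    public static Dictionary<string, List<object>> AddPdfText(
        IReadOnlyDictionary<string, List<object>> values,
        IEnumerable<string> pdfPaths,
        Func<string, string> extractText)
    {
        // Copy the existing values so the caller's dictionary is left untouched.
        var result = values.ToDictionary(kv => kv.Key, kv => new List<object>(kv.Value));

        // Extract text per PDF and skip files that yield no text.
        var texts = pdfPaths
            .Select(extractText)
            .Where(text => !string.IsNullOrWhiteSpace(text))
            .Cast<object>()
            .ToList();

        // Only add the field when there is something to index.
        if (texts.Count > 0)
        {
            result[PdfTextField] = texts;
        }

        return result;
    }
}
```

In a real site you would probably call something like this from the index's TransformingIndexValues event handler and write the result back to the item being indexed, but that wiring depends on your setup, so treat the above as a starting point only.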
d
Thanks for the pointer! I'll have a look
Looks like for the most part I already do the same. In this case I need both "content by referenced PDF" and just the PDFs as separate search results, but using the pdfTextService is still a good idea.