I am running a website with a search index that uses some advanced search features. My next objective is to make PDF files searchable. I've copied the code from the UmbracoExamine.PDF plugin:
https://github.com/umbraco/UmbracoExamine.PDF
Though it basically works, the index does not stay healthy. There are about 200 PDFs, each up to 30 MB in size, and for some reason it takes more than 2 hours to index them with their content. The testing environment shuts down after 20 minutes of inactivity, so the application is stopped midway through indexing, which seems to corrupt the index files. Beyond that, 2 hours to index all content is simply not acceptable.
Relevant context:
- Umbraco 10.8.3
- Media is saved in Azure blob storage (bottleneck is likely here)
- Testing environment is an on-prem Windows server with IIS.
- Index supports facets and spatial search.
Does anyone have experience with this use case and advice on how to improve it?