Seeking advice on indexing PDFs and Excel documen...
# help-with-umbraco
t
Our client needs PDF and Excel files indexed, and I'm wondering if anyone has done something similar in Umbraco or could point me in the right direction. Examine can be made to index PDF files as well, with some Unicode caveats according to the documentation. It's also possible to swap the IPdfTextExtractor implementation for one based on iTextSharp, but that has a separate license and seems to require extensive know-how. I'm also not sure it can index Excel documents. The only solution I can think of off the top of my head is to deploy Elasticsearch and write our own integration that updates it on Notifications from Umbraco. Any tips?
n
I've indexed PDFs in the past, but these days if someone wants this level of indexing, I would probably consider https://examinex.online/
It provides indexing support for Umbraco Media file content (PDFs and Office documents), but it does have a subscription cost.
k
We index PDF contents without issues using `ExaminePdfComposer`. All you need to implement is extracting text from PDF files in your `IPdfTextExtractor` using any PDF library. It's super easy.
t
@kdx-perbol Are you using this package? https://github.com/umbraco/UmbracoExamine.PDF
k
That's the one!
t
If you were to try and index office documents as well, and there was a library available to parse said documents, would you try to expand this package somehow to also do that, or imitate its approach and build your own solution? Trying to figure out what part does what here, and if/how we could also expand it ourselves to handle Excel documents.
k
Good question. All `UmbracoExamine.PDF` seems to be doing is providing the glue between a custom PDF reader and Examine/UmbracoMedia. So no, I wouldn't expand the package; I'd probably do what UmbracoExamine.PDF does for PDFs, with custom code for all Office document types, and then move the PDF stuff into that custom code as well and remove UmbracoExamine.PDF. Instead of creating an index per file type (like UE.PDF does), I'd probably create a new Media index that holds all indexed media files, regardless of file type.
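A rough, language-neutral sketch of that "one media index, one extractor per file type" idea, written in Python rather than against the Umbraco or Examine APIs. Every name here (`EXTRACTORS`, `build_media_document`, the `fileTextContent` field) is made up for illustration, and the Excel extractor only reads the shared-strings table of an .xlsx file, so treat it as a starting point and not a complete parser:

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Callable, Dict

# SpreadsheetML namespace used by xl/sharedStrings.xml inside an .xlsx file
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def extract_xlsx_text(data: bytes) -> str:
    """Return the shared strings of an .xlsx workbook, space separated.

    Only covers text cells stored in the shared-strings table; numbers,
    inline strings and formulas are ignored in this sketch.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if "xl/sharedStrings.xml" not in archive.namelist():
            return ""
        root = ET.fromstring(archive.read("xl/sharedStrings.xml"))
    # Each <si> element is one string; it may be split into several <t> runs.
    strings = [
        "".join(t.text or "" for t in si.iter(NS + "t"))
        for si in root.iter(NS + "si")
    ]
    return " ".join(s for s in strings if s)


def extract_plain_text(data: bytes) -> str:
    """Decode a plain-text file, replacing bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry: file extension -> text extractor.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    ".xlsx": extract_xlsx_text,
    ".txt": extract_plain_text,
    # ".pdf": a wrapper around whichever PDF library you pick
}


def build_media_document(file_name: str, data: bytes) -> dict:
    """Build the fields one media file would contribute to a single media index."""
    extension = "." + file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    extractor = EXTRACTORS.get(extension)
    return {
        "fileName": file_name,
        "extension": extension,
        "fileTextContent": extractor(data) if extractor else "",
    }
```

The point of the registry is that adding Word or PowerPoint support later only means adding one more entry, while the index itself stays the same.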
t
Makes sense, I grasp them concepts, thank you!
k
Looking forward to your UmbracoExamine.Media package. 🙂
To also search automatically in the new index when searching, I think you register a Searcher and tell it to search both the External index and your custom index, and set this Searcher as default.
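Conceptually, such a combined searcher just queries each index and merges the hits. A toy Python sketch of that idea follows; it is not Examine code, and the in-memory "indexes" and the occurrence-count scoring are assumptions made purely for illustration:

```python
from typing import Dict, List, Tuple

Index = Dict[str, str]  # document id -> indexed text


def search_index(index: Index, term: str) -> List[Tuple[str, int]]:
    """Return (id, score) pairs; the score is how often the term occurs."""
    term = term.lower()
    hits = []
    for doc_id, text in index.items():
        score = text.lower().count(term)
        if score:
            hits.append((doc_id, score))
    return hits


def search_all(indexes: List[Index], term: str) -> List[Tuple[str, int]]:
    """Query every index and merge the hits, best score first."""
    merged: List[Tuple[str, int]] = []
    for index in indexes:
        merged.extend(search_index(index, term))
    # Sort by descending score, then by id so the order is deterministic.
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))
```

With an "external" content index and a custom media index, `search_all([external, media], "pricing")` returns pages and files in one ranked list, which is what the site search would then render.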
Does Elasticsearch have built-in PDF content indexing, or why would that make anything easier?
t
Well it's just that I know that ElasticSearch is one of the more common engines overall, and there is some sort of ingest tool available for ES that can handle PDF and Excel etc.
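For reference, the ingest tool in question is, as far as I understand it, the Elasticsearch attachment processor, which takes a base64-encoded file and puts the extracted text under `attachment.content`. A minimal Python sketch of the two request bodies involved; the field names `data` and `fileName` and the helper names are my own choices, and nothing here talks to a real cluster, so check the Elasticsearch docs for your version before relying on it:

```python
import base64
import json


def attachment_pipeline() -> dict:
    """Body for PUT _ingest/pipeline/<name>: extract text from the base64 'data' field."""
    return {
        "description": "Extract text from uploaded media files",
        "processors": [{"attachment": {"field": "data"}}],
    }


def media_document(file_name: str, content: bytes) -> dict:
    """Body for indexing one file through that pipeline (using ?pipeline=<name>)."""
    return {
        "fileName": file_name,
        "data": base64.b64encode(content).decode("ascii"),
    }


if __name__ == "__main__":
    print(json.dumps(attachment_pipeline(), indent=2))
    print(json.dumps(media_document("report.pdf", b"%PDF-1.7 ..."), indent=2))
```

The Umbraco side would still need something that sends these bodies whenever media is saved or deleted, which is the custom integration part.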
k
If you use such a megahammer on this tiny nail, I think you should go the full eXamineX way and use AI to extract/parse/analyze data from the indexed documents as well. Then you could search for the term "complex stuff" and get hits on PDFs that contain complex stuff. (But you'd also invent your own eXamineX in the process 🙂 (And eXamineX uses Azure Search for this, which afaik in turn uses Azure OpenAI or Azure AI))
t
ExamineX sure looks like the best alternative, except for the pricing, which would turn the client off. I'll experiment for a bit.
k
Just to summarize:
* Simply indexing PDF files is a 10-lines-of-code exercise using UmbracoExamine.PDF.
* Adding indexing of Office documents requires making your own "UmbracoExamine.Office" (based on UE.PDF, unless there is a UE.Office already) and extracting text from Office documents, so some amount of custom development.
* Adding Elasticsearch and/or Azure Search is a huge undertaking.
* Switching to eXamineX is a major change to your application but is mostly configuration, not development.