Issue with Filtering PDF Search by Path in Examine...
# help-with-umbraco
o
Hey everyone, I'm implementing a PDF search using Examine in Umbraco 13, and everything is working except for filtering by path.

What Works ✅
- Searching for text inside PDFs works using fileTextContent.
- Searching by nodeName also works.
- If I don't filter by path, I get results.

What Doesn't Work ❌
- When I try to filter by path, I get zero results.
- The same query works in Examine Management but not in my code.

How My Examine Data Looks:
In Examine Management, when I search for a PDF, I see this in the index:
path: -1,1600,1603,1611,2227
__IndexType: pdf
nodeName: Marketing Document
fileTextContent: "This document contains marketing strategies..."
- 1603 is my media root folder (selected by the user).
- 2227 is the actual PDF file inside /HQM/blog1/.

Lucene Query That Works in Examine Management
If I manually search in Examine Management with:
fileTextContent:marketing~2 OR nodeName:marketing~2
I get results.

Lucene Query That My Code Generates (Fails)
Here's what my code generates:
Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603* }
This returns zero results ❌, even though 1603* should match -1,1600,1603,1611,2227.

What I Tried So Far 🔄

❌ criteria.And().Field("path", $"{rootNode.Id}*")
Fails: Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603 } → No results. Wildcard is ignored.

❌ criteria.And().NativeQuery($"path:{rootNode.Id}*")
Fails: Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603* } → No results.

❌ criteria.And().Field("path", "*1603*")
Fails: Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:*1603* } → No results. Throws error: '*' or '?' not allowed as first character in WildcardQuery

How Can I Filter PDFs by Path Correctly?
I need to filter PDFs so that only those inside a specific media folder (e.g., 1603) appear.
- Does anyone have experience with Examine's path filtering in Umbraco?
- Why is path:1603* returning nothing, even though the paths exist?
- Is there a better way to search PDFs within a media folder and its children?

Thanks in advance! 🚀
j
> 1603* should match -1,1600,1603,1611,2227.
This is technically not true. *1603* should match, or -1,1600,1603* should.
I haven't worked with paths in the PDF index, but it will depend on how the values are indexed. You can't visually see a difference in the backoffice viewer between values indexed as "1,2,3,4" and [1,2,3,4]. But Examine handles a "multivalue field" very differently from a single-value field containing a string of multiple comma-separated values.
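To make the prefix point concrete, here is a minimal sketch in plain C# with no Examine involved. It assumes the path is indexed as one unsplit term, which nobody in this thread has confirmed. Under that assumption, a trailing-wildcard query such as path:1603* behaves like a prefix test on the whole string:

```csharp
using System;

// Assumption: the whole path is stored as a single term in the index.
string indexedTerm = "-1,1600,1603,1611,2227";

// path:1603* -> the term would have to START with "1603".
Console.WriteLine(indexedTerm.StartsWith("1603", StringComparison.Ordinal));         // False

// path:-1,1600,1603* -> the prefix covers the start of the term.
Console.WriteLine(indexedTerm.StartsWith("-1,1600,1603", StringComparison.Ordinal)); // True

// path:*1603* -> a "contains" test, which needs a leading wildcard.
Console.WriteLine(indexedTerm.Contains("1603", StringComparison.Ordinal));           // True
```

The last case is the one the query parser rejected with the "'*' or '?' not allowed as first character" error quoted earlier.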
o
Thanks for the clarification. I debugged the results in Visual Studio and inspected the path property of a matching PDF result. It matches exactly what I see in the backoffice Examine Management UI. Here's what I found:
- The path value of the result is: -1,1600,1603,1611,2227.
- When I use +path:1603* in my Lucene query, it still does not return results even though 1603 is part of the path.
- I also tried using +path:*1603* as a RawQuery to account for all possible matches where 1603 appears anywhere in the path, but it seems not to work or throws errors when attempting the wildcard logic.

Do you know if Examine treats path fields differently when querying? Or is there something else I should check to ensure this works? Thanks for your help!
https://cdn.discordapp.com/attachments/1334850818012352575/1334854616843685969/image.png?ex=679e0bbd&is=679cba3d&hm=d55748fa8ab8726f75c10c716a459532b6ae2f1d7af3bb3fd1a1ceefdb74ac12&
j
Historically with Examine, people will edit the path field and make it index the values separately or with spaces in between. IIRC it is because Lucene strips specific characters when it searches, among which is the comma. So if you index "-1,1600,1603,1611,2227", Lucene will treat it as "11600160316112227", which leads to issues when filtering by one of the node IDs. My guess is that this is what you are running into here. So re-indexing the path in a TransformingIndexValues event handler may fix your issue.
o
I solved it like this:
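using System.Collections.Generic;
using System.Linq;
using Examine;
using Microsoft.Extensions.Logging; // needed for ILogger<T>
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;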

namespace HunzikerIntranet.HQM.Umbraco.Infrastructure.Examine
{
    public class PDFIndexPathTransformer : INotificationHandler<UmbracoApplicationStartedNotification>
    {
        private readonly IExamineManager _examineManager;
        private readonly ILogger<PDFIndexPathTransformer> _logger;

        public PDFIndexPathTransformer(IExamineManager examineManager, ILogger<PDFIndexPathTransformer> logger)
        {
            _examineManager = examineManager;
            _logger = logger;
        }

        public void Handle(UmbracoApplicationStartedNotification notification)
        {
            if (!_examineManager.TryGetIndex("PDFIndex", out var pdfIndex))
            {
                _logger.LogError("PDFIndex not found in Examine.");
                return;
            }

            pdfIndex.TransformingIndexValues += IndexOnTransformingIndexValues;
        }

        private void IndexOnTransformingIndexValues(object? sender, IndexingItemEventArgs e)
        {
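            // Nothing to do for items that carry no path value.
            if (!e.ValueSet.Values.ContainsKey("path")) return;

            var rawPath = e.ValueSet.Values["path"].FirstOrDefault()?.ToString();
            if (string.IsNullOrEmpty(rawPath)) return;

            // "-1,1600,1603,1611,2227" becomes "-1 1600 1603 1611 2227",
            // so a tokenizing analyzer can index each node ID as its own term.
            var searchablePath = rawPath.Replace(",", " ");

            // Copy the existing values and add the new field next to the original path.
            var indexFields = e.ValueSet.Values.ToDictionary(x => x.Key, x => x.Value.ToList());
            indexFields["searchablePath"] = new List<object> { searchablePath };

            // Write the extended value set back so it is what gets indexed.
            e.SetValues(indexFields.ToDictionary(x => x.Key, x => (IEnumerable<object>)x.Value));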
        }
    }
}
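The handler above only runs if Umbraco knows about it. The thread does not show how it was registered, so the following is a sketch of one common way to do it, using a composer. The class name PDFIndexPathComposer is hypothetical; PDFIndexPathTransformer is the class posted above.

```csharp
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Notifications;

namespace HunzikerIntranet.HQM.Umbraco.Infrastructure.Examine
{
    // Hypothetical composer: not part of the original post.
    public class PDFIndexPathComposer : IComposer
    {
        public void Compose(IUmbracoBuilder builder)
        {
            // Run PDFIndexPathTransformer.Handle once the application has started,
            // which is where it subscribes to TransformingIndexValues.
            builder.AddNotificationHandler<UmbracoApplicationStartedNotification, PDFIndexPathTransformer>();
        }
    }
}
```

Because the transform runs at indexing time, PDFs that were indexed before the handler existed will presumably not have searchablePath until the PDF index is rebuilt, for example from Examine Management.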
j
Yes exactly like that, so now it should work if you filter on searchablePath instead of path
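For reference, a query against the new field might look roughly like the sketch below. The service class, method name, and parameters are hypothetical, and the text part simply reuses the fuzzy query from the first post through NativeQuery. It also assumes searchablePath is indexed with a tokenizing analyzer, so that each node ID is its own term.

```csharp
using System.Collections.Generic;
using System.Linq;
using Examine;

// Hypothetical service: names are illustrative, not from the original post.
public class PdfSearchService
{
    private readonly IExamineManager _examineManager;

    public PdfSearchService(IExamineManager examineManager)
        => _examineManager = examineManager;

    public IReadOnlyList<ISearchResult> Search(string term, int rootNodeId)
    {
        if (!_examineManager.TryGetIndex("PDFIndex", out var pdfIndex))
            return new List<ISearchResult>();

        // Same fuzzy text query as in the first post.
        // Note: term is used unescaped here; validate or escape real user input.
        var textQuery = $"fileTextContent:{term}~2 nodeName:{term}~2";

        return pdfIndex.Searcher
            .CreateQuery("pdf")
            .NativeQuery(textQuery)
            .And()
            // No wildcard: the folder ID is matched as a whole term.
            .Field("searchablePath", rootNodeId.ToString())
            .Execute()
            .ToList();
    }
}
```

Calling Search("marketing", 1603) would then be expected to return only PDFs whose path contains node 1603, whether directly in that folder or in one of its subfolders.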
o
Yes, it's working now, I wanted to tell you that.
@Jemayn thanks for your help
j
Glad to hear it, and np 🙂