access raw data of page Umbraco #help-with-umbraco

OMegaliterosMalakas

12/11/2024, 7:20 PM

Is there any page/querystring magic that can dump out the underlying data of a page as XML/JSON, for use in search crawling/parsing?

Brendan

12/12/2024, 3:36 AM

Can you explain what you are trying to do?

OMegaliterosMalakas

12/12/2024, 3:31 PM

We are making a site for a customer. They have a search engine that they want to crawl the site (multiple times per day! Their info is going to change quickly). They would prefer not to crawl the human-readable version of the site, but rather a more "structured" version of the information, so it's not just text on the screen and they can see which field names map to which text, etc. So basically I want to give them a read-only version of the edit screen, or something like the uSync data, so they can ingest that.

Luuk Peters (Proud Nerds)

12/12/2024, 3:34 PM

Perhaps you can use the Content Delivery API for that. I mean, it's all data in JSON format. With the correct query parameters you can get what you want, I think. And if necessary you can prevent public access and require an API key, so that not everyone can call the API.
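
A rough sketch of what the crawler side of that could look like, in Python. The endpoint path (/umbraco/delivery/api/v2/content/item/{path}), the Api-Key header name and the "properties" key in the response are assumptions based on the Delivery API as I understand it, so check them against the docs for your Umbraco version. The function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint shape for the Content Delivery API (v2); verify for your Umbraco version.
DELIVERY_ITEM_PATH = "/umbraco/delivery/api/v2/content/item/"


def build_item_request(base_url, page_path, api_key=None):
    """Build a GET request for one content item, addressed by its URL path."""
    encoded = urllib.parse.quote(page_path.strip("/"), safe="/")
    url = base_url.rstrip("/") + DELIVERY_ITEM_PATH + encoded
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    if api_key:
        # Assumed header name for when public access is disabled.
        request.add_header("Api-Key", api_key)
    return request


def field_map(item):
    """Return {property alias: value} from an item payload (assumes a 'properties' object)."""
    return dict(item.get("properties") or {})


def fetch_fields(base_url, page_path, api_key=None):
    """Fetch one page and return its field-name-to-value mapping."""
    request = build_item_request(base_url, page_path, api_key)
    with urllib.request.urlopen(request) as response:
        return field_map(json.load(response))
```

Calling fetch_fields("https://example.com", "/news/my-article", api_key="...") would then hand the crawler a plain dict of field aliases and values, which is roughly the "read-only edit screen" being asked for.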

OMegaliterosMalakas

12/12/2024, 3:37 PM

thank you, I will look at that

Dean Leigh

12/12/2024, 10:37 PM

+1 for the Content Delivery API, but if it's for searching you could also embed JSON-LD in the rendered page.
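
For illustration, a minimal Python sketch of the JSON-LD idea: pick a few page fields and emit them as a schema.org script block in the rendered HTML. The field aliases, the mapping to schema.org properties and the function name are all hypothetical. In a real Umbraco site this would live in the page template rather than in Python.

```python
import json


def json_ld_script(page_url, fields, mapping, schema_type="WebPage"):
    """Render a schema.org JSON-LD <script> element from selected page fields.

    `mapping` is a hypothetical {field alias: schema.org property} table;
    fields without a mapping are left out of the markup.
    """
    data = {"@context": "https://schema.org", "@type": schema_type, "url": page_url}
    for alias, schema_property in mapping.items():
        if alias in fields:
            data[schema_property] = fields[alias]
    # Escape "</" so a field value cannot close the script element early.
    payload = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return '<script type="application/ld+json">' + payload + "</script>"
```

Only the mapped fields end up in the HTML, which keeps the extra download small compared with dumping every field of the page.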

Luuk Peters (Proud Nerds)

12/13/2024, 7:49 AM

It depends on how much data you want, because you do make the HTML download bigger (and therefore slower).

Dean Leigh

12/13/2024, 7:52 AM

It does indeed slow performance but many sites are using JSON-LD for SEO anyway. I often use Microdata for that reason, which could also be crawled.

Luuk Peters (Proud Nerds)

12/13/2024, 7:56 AM

Agreed, but in this case it seems that ALL data of a page should be available as structured data. I'm not sure you want that much data added to the HTML. @OMegaliterosMalakas I'm curious though. They want to index structured data instead of the human-readable page. But usually the human-readable page gives much more semantic value to what is otherwise just text: I'm talking about titles, headers, sections, etc. Otherwise you need to provide that additional semantic value using JSON-LD. For a search crawler, just reading the 'plain' JSON/XML or whatever doesn't provide nearly as much semantics. What is wrong with indexing the human-readable version?

Dean Leigh

12/13/2024, 8:08 AM

I think you just made my point better than I did 🙂 The output from the Content Delivery API is, as you say, 'plain JSON', so marking up the page would be beneficial in numerous ways. Going a little off topic, there are many signs that structured data for search may no longer be required as AI improves. I'm sure that won't be of interest to the client for now though.
