Limit Page Crawler Content

The only way for Kentico to index content that is not in the Page Form / Text webparts is to use the Page Crawler method. This method loads the page, then scans for any text on the page from all sources, including repeaters.

The downside is that, by default, it scans the entire page, including the header and footer. So if the text "Blog" appears in your header or footer, every single page will show up in the search results for "Blog."

The Concept

What we need is a way to tell Kentico, "The content is between A and B; index only that." We accomplish this by adding "start" and "end" keywords to the pages, then handling the global event "DocumentEvents.GetContent.Execute" to trim the content down to what lies between them.

This global event is fired when a page's content is requested by the Smart Search.  At this point the Content of the document is available, including our keywords.  Let's go through this step by step.

The Code

Create a C# class file in your App_Code folder (something like CustomSmartSearchContentLoader.cs), and add this code:

[Code listing not available]

Next, you need to add the keywords to your page to define where the content starts and ends (CONTENTSTART and CONTENTEND). Additionally, you can add the exclude start and end keywords to exclude chunks within the searchable content (EXCLUDESTART and EXCLUDEEND).
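
Since the original listing is not available here, the following is only a minimal sketch of the trimming step, written as plain string handling with no Kentico dependencies. The class name CrawlerContentTrimmer and the method Trim are hypothetical names chosen for illustration; only the four marker keywords come from the article. Presumably its result would replace the document's Content inside the DocumentEvents.GetContent.Execute handler.

```csharp
using System;

// Hypothetical helper (not the author's code): trims crawled page text
// down to the region delimited by the marker keywords.
public static class CrawlerContentTrimmer
{
    private const string ContentStart = "CONTENTSTART";
    private const string ContentEnd = "CONTENTEND";
    private const string ExcludeStart = "EXCLUDESTART";
    private const string ExcludeEnd = "EXCLUDEEND";

    public static string Trim(string content)
    {
        if (string.IsNullOrEmpty(content))
        {
            return string.Empty;
        }

        string result = content;

        // Keep only the text between CONTENTSTART and CONTENTEND.
        // If either marker is missing, the content is left unchanged.
        int start = result.IndexOf(ContentStart, StringComparison.Ordinal);
        if (start >= 0)
        {
            int from = start + ContentStart.Length;
            int end = result.IndexOf(ContentEnd, from, StringComparison.Ordinal);
            if (end >= 0)
            {
                result = result.Substring(from, end - from);
            }
        }

        // Remove every EXCLUDESTART ... EXCLUDEEND chunk, markers included.
        while (true)
        {
            int exStart = result.IndexOf(ExcludeStart, StringComparison.Ordinal);
            if (exStart < 0)
            {
                break;
            }

            int exEnd = result.IndexOf(ExcludeEnd, exStart + ExcludeStart.Length, StringComparison.Ordinal);
            if (exEnd < 0)
            {
                break; // Unmatched start marker: stop rather than guess.
            }

            result = result.Remove(exStart, exEnd + ExcludeEnd.Length - exStart);
        }

        return result.Trim();
    }
}
```

Pages without the markers pass through untouched in this sketch, so templates that have not been updated yet keep their current indexing behavior.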

We don't want this text visible, but it must be actual text to be picked up by the crawler (you can't use HTML comments, since those are ignored). Instead, I add the following code to the Master Template's layout.

[Code listing not available]

A note: I used <div style="display:none;"> so the keywords are not visible but can still be scanned by the content grabber.

Likewise, you can create a WebPart Container that wraps web parts in the exclude keywords (EXCLUDESTART and EXCLUDEEND) when you don't want a web part's output included in the indexed content.
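
Both the layout markers and the container wrapper come down to the same hidden markup. The sketch below shows the strings involved, assuming the <div style="display:none;"> approach described above; CrawlerMarkers, Hidden, and Wrap are hypothetical names used only for illustration, and in practice the markup would be typed directly into the layout or the container rather than generated in code.

```csharp
// Hypothetical helper (illustration only): builds the hidden marker markup.
public static class CrawlerMarkers
{
    // Wraps a marker keyword in a hidden div: invisible to visitors,
    // but still present as text in the rendered page.
    public static string Hidden(string keyword)
    {
        return "<div style=\"display:none;\">" + keyword + "</div>";
    }

    // Surrounds a chunk of markup with a start/end marker pair,
    // e.g. EXCLUDESTART / EXCLUDEEND around a web part's output.
    public static string Wrap(string startKeyword, string html, string endKeyword)
    {
        return Hidden(startKeyword) + html + Hidden(endKeyword);
    }
}
```

For example, CrawlerMarkers.Wrap("EXCLUDESTART", sidebarHtml, "EXCLUDEEND") would produce the kind of output a container placed around a sidebar web part is meant to render, where sidebarHtml stands for that web part's markup.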


Final Notes

One thing to note is that for some reason, although we are modifying the Content (and indeed the smart search uses that modified content to search), the Smart Search description that's rendered on the search results still uses the original Content (the entire page).

So in this logic I have added two fields, "UseCustomContent" and "CustomContent."

If you want the search results to show the proper content, add a transformation like the one below:

[Code listing not available]

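The decision such a transformation has to make is small: show CustomContent when UseCustomContent is set, otherwise fall back to the original Content. A rough sketch of that choice in plain C# follows. SearchResultDescription, Pick, and the maxLength cut-off are assumptions added for illustration, not part of the original transformation.

```csharp
// Hypothetical helper (illustration only): picks the text shown for a search result.
public static class SearchResultDescription
{
    // useCustomContent and customContent mirror the "UseCustomContent" and
    // "CustomContent" fields; content is the original full-page content.
    // maxLength is an assumed display limit; zero or less means no limit.
    public static string Pick(bool useCustomContent, string customContent, string content, int maxLength)
    {
        string text = useCustomContent && !string.IsNullOrWhiteSpace(customContent)
            ? customContent
            : content;

        text = (text ?? string.Empty).Trim();

        if (maxLength <= 0 || text.Length <= maxLength)
        {
            return text;
        }

        // Cut long descriptions and mark the cut with an ellipsis.
        return text.Substring(0, maxLength).TrimEnd() + "...";
    }
}
```

Falling back to the original content keeps results readable for pages that were indexed before the custom fields existed.
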
Trevor Fayas
The method should still be there; I don't see that the original method was removed. If it is missing for you, try SearchHelper.AddGeneralField.
12/2/2018 5:27:22 PM

Nice idea for filtering content. I tried to use the searchDocument.AddGeneralField method, but it is not found in Kentico 11. Any advice?
12/2/2018 5:16:17 PM
