Help with PicoSearch

What kinds of automatic data extraction can I do to enrich the search results?

Page Definitions are a powerful new feature for PicoSearch which enable Automatic Data Extraction to enrich search results. Data Extraction is a general term for grabbing patterns of information to fill categories. The goal of page definitions is to define patterns that PicoSearch will match against when indexing pages from your website. When a pattern matches for a particular page, the search results for that page can show more specific information than the default layout would allow. These patterns are entered in your account manager's Automatic Data Extraction section.

An example of data extraction is setting patterns that find linked pictures, such as product photos, to be shown in the search results. Linked pictures can also be specified with special PicoSearch linked picture tags inserted directly into your own website pages. But page definition patterns avoid having to edit your website and also catch more cases, assuming the images are already on the pages to be found.

Because Page Definitions are a set of patterns triggered at indexing time, the set may be extended by the PicoSearch staff to fill various customer needs. You can therefore check back at this FAQ, or see the What's New, to find out when a new data extraction ability has been added to the Page Definitions feature.

Required Name and Scope : All Page Definitions require a name and scope, just to delineate the set of patterns to follow as you enter them in your account manager's Automatic Data Extraction section. The name for the block of patterns is specified by Result: followed by any string that makes sense to you, since it's really just for your own reference as you start a new definition set. The patterns of the Result set will then be matched only against those website pages that fit the scope of the set, which is defined by one or more URL_scope: patterns. These patterns can have wildcards of * for any number of optional characters, or ? for one optional character. So for example:

Result: Little Product Photos
URL_scope: */product_catalog/*


In this set of page definitions, the first requirement is that the URLs be in the product_catalog directory. Whatever other data extracting patterns come next will only be tested for these pages. If you have more than one URL pattern, you can list multiple URL_scope: patterns and any that match will trigger the rest of the data extracting set. So that's like a logical "or" specification for qualifying URLs. An "and" specification should be done with wildcards as much as possible, and the regex modifier if necessary (it's trickier).
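
The scope test described above can be sketched in Python. This is only an illustration, not PicoSearch's actual matcher: the function names, the second scope pattern */specials/*, and the example.com URLs are hypothetical, and it assumes that a wildcard pattern must cover the whole URL, that * stands for any number of characters, and that ? stands for zero or one character.

```python
import re

def wildcard_to_regex(pattern):
    """Translate the simple wildcard syntax into a regular expression.

    Assumption: '*' means any number of characters and '?' means zero or
    one character; every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".?")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def url_in_scope(url, scope_patterns):
    """Return True if the URL matches any scope pattern (a logical "or")."""
    return any(re.fullmatch(wildcard_to_regex(p), url) for p in scope_patterns)

# Hypothetical scope list: the first pattern is from the example above,
# the second is made up to show the "or" behaviour.
scopes = ["*/product_catalog/*", "*/specials/*"]
print(url_in_scope("http://example.com/product_catalog/widget.html", scopes))
print(url_in_scope("http://example.com/about.html", scopes))
```

Listing a second URL_scope: line corresponds to adding another entry to the list, which is why multiple scope lines behave like an "or".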

Regex modifiers (advanced) : If you have a more complex qualification for any pattern in the Result set, you can write a regular expression on a single line. Just put the qualifier /regex before the colon, and write a regular expression that matches from start to end. This means that * and ? are no longer easy shorthands, because the real regular expression syntax is .* and .? to match many or one optional character(s). If you're not comfortable with regular expressions, you can contact us for help, or just make additional Result sets to separate the cases at the expense of a little duplication. But if you want to try a regex to be most compact, here's an example that would match URLs that have both "product_catalog" and "annual" in the URL, in either order.

URL_scope/regex: (.*product_catalog.*annual.*|.*annual.*product_catalog.*)
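
Before entering a regex like this, it can help to try it against a few sample URLs. The sketch below uses Python's re module as a stand-in; PicoSearch's regex dialect may differ in details, and the example.com URLs and the in_scope helper are hypothetical.

```python
import re

# The pattern from the URL_scope/regex example above.
SCOPE = re.compile(r"(.*product_catalog.*annual.*|.*annual.*product_catalog.*)")

def in_scope(url):
    # fullmatch mirrors the rule that the regex must match from start to end.
    return SCOPE.fullmatch(url) is not None

sample_urls = [
    "http://example.com/product_catalog/annual/index.html",  # both words
    "http://example.com/annual/product_catalog/index.html",  # other order
    "http://example.com/product_catalog/index.html",         # one word only
]
for url in sample_urls:
    print(url, in_scope(url))
```

The first two URLs qualify because each alternative in the regex covers one ordering of the two words; the third contains only one of them and is rejected.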

And now that we have a Result name and URL scope specified, we can look at the different Automatic Data Extraction tasks available to Page Definitions. This list of features will expand in the future, so we'll list them in the order that we make them to suit customer needs.


  1. Finding Linked Pictures (photos, icons) : A linked picture next to a search result can be accomplished by inserting special PicoSearch tags in your web pages, see this FAQ for details. But assuming the image URLs are on the page already to be found, a Page Definition Result set using link_pic patterns will be most efficient. You won't have to edit your website, as long as you remember to maintain the pattern in your HTML and make any changes for new patterns in your PicoSearch account manager.

    The patterns for finding linked pictures work something like the patterns for Excluded tags. The idea is to help PicoSearch locate a position in your HTML from which the very next image it finds will be the one to show for the page in the search results. So the pattern is for an HTML tag, with one optional attribute that may have a value pattern. This tag can be the HTML img tag itself that contains the correct picture, or any other tag that precedes the img tag for the correct picture.

    For example, if the product pages on your website were in the product catalog directory, and they always had an HTML image tag with the attribute class="ProductPhotoNNN", where NNN is a number, then the following Page Definition Result set would locate these photos during indexing and attach them to the search result for each page:

    Result: Little Product Photos
    URL_scope: */product_catalog/*
    link_pic: <img class="ProductPhoto*">


    So what's happening here is that whenever PicoSearch indexes a page on your site from the /product_catalog/ directory, it will check to see if there's an HTML image tag that has a matching class anywhere in the tag attributes. The first tag that matches starting from the top of the page will contribute its URL for the image from the HTML src="...." attribute. The * wildcard allows for any number of characters in the value before the end quote, so the NNN number is allowed. Or if we wanted to match no more than three characters we could use three ? for one optional character each.
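
To make the matching concrete, here is a rough Python sketch of the behaviour just described, built on the standard html.parser module. It is not PicoSearch's indexer: the class and function names and the sample page are hypothetical, and the wildcard translation assumes * is any number of characters and ? is zero or one character.

```python
import re
from html.parser import HTMLParser

def wildcard_to_regex(pattern):
    # Assumption: '*' is any run of characters, '?' is zero or one character.
    return "".join(
        ".*" if c == "*" else ".?" if c == "?" else re.escape(c) for c in pattern
    )

class FirstMatchingImage(HTMLParser):
    """Remember the src of the first <img> whose attribute value matches."""

    def __init__(self, attr_name, value_pattern):
        super().__init__()
        self.attr_name = attr_name
        self.value_regex = re.compile(wildcard_to_regex(value_pattern))
        self.src = None

    def handle_starttag(self, tag, attrs):
        if self.src is not None or tag != "img":
            return  # already found a picture, or not an image tag
        attr_map = dict(attrs)
        value = attr_map.get(self.attr_name)
        # The attribute may appear anywhere among the tag's attributes.
        if value is not None and self.value_regex.fullmatch(value):
            self.src = attr_map.get("src")

def find_linked_picture(html, attr_name, value_pattern):
    parser = FirstMatchingImage(attr_name, value_pattern)
    parser.feed(html)
    return parser.src

# Hypothetical page: the logo is skipped because its class does not match.
page = (
    '<body><img src="/logo.png" class="Logo">'
    '<img class="ProductPhoto123" src="/photos/widget.jpg"></body>'
)
print(find_linked_picture(page, "class", "ProductPhoto*"))
```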

    As described above for the URL_scope pattern, the link_pic pattern can take the regex modifier to support an actual regular expression for what to match in your HTML. The image tag is then taken from that point in the page or further on. This is recommended for special situations and advanced users only.

    As mentioned, the link_pic pattern doesn't have to be for the image tag itself. If the first image after the HTML body tag on your page is the picture to show for the search result, then you could use a link_pic pattern as simple as this, with no attribute at all:

    Result: Little Product Photos
    URL_scope: */product_catalog/*
    link_pic: <body>
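
The "locate a tag, then take the next image" idea can be sketched as follows. Again this is only an illustration using Python's standard html.parser, with hypothetical names and sample HTML, and it ignores attribute patterns on the anchor tag to stay short.

```python
from html.parser import HTMLParser

class NextImageAfterTag(HTMLParser):
    """Find the first <img> at or after the first occurrence of anchor_tag."""

    def __init__(self, anchor_tag):
        super().__init__()
        self.anchor_tag = anchor_tag
        self.anchor_seen = False
        self.src = None

    def handle_starttag(self, tag, attrs):
        if self.src is not None:
            return  # picture already chosen
        if tag == self.anchor_tag:
            self.anchor_seen = True
        # If the anchor is itself an img tag, it is the picture.
        if self.anchor_seen and tag == "img":
            self.src = dict(attrs).get("src")

def first_image_after(html, anchor_tag):
    parser = NextImageAfterTag(anchor_tag)
    parser.feed(html)
    return parser.src

# Hypothetical page: a banner comes first, the product photo sits in a table.
page = (
    '<body><div><img src="/banner.png"></div>'
    '<table><tr><td><img src="/photos/widget.jpg"></td></tr></table></body>'
)
print(first_image_after(page, "body"))   # first image on the page
print(first_image_after(page, "table"))  # first image inside or after the table
```

Choosing a tag closer to the wanted picture, as with the table here, is what lets the pattern skip earlier images such as banners.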


    What about the attributes of the image, where will they come from? The link on the picture in the search results will be the same as the link on the title of the search result. Thus you can use the template code PICO_CLASS_A_TITLE to set the class and thus affect the link style.

    Attributes: The width and height of any linked picture will default to the actual size of the image file. This size can then be overridden for all linked pictures in the search results from the Tables & Columns section of your account manager, which is a great place to organize images and layout to make the search results look like a product catalog. If only width or height is set, the image is scaled proportionately. Any image that was found by a link_pic pattern can have its width and/or height specified by the set: and include: modifiers, which can bring in other attributes as well. The Tables & Columns size will still take precedence.
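
The proportional scaling mentioned above works out as simple arithmetic. The following Python sketch shows one way to compute it; the function name is hypothetical and rounding to whole pixels is an assumption, since the exact rounding used in the search results is not stated here.

```python
def display_size(actual_width, actual_height, width=None, height=None):
    """Return the (width, height) an image would be shown at.

    Assumption: when only one dimension is given, the other is scaled by
    the same factor and rounded to the nearest whole pixel.
    """
    if width is None and height is None:
        return actual_width, actual_height  # default: actual file size
    if width is not None and height is not None:
        return width, height                # both set: use them as given
    if width is not None:
        return width, round(actual_height * width / actual_width)
    return round(actual_width * height / actual_height), height

# A 200x100 image with only width="50" set keeps its 2:1 shape.
print(display_size(200, 100, width=50))
print(display_size(200, 100, height=50))
```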

    So for any attributes that you would like to enforce on all linked pictures found by a particular link_pic pattern, put set: after the tag pattern. For example, the following will enforce a width of 50 and a border of 3 on all images found in the Little Product Photos result set.

    Result: Little Product Photos
    URL_scope: */product_catalog/*
    link_pic: <img class="ProductPhoto*"> set: width="50" border="3"


    If you would like to preserve the attribute values that were found on your own page, use the include: modifier to list attributes and optional value patterns. These attributes are carried over into the search results only when they are found on the image itself. For example, the following will pick up any alt text descriptions for images as found on the page, and the hspace only if it is just one digit wide (using the ? for one wildcard character). We'll also still set the border from the previous example.

    Result: Little Product Photos
    URL_scope: */product_catalog/*
    link_pic: <img class="ProductPhoto*"> include: alt hspace="?" set: border="3"
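
How include: and set: combine can be sketched in Python as below. The function name and sample attribute values are hypothetical, and two points are assumptions rather than documented behaviour: that ? matches zero or one character, and that a set: value wins if the same attribute is also listed under include:.

```python
import re

def wildcard_to_regex(pattern):
    # Assumption: '*' is any run of characters, '?' is zero or one character.
    return "".join(
        ".*" if c == "*" else ".?" if c == "?" else re.escape(c) for c in pattern
    )

def result_attributes(found_attrs, include, set_attrs):
    """Build the attributes for the picture shown in the search results.

    found_attrs: attributes on the image tag as found on the page
    include:     attribute name -> wildcard value pattern, or None for any value
    set_attrs:   attributes enforced on every picture found by the pattern
    """
    result = {}
    for name, pattern in include.items():
        value = found_attrs.get(name)
        if value is None:
            continue  # attribute not on the image, nothing to carry over
        if pattern is None or re.fullmatch(wildcard_to_regex(pattern), value):
            result[name] = value
    result.update(set_attrs)  # assumed: set: overrides include: on a clash
    return result

# Mirrors: include: alt hspace="?" set: border="3"
include = {"alt": None, "hspace": "?"}
enforced = {"border": "3"}

found = {"class": "ProductPhoto123", "alt": "Blue widget", "hspace": "12"}
print(result_attributes(found, include, enforced))
```

In this sample the alt text is carried over, the two-digit hspace is dropped because it does not fit the single ? pattern, and the border is added regardless of what the page had.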


    Precedence: As mentioned above and explained on this FAQ, you can also edit any particular file to include a PICOLINKPIC tag that provides the linked picture including optional attributes. This tag will take precedence over a link_pic pattern by default. But any particular data extracting pattern can take precedence over tags if you include the "override" modifier on a pattern, like this: link_pic/override:
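
The precedence rule can be summarized in a few lines of Python. The function and argument names are hypothetical, and the fallback to the tag when an overriding pattern finds no picture is an assumption, not something stated above.

```python
def choose_picture(tag_picture, pattern_picture, override=False):
    """Pick between a PICOLINKPIC tag picture and a link_pic pattern picture.

    Each argument is an image URL, or None if that source found nothing.
    """
    if override and pattern_picture is not None:
        return pattern_picture   # link_pic/override: beats the tag
    if tag_picture is not None:
        return tag_picture       # default: the tag takes precedence
    return pattern_picture       # no tag on the page: use the pattern, if any

print(choose_picture("/tag.jpg", "/pattern.jpg"))
print(choose_picture("/tag.jpg", "/pattern.jpg", override=True))
```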














