Help with PicoSearch

How can I control the skipping of text or links in my search?

If you just want to skip (ignore) certain words in the user's search, that is controlled by the Set Stopwords feature in your account manager. If you want to skip or search (with or without some relative weighting) the major HTML parts of your pages (title, metas, body) then see the Index Modes section of your account manager.

But if you want to skip certain files, directories, or parts of pages in all of the searches (i.e. removing stuff from your index), try some of the following techniques:
  • PicoSearch Skip Tags: tags to put in your files to skip parts of the page, used only by PicoSearch

  • Robots Meta Tags: tags to put in your files to skip the page or links, used by all internet indexers including PicoSearch

  • Excluded Paths: patterns to put in your PicoSearch account manager, to skip entire files or directories, based on URLs or even HTML titles

  • Excluded Tags: patterns to put in your PicoSearch account manager, to conveniently skip or keep sections of your files (paid accounts only)


PicoSearch Skip Tags: To make your search engine ignore a specific section of an HTML document, skipping both the text and links, enclose the section to be skipped in the PICOSEARCH_SKIPALLSTART and PICOSEARCH_SKIPALLEND tags, seen below. These tags look like harmless comments to anything other than PicoSearch, so they will not affect how your pages display or your placement with general web search engines. (Note: the older tags NOSEARCHSTART and NOSEARCHEND are still supported, but were less precisely named.)
 
<!--PICOSEARCH_SKIPALLSTART-->
Text and links skipped by the indexer.
<!--PICOSEARCH_SKIPALLEND-->


In the same way, to make your search engine ignore a specific section of text but still follow the links inside, use the PICOSEARCH_SKIPTEXTSTART and PICOSEARCH_SKIPTEXTEND tags. This is good for wrapping around a repeated area of navigation, so the words won't keep coming up but the linked documents will still be found. Finally, PICOSEARCH_SKIPLINKSTART and PICOSEARCH_SKIPLINKEND will ignore links while still indexing the text (just in case you ever need that).
 
<!--PICOSEARCH_SKIPTEXTSTART-->
Text skipped by the indexer. Links still followed.
<!--PICOSEARCH_SKIPTEXTEND-->


 
<!--PICOSEARCH_SKIPLINKSTART-->
Text is indexed. Links are skipped over.
<!--PICOSEARCH_SKIPLINKEND-->
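
As a rough illustration of how these tags might be combined on one page, the sketch below wraps a repeated navigation menu in the SKIPTEXT tags (so its words stay out of the index but its links are still followed) and wraps a footer in the SKIPALL tags. The page title, file names, and layout are made up for the example; only the tag names come from the description above.

```html
<!DOCTYPE html>
<html>
<head>
  <title>Widget Catalog</title>
</head>
<body>
  <!-- Hypothetical navigation menu: words skipped, links still followed -->
  <!--PICOSEARCH_SKIPTEXTSTART-->
  <ul>
    <li><a href="products.html">Products</a></li>
    <li><a href="support.html">Support</a></li>
  </ul>
  <!--PICOSEARCH_SKIPTEXTEND-->

  <p>This paragraph is indexed normally.</p>

  <!-- Hypothetical footer: both text and links skipped -->
  <!--PICOSEARCH_SKIPALLSTART-->
  <p>Copyright notice. <a href="legal.html">Legal</a></p>
  <!--PICOSEARCH_SKIPALLEND-->
</body>
</html>
```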



Robots Meta Tags: To control the skipping of ALL text and/or links in a document, put ONE of the following robot meta tags in the head of your document. (The head of your document is between the <head> and </head> tags, and it is the same place where your document's title goes.)
 
<meta name="ROBOTS" content="NOINDEX" />
 
<meta name="ROBOTS" content="NOFOLLOW" />
 
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />

 
NOINDEX alone says to not index the text of the document, but to keep following the links to other documents. NOFOLLOW alone says to index the text of the document, but do not follow any further links from the document. Using both will effectively create a dead end in your index, blocking your search engine from seeing or going beyond this document.
 
Please note that the ROBOTS tags may be honored by general web search engines as well as PicoSearch. PicoSearch also honors the robots.txt file; see the FAQ on the robot exclusion protocol.

There is also a PicoSearch-only version of the ROBOTS tags. Note that these are HTML comments, so they must begin and end as shown. They work exactly like the general ROBOTS tags, but they take priority for PicoSearch and do not affect other web search engines. You could even make the CONTENT say "INDEX,FOLLOW" if you wanted PicoSearch to index a page that the general ROBOTS tag said to "NOINDEX,NOFOLLOW".
 
<!-- picometa name="ROBOTS" content="NOINDEX" -->
 
<!-- picometa name="ROBOTS" content="NOFOLLOW" -->
 
<!-- picometa name="ROBOTS" content="NOINDEX,NOFOLLOW" -->
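
The override case mentioned above might look something like the following sketch: the general ROBOTS tag tells other indexers to stay away, while the PicoSearch-only comment asks PicoSearch to index the page and follow its links. The page title is hypothetical, and placing the picometa comment in the head next to the regular meta tag is an assumption made for this example.

```html
<!DOCTYPE html>
<html>
<head>
  <title>Members Handbook</title>
  <!-- General tag: other indexers that honor it skip the text and links -->
  <meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />
  <!-- PicoSearch-only version, which takes priority for PicoSearch -->
  <!-- picometa name="ROBOTS" content="INDEX,FOLLOW" -->
</head>
<body>
  <p>Page text that only PicoSearch should index.</p>
</body>
</html>
```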



Excluded Paths: To control the skipping of entire files and directories of your website quickly and easily, be sure to visit your account management control panel's Exclusions section, under your Indexing topics. This feature is handy and easy to control, because it will only affect PicoSearch, and requires no editing of your web pages. You just enter in your account manager the patterns that will match the URLs you want to Exclude, with various possibilities as listed below. The options all combine too, so you can experiment fully. The URLs that will be matched against will be the fully qualified unique URLs that are expanded from your file's links during indexing. Note that anchors are chopped off of your page links, since they jump within a page and aren't really unique in themselves, so anchors won't match in Excluded Path patterns.
  • Left-to-right patterns: The basic exclusion pattern is a URL or URL fragment that matches left-to-right. Thus there is an implied wildcard at the end, so for example the pattern:
    http://www.mysite.com/junk_directory/
    will exclude all URLs that start with that pattern, including
    http://www.mysite.com/junk_directory/ and
    http://www.mysite.com/junk_directory/junk1.html
    Any URL link found during indexing that would otherwise be indexed but matches an exclusion pattern will be excluded from the searchable index, i.e. skipped by the indexer, along with its ensuing links. The only exception is an Entry Point, which cannot be excluded entirely, but can be excluded with links still followed (see the exclude follow pattern below). Another exception to note is a pattern based not on URLs but the HTML page titles (see title excludes below).


  • Wildcard patterns (*): You can add explicit wildcards anywhere in a pattern using the asterisk *, but as soon as you do then the implicit final wildcard of a left-to-right pattern is removed. Thus, the pattern:
    */junk_directory/
    will still exclude
    http://www.mysite.com/junk_directory/
    as well as something else like
    http://www.mysite.com/more/junk_directory/
    but not the additional files like
    http://www.mysite.com/junk_directory/junk1.html
    You can always add a final * to get the full effect again, thus the pattern
    */junk_directory/*
    will exclude any URL with /junk_directory/ somewhere inside of it.


  • Ended patterns ($): You can remove the implicit final wildcard of a left-to-right pattern by ending it with a dollar sign $. Adding an explicit wildcard anywhere would also remove the implicit final wildcard, but then of course you might match other URLs. One use of the dollar sign is if your server is generating default directory pages that you don't want searchable, but you still want everything in the directory to be searched.

    Thus, the pattern:
    http://www.mysite.com/junk_directory/$
    will exactly stop the URL
    http://www.mysite.com/junk_directory/
    while still allowing
    http://www.mysite.com/junk_directory/junk1.html
    although that assumes that the junk1.html link was found on some other page, since http://www.mysite.com/junk_directory/ was stopped. See also the exclude follow ~ pattern below.


  • Case-sensitive patterns ("): The Exclude patterns are normally case-insensitive, but if you need case-sensitivity then you can end the pattern with a double quote ". You might need this if your website has inconsistent usage of capital letters in the links, and then you discover that your server is sensitive to capitalization and even serves up slightly different pages that get past PicoSearch's Remove duplicates controls (in the account manager's Index Modes section).

    Thus, the pattern:
    http://www.mysite.com/Junk_Directory/"
    will stop the URL
    http://www.mysite.com/Junk_Directory/junk1.html
    while still allowing
    http://www.mysite.com/junk_directory/junk1.html


  • Exclude Text patterns (NOINDEX:): The Exclude patterns normally stop a URL completely, so the text in the page is not searchable and the links on the page are not followed. You can also do an Exclude Text pattern by starting it with NOINDEX: or with a tilde ~. This is just like using the NOINDEX option of a Robots Meta Tag or a PicoSearch SKIPTEXT Tag, but you don't have to edit your website (just remember that you made this account manager setting, which could affect many pages if it includes wildcards). Thus, the text of the page is skipped but the links are still followed. This modified Exclude pattern can affect an Entry Point, unlike a total Exclude (it wouldn't make sense to completely exclude an Entry Point, since then it just shouldn't be an Entry Point).

    So for example, the pattern:
    NOINDEX:http://www.mysite.com/junk_directory/
    will exclude from the search this URL
    http://www.mysite.com/junk_directory/junk1.html
    while still evaluating any links referred to by files in the junk directory (in case it's the only way to reach another part of your website).


  • Exclude Links patterns (NOFOLLOW:): If you have URLs for which you want to only search the text and not follow any links, you can use NOFOLLOW: to start your Exclude pattern. This is just like using the NOFOLLOW option of a Robots Meta Tag or a PicoSearch SKIPLINK Tag, but you don't have to edit your website (just remember that you made this account manager setting, which could affect many pages if it includes wildcards). This modified Exclude pattern can affect an Entry Point, unlike a total Exclude (it wouldn't make sense to completely exclude an Entry Point, since then it just shouldn't be an Entry Point).

    So for example, the pattern:
    NOFOLLOW:http://www.mysite.com/just_search_me_directory/
    will search the URL
    http://www.mysite.com/just_search_me_directory/just_search_me1.html
    without following any links referred to by this or any other file in the just search me directory.


  • Inverted patterns (NOT:): What if it's easier to say what to keep than what to get rid of? You can invert an exclude pattern by starting it with NOT: or with an exclamation point !. This will mean to exclude anything that doesn't match the pattern, which is very powerful and can combine with other excludes, so be careful. And if you depend on following links in all files, be sure to use NOINDEX:NOT: or ~! (the two modifiers combine perfectly well, see exclude follow patterns above).

    Thus, the exclude pattern:
    NOT:http://www.mysite.com/junk_directory/
    will make a search engine that consists only of the junk directory files. Note that this example might better be accomplished by using
    http://www.mysite.com/junk_directory/
    as an Entry Point with Directory Restriction. But other patterns that aren't limited to a single directory could be very useful, such as
    NOT:*current*
    to keep only files with "current" somewhere in the URL.


  • Title patterns (INTITLE:): You can also exclude by matching not on the URL but rather on the HTML title string, which is located between the <title> .... </title> tags in the HTML header code of your website pages. This is handy if your site has predictable titles that indicate which pages should not be searched. To use a title pattern, just start the pattern with INTITLE: and don't expect the implicit final wildcard anymore (since title strings aren't as left-to-right matching as URLs).

    For example, the exclude pattern:
    INTITLE:*personal page*
    will cause your search index to skip any HTML page whose title declares it to be a personal page, such as
    <title>Personal Page for Mr. Jones</title>




Excluded Tags: (for paying accounts only)

By using this powerful feature in your PicoSearch account manager's Exclusions features, you can make PicoSearch skip (or keep) certain tagged sections of your files without having to change the files themselves. This assumes that these sections already have uniquely identifying HTML tags around them, which can often be the case, thus saving the time of editing the files to insert PicoSearch Skip Tags.

For example, if you have the same navigation menus on all of your pages, you probably don't want to keep finding those menus in all your searches, because that will obscure pages that are really about those words. It would be best to skip common menus when indexing. So if your menus are always surrounded by the same tag (with an optional unique attribute value pair) then you can just enter that in the Excluded Tag box and reindex. And if you depend on the links in those menus for finding all pages, use a tilde (~) as described below for an exclude and follow effect.

Say your menus all have <div id="menu"> .... </div> around them. Entering an Excluded Tag of just <div> would skip everything inside all div tags, which is rather extreme but could be useful. Or entering <div id="menu"> will pick out just your menus, even if they also have other attributes in the way, such as <div class="happy" id="menu"> or <div id="menu" title="Big Menu">. So you don't need wildcards with Excluded Tags to match over extra attributes, although you might need wildcards within the key attribute's value, such as <div title="Big *"> to match all your Big titled div tags (title is a kind of HTML tag comment which can have any text inside the quotes).
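
To make the matching concrete, here is a small made-up page fragment. Going by the description above, an Excluded Tag of <div id="menu"> would pick out the first two sections and leave the third alone. The class names, titles, and link targets are invented for the illustration.

```html
<!-- Would match <div id="menu">: the extra class attribute does not get in the way -->
<div class="happy" id="menu">
  <a href="index.html">Home</a>
  <a href="contact.html">Contact</a>
</div>

<!-- Would also match <div id="menu">, and <div title="Big *"> as well -->
<div id="menu" title="Big Menu">
  <a href="archive.html">Archive</a>
</div>

<!-- Would not match <div id="menu">: the id value is different -->
<div id="content">
  <p>Main page text that stays searchable.</p>
</div>
```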

Certain HTML tags will lend themselves well to Excluded Tags, such as <div> and <span> and lists, because they are often used sparingly to define sections of a page. Of course a closing tag is needed to end the section, so <br> won't work and <p> works only if you remember to end with </p>. You're also welcome to try excluding more common tags like <table>, just be careful to use an attribute=value pair that is unique and won't get easily lost in future edits. The attractiveness of the Excluded Tag is not having to change your pages, but that's also the danger if you never add a comment to your source and you forget what you told PicoSearch to skip.

When you enter your Excluded Tags, the account manager will enforce quotes around attribute values, but PicoSearch will match unquoted values too (quotes are the strictest HTML but many people forget them for one-word values). Your account manager's Most Recent Indexing Log will then list the full tags that are matched per file, to help you verify that the patterns are working. It will say for example:

[Excluded Tag Span] <div id=menu class=happy>

The order of processing for Excluded Tags is the order in which you entered them in your account manager. Also, Excluded Tags are applied only after HTML comments and PicoSearch Skip Tags are stripped. So if you already have a section of your page skipped because of a comment or prior tag, you don't really need another Excluded Tag.

  • Exclude Text tags (NOINDEX:): Like with Excluded Paths, you have the option with Excluded Tags to exclude the text but still follow the links within any matching sections of your documents. This option is specified by preceding the tag pattern with NOINDEX: or with a tilde ~. This could be vital if you are trying to keep the text of repeated navigation menus out of the search, but of course you still need to follow the navigation links to all pages. And excluded Javascript links will only be followed if you have the global option to "Allow tags that can block script parsing per file" turned off, in your account manager's Index Modes section.

    So for example, the following Excluded Tag pattern could cut your menus from the search while still following the menu links:
    NOINDEX:<div id="menu">


  • Exclude Links tags (NOFOLLOW:): Like with Excluded Paths, you have the option with Excluded Tags to search the text but not follow the links within any matching sections of your documents. This option is specified by putting NOFOLLOW: in front of the tag pattern. Excluded Javascript links will only be followed if you have the global option to "Allow tags that can block script parsing per file" turned off, in your account manager's Index Modes section.

    So for example, the following Excluded Tag pattern could be used to search the text in lists of products but not follow the links to the product details:
    NOFOLLOW:<div id="product-details">


  • Inverted tags (NOT:): We also mentioned the ability to keep sections by Excluded Tags, so here's the trick. If you put a NOT: or exclamation point ! in front of the Excluded Tag, then it takes on an inverse meaning. Any page that has an inverted pattern will keep only that tagged section and get rid of everything else in the HTML body (so titles and metas are maintained). If no inverted patterns match, the page is not affected. If more than one inverted pattern matches, these sections get kept in the order of the inverted patterns that you entered.

    Most people won't even play with any inverted pattern at all, much less more than one. But if you do try these, then remember to enter them in the account manager in the order that they come on the page, or else the text can get a little rearranged (which isn't that big a deal anyway, it just might be occasionally noticed in the concordance). And if you depend on following links in excluded tag sections, be sure to use NOINDEX:NOT: or ~! (the two modifiers combine perfectly well, see Exclude Text tags above).

    An example of an inverted pattern would be if you only wanted to search the tables with products, so then it might work to use this:
    NOT:<table id="products">
    Your Most Recent Indexing Log would then say matches like this:
    [Excluded all but Tag Span] <table id=products border=0>
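
A page that the inverted pattern NOT:<table id="products"> would apply to might look roughly like the sketch below. Per the description above, only the products table would be kept from the body, while the title and metas are maintained. The page text, meta description, and product names are invented for the example.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Title and metas are maintained even with an inverted pattern -->
  <title>Spring Product List</title>
  <meta name="description" content="Hypothetical product listing" />
</head>
<body>
  <!-- Dropped from the index: outside the kept section -->
  <p>Welcome banner and seasonal announcements.</p>

  <!-- Kept: the only body section matching the inverted pattern -->
  <table id="products" border="0">
    <tr><td>Blue Widget</td><td>Small</td></tr>
    <tr><td>Red Widget</td><td>Large</td></tr>
  </table>

  <!-- Dropped from the index: outside the kept section -->
  <p>Footer text and contact details.</p>
</body>
</html>
```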
