Help with PicoSearch



How can I control the skipping of text or links in my search?

If you just want to skip (ignore) certain words in the user's search, that is controlled by the Set Stopwords feature in your account manager.

But if you want to skip certain files, directories, or parts of pages in all of the searches (i.e. removing stuff from your index), try some of the following techniques:
  • PicoSearch Skip Tags: tags to put in your files to skip parts of the page, used only by PicoSearch

  • Robots Meta Tags: tags to put in your files to skip the page or links, used by all internet indexers including PicoSearch

  • Excluded Paths: patterns to put in your PicoSearch account manager, to skip entire files or directories, based on URLs or even HTML titles

  • Excluded Tags: patterns to put in your PicoSearch account manager, to conveniently skip or keep sections of your files (paid accounts only)


PicoSearch Skip Tags: To make your search engine ignore a specific section of an HTML document, skipping both the text and links, enclose the section to be skipped in the PICOSEARCH_SKIPALLSTART and PICOSEARCH_SKIPALLEND tags, shown below. These tags look like harmless comments to everything except PicoSearch, so they will not affect how your page displays or your placement with general web search engines. (Note: the older tags NOSEARCHSTART and NOSEARCHEND are still supported, but were less precisely named.)
 
<!--PICOSEARCH_SKIPALLSTART-->
Text and links skipped by the indexer.
<!--PICOSEARCH_SKIPALLEND-->


In the same way, to make your search engine ignore a specific section of text but still follow the links inside, use the PICOSEARCH_SKIPTEXTSTART and PICOSEARCH_SKIPTEXTEND tags. This is good for wrapping around a repeated area of navigation, so the words won't keep coming up but the linked documents will still be found. Finally, PICOSEARCH_SKIPLINKSTART and PICOSEARCH_SKIPLINKEND will ignore links while still indexing the text (just in case you ever need that).
 
<!--PICOSEARCH_SKIPTEXTSTART-->
Text skipped by the indexer. Links still followed.
<!--PICOSEARCH_SKIPTEXTEND-->


 
<!--PICOSEARCH_SKIPLINKSTART-->
Text is indexed. Links are skipped over.
<!--PICOSEARCH_SKIPLINKEND-->
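
As a rough illustration of how the three tag pairs divide a page, here is a minimal Python sketch. It is not PicoSearch's indexer; the function name indexable_parts and the regex-based approach are assumptions made only for this example, and it only looks at double-quoted href attributes.

```python
import re

def _strip(html, start, end):
    """Drop everything from <!--start--> through <!--end-->, inclusive."""
    region = re.compile(r"<!--\s*%s\s*-->.*?<!--\s*%s\s*-->" % (start, end), re.DOTALL)
    return region.sub(" ", html)

def indexable_parts(html):
    """Return (text, links) that remain after the skip tags are applied (hypothetical helper)."""
    # SKIPALL regions contribute neither text nor links.
    html = _strip(html, "PICOSEARCH_SKIPALLSTART", "PICOSEARCH_SKIPALLEND")
    # SKIPTEXT regions are removed only from the text view.
    text_view = _strip(html, "PICOSEARCH_SKIPTEXTSTART", "PICOSEARCH_SKIPTEXTEND")
    # SKIPLINK regions are removed only from the link view.
    link_view = _strip(html, "PICOSEARCH_SKIPLINKSTART", "PICOSEARCH_SKIPLINKEND")

    links = re.findall(r'href="([^"]*)"', link_view)
    text = re.sub(r"<!--.*?-->", " ", text_view, flags=re.DOTALL)  # remaining comments
    text = re.sub(r"<[^>]+>", " ", text)                            # remaining tags
    return " ".join(text.split()), links
```

With a page that uses all three pairs, only the text outside SKIPALL and SKIPTEXT regions and only the links outside SKIPALL and SKIPLINK regions survive.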



Robots Meta Tags: To control the skipping of ALL text and/or links in a document, put ONE of the following robot meta tags in the head of your document. (The head of your document is between the <head> and </head> tags, and it is the same place where your document's title goes.)
 
<meta name="ROBOTS" content="NOINDEX" />
 
<meta name="ROBOTS" content="NOFOLLOW" />
 
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />

 
NOINDEX alone says to not index the text of the document, but to keep following the links to other documents. NOFOLLOW alone says to index the text of the document, but do not follow any further links from the document. Using both will effectively create a dead end in your index, blocking your search engine from seeing or going beyond this document.
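
The effect of the three variants can be summarized with a small Python sketch. This is only an illustration of the rules just described, not PicoSearch code; the class RobotsMeta and the function robots_policy are hypothetical names.

```python
from html.parser import HTMLParser

class RobotsMeta(HTMLParser):
    """Reads <meta name="ROBOTS" ...> and records what an indexer may do."""

    def __init__(self):
        super().__init__()
        self.index_text = True    # NOINDEX turns this off
        self.follow_links = True  # NOFOLLOW turns this off

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            values = {v.strip().upper() for v in (attrs.get("content") or "").split(",")}
            if "NOINDEX" in values:
                self.index_text = False
            if "NOFOLLOW" in values:
                self.follow_links = False

def robots_policy(html):
    """Return (index_text, follow_links) for a page."""
    parser = RobotsMeta()
    parser.feed(html)
    parser.close()
    return parser.index_text, parser.follow_links
```

A page with no ROBOTS tag yields (True, True); the "dead end" case NOINDEX,NOFOLLOW yields (False, False).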
 
Please note that the ROBOTS tags may be honored by general web search engines as well as PicoSearch. PicoSearch also honors the robots.txt file; see the FAQ on the robot exclusion protocol.

There is also a PicoSearch only version of the ROBOTS tags. Note that these are HTML comments, so they must begin and end as shown. Using these will work exactly like the general ROBOTS tags, but they will take priority for PicoSearch and will not affect other web search engines. You could even make the CONTENT say "INDEX,FOLLOW" if you wanted PicoSearch to index a page that the general ROBOTS tag said to "NOINDEX,NOFOLLOW".
 
<!-- picometa name="ROBOTS" content="NOINDEX" -->
 
<!-- picometa name="ROBOTS" content="NOFOLLOW" -->
 
<!-- picometa name="ROBOTS" content="NOINDEX,NOFOLLOW" -->
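
The priority rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not PicoSearch's implementation: the function name picosearch_policy is hypothetical, and the regular expressions only recognize the exact attribute order and double quotes shown in the examples on this page.

```python
import re

# Patterns for the two tag forms shown on this page (attribute order as in the examples).
META_RE = re.compile(r'<meta\s+name="ROBOTS"\s+content="([^"]*)"', re.IGNORECASE)
PICO_RE = re.compile(r'<!--\s*picometa\s+name="ROBOTS"\s+content="([^"]*)"\s*-->', re.IGNORECASE)

def picosearch_policy(html):
    """Return (index_text, follow_links); a picometa comment wins over the general tag."""
    match = PICO_RE.search(html) or META_RE.search(html)
    if match is None:
        return True, True  # no instructions: index and follow
    values = {v.strip().upper() for v in match.group(1).split(",")}
    return "NOINDEX" not in values, "NOFOLLOW" not in values
```

For a page that carries the general "NOINDEX,NOFOLLOW" tag plus a picometa comment saying "INDEX,FOLLOW", this sketch returns (True, True), while a general web search engine reading only the meta tag would still stay away.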



Excluded Paths: To control the skipping of entire files and directories of your website quickly and easily, be sure to visit your account management control panel's Exclusions section, under your Indexing topics. This feature is handy and easy to control, because it will only affect PicoSearch, and requires no editing of your web pages. You just enter in your account manager the patterns that will match the files you want to Exclude, with various possibilities as listed below. The options all combine too, so you can experiment fully.
  • left-to-right patterns: The basic exclusion pattern is a URL or URL fragment that matches left-to-right. Thus there is an implied wildcard at the end, so for example the pattern
    http://www.mysite.com/junk_directory/
    will exclude all URLs that start with that pattern, including
    http://www.mysite.com/junk_directory/ and
    http://www.mysite.com/junk_directory/junk1.html
    Any URL found during indexing that would otherwise be indexed but matches an exclusion pattern is excluded from the searchable index, i.e. skipped by the indexer, along with its ensuing links. The only exception is an Entry Point, which cannot be excluded entirely, but can be excluded with links still followed (see the exclude follow ~ pattern below). Also note that a pattern can be based not on URLs but on HTML page titles (see title patterns below).

  • wildcard patterns (*): You can add explicit wildcards anywhere in a pattern using the asterisk *, but as soon as you do then the implicit final wildcard of a left-to-right pattern is removed. Thus, the pattern
    */junk_directory/
    will still exclude
    http://www.mysite.com/junk_directory/
    as well as something else like
    http://www.mysite.com/more/junk_directory/
    but not the additional files like
    http://www.mysite.com/junk_directory/junk1.html
    You can always add a final * to get the full effect again, thus the pattern
    */junk_directory/*
    will exclude any URL with /junk_directory/ somewhere inside of it.

  • ended patterns ($): You can remove the implicit final wildcard of a left-to-right pattern by ending it with a dollar sign $. Adding an explicit wildcard anywhere would also remove the implicit final wildcard, but then of course you might match other URLs. One use of the dollar sign is if your server is generating default directory pages that you don't want searchable, but you still want everything in the directory to be searched. Thus, the pattern
    http://www.mysite.com/junk_directory/$
    will exactly stop the URL
    http://www.mysite.com/junk_directory/
    while still allowing
    http://www.mysite.com/junk_directory/junk1.html
    although that assumes that the junk1.html link was found on some other page, since http://www.mysite.com/junk_directory/ was stopped. See also the exclude follow ~ pattern below.

  • case-sensitive patterns ("): The Exclude patterns are normally case-insensitive, but if you need case-sensitivity then you can end the pattern with a double quote ". You might need this if your website has inconsistent usage of capital letters in the links, and then you discover that your server is sensitive to capitalization and even serves up slightly different pages that get past PicoSearch's Remove duplicates controls (in the account manager's Index Modes section). Thus, the pattern
    http://www.mysite.com/Junk_Directory/"
    will stop the URL
    http://www.mysite.com/Junk_Directory/junk1.html
    while still allowing
    http://www.mysite.com/junk_directory/junk1.html

  • exclude follow patterns (~): The Exclude patterns normally stop a URL completely, so the text in the page is not searchable and the links on the page are not followed. You can also do an Exclude Follow pattern by starting it with a tilde ~. This is rather like using just the NOINDEX option of a Robots Meta Tag, or wrapping the whole page in the PICOSEARCH_SKIPTEXTSTART and PICOSEARCH_SKIPTEXTEND skip tags: the text of the page is skipped but the links are still followed. This is the only Exclude pattern that can affect an Entry Point (it doesn't make sense to completely exclude an Entry Point, since then it just shouldn't be an Entry Point). So for example, the pattern
    ~http://www.mysite.com/junk_directory/
    will stop the URL
    http://www.mysite.com/junk_directory/junk1.html
    while still evaluating any links referred to by files in the junk directory (in case it's the only way to reach another part of your website).

  • inverted patterns (!): What if it's easier to say what to keep than what to get rid of? You can invert an exclude pattern by starting it with an exclamation point !. This will mean to exclude anything that doesn't match the pattern, which is very powerful and can combine with other excludes, so be careful. Thus, the exclude pattern
    !http://www.mysite.com/junk_directory/
    will make a search engine that consists only of the junk directory files. Note that this example might better be accomplished by using
    http://www.mysite.com/junk_directory/
    as an Entry Point with Directory Restriction. But other patterns that aren't limited to a single directory could be very useful, such as
    !*current*
    to keep only files with "current" somewhere in the URL.

  • title patterns (INTITLE:): You can also exclude by matching not on the URL but rather on the HTML title string, which is located between the <title> .... </title> tags in the HTML header code of your website pages. This is handy if your site has predictable titles that indicate which pages should not be searched. To use a title pattern, just start the pattern with INTITLE: and don't expect the implicit final wildcard anymore (since title strings don't match left-to-right the way URLs do). For example, the exclude pattern
    INTITLE:*personal page*
    will cause your search index to skip any HTML page whose title declares it to be a personal page, such as
    <title>Personal Page for Mr. Jones</title>
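
To see how the pattern markers above interact, here is a minimal Python sketch of a matcher for a single pattern. It is an approximation for experimenting, not PicoSearch's matcher: the function name check and its return values are hypothetical, and the order in which combined markers are read (~ before !, and " after $) is an assumption, since the text above does not specify it.

```python
import re

def check(pattern, url, title=""):
    """Return 'keep', 'exclude', or 'exclude-follow' for one exclusion pattern."""
    follow = pattern.startswith("~")        # ~ : skip the text, still follow links
    if follow:
        pattern = pattern[1:]
    invert = pattern.startswith("!")        # ! : exclude whatever does NOT match
    if invert:
        pattern = pattern[1:]
    on_title = pattern.startswith("INTITLE:")
    if on_title:
        pattern = pattern[len("INTITLE:"):]
    case_sensitive = pattern.endswith('"')  # trailing " : match case exactly
    if case_sensitive:
        pattern = pattern[:-1]
    ended = pattern.endswith("$")           # trailing $ : no implied final wildcard
    if ended:
        pattern = pattern[:-1]

    # Only a plain left-to-right URL pattern keeps the implied trailing wildcard.
    implied_tail = not (ended or on_title or "*" in pattern)
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if implied_tail:
        regex += ".*"
    flags = 0 if case_sensitive else re.IGNORECASE
    hit = re.fullmatch(regex, title if on_title else url, flags) is not None

    if hit == invert:                       # plain miss, or inverted hit
        return "keep"
    return "exclude-follow" if follow else "exclude"
```

For instance, check("*/junk_directory/", "http://www.mysite.com/junk_directory/junk1.html") gives "keep", because the explicit wildcard removed the implied one at the end, while the same URL checked against "*/junk_directory/*" gives "exclude".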



Excluded Tags: (for paying accounts only)

By using this powerful feature in your PicoSearch account manager's Exclusions features, you can make PicoSearch skip (or keep) certain tagged sections of your files without having to change the files themselves. This assumes that these sections already have uniquely identifying HTML tags around them, which can often be the case, thus saving the time of editing the files to insert PicoSearch Skip Tags.

For example, if you have the same navigation menus on all of your pages, you probably don't want to keep finding those menus in all your searches, because that may hide the pages that are really about those words. It would be best to skip common menus when indexing. So if your menus are always surrounded by the same tag (with an optional unique attribute value pair) then you can just enter that in the Excluded Tag box and reindex.

Perhaps your menus all have <div id="menu"> .... </div> around them. Entering an Excluded Tag of just <div> would skip everything inside all div tags, which is rather extreme but could be useful. Or entering <div id="menu"> will pick out just your menus, even if they also have other attributes in the way, such as <div class="happy" id="menu"> or <div id="menu" title="Big Menu">. So you don't need wildcards with Excluded Tags to match over extra attributes, although you might need wildcards within the key attribute's value, such as <div title="Big *"> to match all your Big titled div tags (title is a kind of HTML tag comment which can have any text inside the quotes).

Certain HTML tags will lend themselves well to Excluded Tags, such as <div> and <span> and lists, because they are often used sparingly to define sections of a page. Of course a closing tag is needed to end the section, so <br> won't work and <p> works only if you remember to end with </p>. You're also welcome to try excluding more common tags like <table>, just be careful to use an attribute=value pair that is unique and won't get easily lost in future edits. The attractiveness of the Excluded Tag is not having to change your pages, but that's also the danger if you never add a comment to your source and you forget what you told PicoSearch to skip.

When you enter your Excluded Tags, the account manager will enforce quotes around attribute values, but PicoSearch will really match without quotes too (quotes are the strictest HTML but many people forget them for one-word values). Your account manager's Most Recent Indexing Log will then say the full tags that are matched per file, to help you verify that the patterns are working. It will say for example:

[Excluded Tag Span] <div id=menu class=happy>
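
The matching rules for Excluded Tags (extra attributes ignored, wildcards allowed inside a value, quotes optional) can be tried out with the following Python sketch. It is a simplified model, not the PicoSearch matcher; the names parse_tag and excluded_tag_matches are hypothetical, and treating attribute values as case-sensitive is an assumption.

```python
import re
from html.parser import HTMLParser

class _FirstTag(HTMLParser):
    """Captures the first start tag it sees as (name, {attribute: value})."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        if self.found is None:
            self.found = (tag, dict(attrs))

def parse_tag(source):
    parser = _FirstTag()
    parser.feed(source)
    parser.close()
    return parser.found

def excluded_tag_matches(pattern, candidate):
    """True if the tag in `candidate` is matched by the Excluded Tag `pattern`."""
    p_tag, p_attrs = parse_tag(pattern)
    c_tag, c_attrs = parse_tag(candidate)
    if p_tag != c_tag:
        return False
    for name, wanted in p_attrs.items():
        actual = c_attrs.get(name)
        if actual is None:
            return False  # the required attribute is missing
        # A * inside the wanted value matches any run of characters.
        regex = ".*".join(re.escape(piece) for piece in (wanted or "").split("*"))
        if re.fullmatch(regex, actual) is None:
            return False
    return True
```

Because the standard-library parser accepts unquoted values, excluded_tag_matches('<div id="menu">', '<div id=menu class=happy>') returns True, which corresponds to the log line above.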

The order of processing for Excluded Tags is the order in which you entered them in your account manager. Also, Excluded Tags are applied only after HTML comments and PicoSearch Skip Tags are stripped. So if you already have a section of your page skipped because of a comment or prior tag, you don't really need another Excluded Tag.

We also mentioned the ability to keep sections by Excluded Tags, so here's the trick. If you put a ! in front of the Excluded Tag, then it takes on an inverse meaning. Any page that has a ! pattern will keep only that tagged section and get rid of everything else in the HTML body (so titles and metas are maintained). If no ! patterns match, the page is not affected. If more than one ! pattern matches, these sections get kept in the order of the ! patterns that you entered. Most people won't even play with ! patterns at all, much less more than one. But if you do try these, then remember to enter them in the account manager in the order that they come on the page, or else the text can get a little rearranged (which isn't that big a deal anyway, it just might be occasionally noticed in the concordance).

An example of a ! would be if you only wanted to search the tables with products, so your pages might be ready for applying the !<table id="products"> pattern. Then your Most Recent Indexing Log could say matches like:

[Excluded all but Tag Span] <table id=products border=0>
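
The "keep only this section" behavior of a single ! pattern can be modeled with a short Python sketch. It is illustrative only and makes several assumptions: the names KeepOnly and indexed_words are hypothetical, attribute values are compared exactly (no wildcards), and only body text is modeled, so the handling of titles and metas is left out.

```python
from html.parser import HTMLParser

class KeepOnly(HTMLParser):
    """Collects the words inside elements matching one ! pattern."""

    def __init__(self, tag, attrs):
        super().__init__()
        self.tag, self.attrs = tag, attrs
        self.depth = 0        # greater than 0 while inside a matching element
        self.matched = False  # True once the pattern has matched anywhere
        self.kept, self.everything = [], []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1   # nested element with the same tag name
        elif all(dict(attrs).get(k) == v for k, v in self.attrs.items()):
            self.depth = 1
            self.matched = True

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        words = data.split()
        self.everything.extend(words)
        if self.depth:
            self.kept.extend(words)

def indexed_words(html, tag, attrs):
    """Words left to index; the page is unaffected when the pattern never matches."""
    parser = KeepOnly(tag, attrs)
    parser.feed(html)
    parser.close()
    return parser.kept if parser.matched else parser.everything
```

Calling indexed_words(page, "table", {"id": "products"}) on a page that contains such a table returns only the words inside it; on a page without that table, every word is returned unchanged.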


Patents Pending. Copyright © PicoSearch LLC