If you just want to skip (ignore) certain words in the user's search, that is controlled by the Set Stopwords feature in your account manager. If you want to skip or search (with or
without some relative weighting) the major HTML parts of your pages
(title, metas, body) then see the Index Modes section of your account
manager.
But if you want to skip certain files, directories, or
parts of pages in all of the searches (i.e. removing stuff from your
index), try some of the following techniques:
- PicoSearch Skip Tags: tags to put in your files to skip parts of the page, used only by PicoSearch
- Robots Meta Tags: tags to put in your files to skip the page or links, used by all internet indexers including PicoSearch
- Excluded Paths: patterns to put in your PicoSearch account manager, to skip entire files or directories, based on URLs or even HTML titles
- Excluded Tags: patterns to put in your PicoSearch account manager, to conveniently skip or keep sections of your files (paid accounts only)
PicoSearch Skip Tags: To make your search engine ignore a
specific section of an HTML document, skipping both the text and links,
enclose the section to be skipped in the PICOSEARCH_SKIPALLSTART and
PICOSEARCH_SKIPALLEND tags, shown below. These tags look like
harmless comments to everything except PicoSearch, so they will not
affect how your pages display or your placement with general web search
engines. (Note: the older tags NOSEARCHSTART and NOSEARCHEND are still
supported, but they are less precisely named.)
<!--PICOSEARCH_SKIPALLSTART-->
Text and links skipped by the indexer.
<!--PICOSEARCH_SKIPALLEND-->
In the same way, to make your search engine ignore a specific section of
text but still follow the links inside, use the
PICOSEARCH_SKIPTEXTSTART and PICOSEARCH_SKIPTEXTEND tags. This is good
for wrapping around a repeated area of navigation, so the words won't
keep coming up but the linked documents will still be found. Finally,
PICOSEARCH_SKIPLINKSTART and PICOSEARCH_SKIPLINKEND will ignore links
while still indexing the text (just in case you ever need that).
<!--PICOSEARCH_SKIPTEXTSTART-->
Text skipped by the indexer. Links still followed.
<!--PICOSEARCH_SKIPTEXTEND-->
<!--PICOSEARCH_SKIPLINKSTART-->
Text is indexed. Links are skipped over.
<!--PICOSEARCH_SKIPLINKEND-->
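The effect of the three tag pairs can be modeled in a few lines of Python. This is only a sketch for checking your own pages before reindexing, not PicoSearch's actual indexer: the function names are hypothetical, the tags are assumed to be written exactly as shown above (no extra spaces inside the comment), and the older NOSEARCHSTART/NOSEARCHEND tags are not handled.

```python
import re

def _strip(html, start, end):
    """Remove every section enclosed by the given pair of comment tags."""
    pattern = re.compile(
        re.escape("<!--" + start + "-->") + r".*?" + re.escape("<!--" + end + "-->"),
        re.DOTALL,
    )
    return pattern.sub("", html)

def text_source(html):
    """HTML whose text would still be indexed (SKIPALL and SKIPTEXT removed)."""
    html = _strip(html, "PICOSEARCH_SKIPALLSTART", "PICOSEARCH_SKIPALLEND")
    return _strip(html, "PICOSEARCH_SKIPTEXTSTART", "PICOSEARCH_SKIPTEXTEND")

def link_source(html):
    """HTML whose links would still be followed (SKIPALL and SKIPLINK removed)."""
    html = _strip(html, "PICOSEARCH_SKIPALLSTART", "PICOSEARCH_SKIPALLEND")
    return _strip(html, "PICOSEARCH_SKIPLINKSTART", "PICOSEARCH_SKIPLINKEND")
```

Running a page through both functions shows which words and which links survive, which is a quick way to confirm that a navigation block wrapped in SKIPTEXT tags still exposes its links.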
Robots Meta Tags: To control the skipping of ALL text and/or
links in a document, put ONE of the following robot meta tags in the
head of your document. (The head of your document is between the <head> and </head> tags, and it is the same place where your document's title goes.)
<meta name="ROBOTS" content="NOINDEX" />
<meta name="ROBOTS" content="NOFOLLOW" />
<meta name="ROBOTS" content="NOINDEX,NOFOLLOW" />
NOINDEX alone says to not index the text of the document, but to keep following the links to other documents. NOFOLLOW alone says to index the text of the document, but do not follow any
further links from the document. Using both will effectively create a
dead end in your index, blocking your search engine from seeing or going
beyond this document.
Please note that the ROBOTS tags may
be honored by general web search engines as well as PicoSearch. Also,
PicoSearch honors the robots.txt file; see the FAQ on the robot exclusion protocol.
There is also a PicoSearch-only version of the ROBOTS tags. Note
that these are HTML comments, so they must begin and end as shown.
Using these will work exactly like the general ROBOTS tags, but they
will take priority for PicoSearch and will not affect other web search
engines. You could even make the CONTENT say "INDEX,FOLLOW" if you
wanted PicoSearch to index a page that the general ROBOTS tag said to
"NOINDEX,NOFOLLOW".
<!-- picometa name="ROBOTS" content="NOINDEX" -->
<!-- picometa name="ROBOTS" content="NOFOLLOW" -->
<!-- picometa name="ROBOTS" content="NOINDEX,NOFOLLOW" -->
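The priority rule between the general ROBOTS tag and the picometa comment can be sketched as follows. This is an illustrative model only: the function name is hypothetical, the regular expressions assume the tags are written in the attribute order shown above, and a page with neither tag is assumed to be indexed and followed normally.

```python
import re

# General tag, e.g. <meta name="ROBOTS" content="NOINDEX" />
META_RE = re.compile(
    r'<meta\s+name="ROBOTS"\s+content="([^"]*)"\s*/?>', re.IGNORECASE)
# PicoSearch-only comment, e.g. <!-- picometa name="ROBOTS" content="NOINDEX" -->
PICO_RE = re.compile(
    r'<!--\s*picometa\s+name="ROBOTS"\s+content="([^"]*)"\s*-->', re.IGNORECASE)

def robots_policy(html):
    """Return (index_text, follow_links) for a page, picometa taking priority."""
    match = PICO_RE.search(html) or META_RE.search(html)
    if match is None:
        return True, True  # assumed default: index the text and follow the links
    directives = {d.strip().upper() for d in match.group(1).split(",")}
    return "NOINDEX" not in directives, "NOFOLLOW" not in directives
```

For a page that carries both a general "NOINDEX,NOFOLLOW" tag and a picometa "INDEX,FOLLOW" comment, this model returns (True, True), matching the override described above.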
Excluded Paths: To control the skipping of entire files and
directories of your website quickly and easily, be sure to visit your
account management control panel's Exclusions section, under your
Indexing topics. This feature is handy and easy to control, because it
will only affect PicoSearch, and requires no editing of your web pages.
You just enter in your account manager the patterns that will match the
URLs you want to Exclude, with various possibilities as listed below.
The options all combine too, so you can experiment fully. The URLs that
will be matched against will be the fully qualified unique URLs that are
expanded from your file's links during indexing. Note that anchors are
chopped off of your page links, since they jump within a page and aren't
really unique in themselves, so anchors won't match in Excluded Path
patterns.
- Left-to-right patterns: The basic exclusion pattern is a URL
or URL fragment that matches left-to-right. Thus there is an implied
wildcard at the end, so for example the pattern:
http://www.mysite.com/junk_directory/
will exclude all URLs that start with that pattern, including
http://www.mysite.com/junk_directory/ and
http://www.mysite.com/junk_directory/junk1.html
Any
URL link found during indexing that would otherwise be indexed but
matches an exclusion pattern will be excluded from the searchable index,
i.e. skipped by the indexer, along with its ensuing links. The only
exception is an Entry Point, which cannot be excluded entirely, but can
be excluded with links still followed (see the exclude follow pattern
below). Another exception to note is a pattern based not on URLs but the
HTML page titles (see title excludes below).
- Wildcard patterns (*): You can add explicit wildcards anywhere in a pattern using the asterisk *, but as soon as you do then the implicit final wildcard of a left-to-right pattern is removed. Thus, the pattern:
*/junk_directory/
will still exclude
http://www.mysite.com/junk_directory/
as well as something else like
http://www.mysite.com/more/junk_directory/
but not the additional files like
http://www.mysite.com/junk_directory/junk1.html
You can always add a final * to get the full effect again, thus the pattern
*/junk_directory/*
will exclude any URL with /junk_directory/ somewhere inside of it.
- Ended patterns ($): You can remove the implicit final wildcard of a left-to-right pattern by ending it with a dollar sign $.
Adding an explicit wildcard anywhere would also remove the implicit
final wildcard, but then of course you might match other URLs. One use
of the dollar sign is if your server is generating default directory
pages that you don't want searchable, but you still want everything in
the directory to be searched.
Thus, the pattern:
http://www.mysite.com/junk_directory/$
will exactly stop the URL
http://www.mysite.com/junk_directory/
while still allowing
http://www.mysite.com/junk_directory/junk1.html
although that assumes that the junk1.html link was found on some other page, since http://www.mysite.com/junk_directory/ was stopped. See also the exclude follow ~ pattern below.
- Case-sensitive patterns ("): The Exclude patterns are
normally case-insensitive, but if you need case-sensitivity then you can
end the pattern with a double quote ". You
might need this if your website has inconsistent usage of capital
letters in the links, and then you discover that your server is
sensitive to capitalization and even serves up slightly different pages
that get past PicoSearch's Remove duplicates controls (in the account
manager's Index Modes section).
Thus, the pattern:
http://www.mysite.com/Junk_Directory/"
will stop the URL
http://www.mysite.com/Junk_Directory/junk1.html
while still allowing
http://www.mysite.com/junk_directory/junk1.html
- Exclude Text patterns (NOINDEX:): The Exclude patterns
normally stop a URL completely, so the text in the page is not
searchable and the links on the page are not followed. You can also do
an Exclude Text pattern by starting it with NOINDEX: or with a tilde ~. This is just like using the NOINDEX option of a Robots Meta Tag or a PicoSearch SKIPTEXT Tag, but you don't have to edit your website (just remember that you made this account manager setting, which could affect many pages if it includes wildcards). Thus, the text of the page is skipped but the links are still followed. This modified Exclude pattern can affect an Entry Point, unlike a total Exclude (it wouldn't make sense to completely exclude an Entry Point, since then it just shouldn't be an Entry Point).
So for example, the pattern:
NOINDEX:http://www.mysite.com/junk_directory/
will exclude from the search this URL
http://www.mysite.com/junk_directory/junk1.html
while still evaluating any links referred to by files in the junk
directory (in case it's the only way to reach another part of your
website).
- Exclude Links patterns (NOFOLLOW:): If you have URLs for which you want to only search the text and not follow any links, you can use NOFOLLOW: to start your Exclude pattern. This is just like using the NOFOLLOW option of a Robots Meta Tag or a PicoSearch SKIPLINK Tag, but you don't have to edit your website (just remember that you made this account manager setting, which could affect many pages if it includes wildcards). This modified Exclude pattern can affect an Entry Point, unlike a total Exclude (it wouldn't make sense to completely exclude an Entry Point, since then it just shouldn't be an Entry Point).
So for example, the pattern:
NOFOLLOW:http://www.mysite.com/just_search_me_directory/
will search the URL
http://www.mysite.com/just_search_me_directory/just_search_me1.html
without following any links referred to by this or any other file in the just search me
directory.
- Inverted patterns (NOT:): What if it's easier to say what
to keep than what to get rid of? You can invert an exclude pattern by
starting it with NOT: or with an exclamation point !.
This will mean to exclude anything that doesn't match the pattern, which
is very powerful and can combine with other excludes, so be careful.
And if you depend on following links in all files, be sure to use NOINDEX:NOT: or ~!
(the two modifiers combine perfectly well, see exclude follow patterns
above).
Thus, the exclude pattern:
NOT:http://www.mysite.com/junk_directory/
will make a search engine that consists only of the junk directory
files. Note that this example might better be accomplished by using
http://www.mysite.com/junk_directory/
as an Entry Point with Directory Restriction. But other patterns that aren't limited to a single directory could be very useful, such as
NOT:*current*
to keep only files with "current" somewhere in the URL.
- Title patterns (INTITLE:): You can also exclude by matching not on the URL but rather on the HTML title string, which is located between the <title> .... </title> tags in the HTML head of your website pages. This is handy if
your site has predictable titles that indicate which pages should not be
searched. To use a title pattern, just start the pattern with INTITLE: and don't expect the implicit final wildcard anymore (since title
strings don't lend themselves to left-to-right matching the way URLs do).
For example, the
exclude pattern:
INTITLE:*personal page*
will cause your search index to skip any HTML page whose title declares it to be a personal page, such as
<title>Personal Page for Mr. Jones</title>
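The pattern rules in the list above can be collected into one small Python model, which may help when predicting what a pattern will do before you reindex. This is a sketch under stated assumptions, not PicoSearch's implementation: all names are hypothetical, modifiers are assumed to combine in any order (the text only confirms NOINDEX:NOT: and ~!), a trailing double quote is assumed to come after a trailing dollar sign if both are used, and the Entry Point exception is not modeled.

```python
import re
from dataclasses import dataclass

@dataclass
class Decision:
    index_text: bool    # is the page text searchable?
    follow_links: bool  # are the page's links followed?

# Leading modifiers from the list above, mapped to hypothetical flag names.
PREFIXES = [("NOINDEX:", "noindex"), ("~", "noindex"), ("NOFOLLOW:", "nofollow"),
            ("NOT:", "invert"), ("!", "invert"), ("INTITLE:", "title")]

def split_modifiers(pattern):
    """Peel leading modifiers off a pattern, returning (flags, remainder)."""
    flags = set()
    stripped = True
    while stripped:
        stripped = False
        for prefix, flag in PREFIXES:
            if pattern.upper().startswith(prefix):
                flags.add(flag)
                pattern = pattern[len(prefix):]
                stripped = True
    return flags, pattern

def body_matches(body, target, is_title):
    """Match the pattern body against a URL or a title string."""
    case_sensitive = body.endswith('"')   # trailing " requests case-sensitivity
    if case_sensitive:
        body = body[:-1]
    ended = body.endswith("$")            # trailing $ removes the implicit wildcard
    if ended:
        body = body[:-1]
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if not (is_title or ended or "*" in body):
        regex += ".*"                     # implicit final wildcard (left-to-right)
    flags = re.DOTALL if case_sensitive else re.DOTALL | re.IGNORECASE
    return re.fullmatch(regex, target, flags) is not None

def decide(pattern, url, title=""):
    """Model what a single exclude pattern does to a single page."""
    flags, body = split_modifiers(pattern)
    is_title = "title" in flags
    hit = body_matches(body, title if is_title else url, is_title)
    if "invert" in flags:
        hit = not hit
    if not hit:
        return Decision(True, True)       # pattern does not apply to this page
    if flags & {"noindex", "nofollow"}:
        return Decision("noindex" not in flags, "nofollow" not in flags)
    return Decision(False, False)         # plain exclude: skip text and links
```

For example, decide("NOINDEX:http://www.mysite.com/junk_directory/", "http://www.mysite.com/junk_directory/junk1.html") yields a page whose text is skipped but whose links are still followed.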
Excluded Tags: (for paying accounts only)
By using this powerful feature in your PicoSearch account manager's
Exclusions features, you can make PicoSearch skip (or keep) certain
tagged sections of your files without having to change the files
themselves. This assumes that these sections already have uniquely
identifying HTML tags around them, which can often be the case, thus
saving the time of editing the files to insert PicoSearch Skip Tags.
For example, if you have the same navigation menus on all of
your pages, you probably don't want to keep finding those menus in all
your searches, because that will obscure pages that are really about
those words. It would be best to skip common menus when indexing. So if
your menus are always surrounded by the same tag (with an optional
unique attribute value pair) then you can just enter that in the
Excluded Tag box and reindex. And if you depend on the links in those
menus for finding all pages, use a tilde (~) as described below for an
exclude and follow effect.
Say your menus all have <div id="menu"> .... </div> around them. Entering an Excluded Tag of just <div> would skip everything inside all div tags, which is rather extreme but could be useful. Or entering <div id="menu"> will pick out just your menus, even if they also have other attributes in the way, such as <div class="happy" id="menu"> or <div id="menu" title="Big Menu">.
So you don't need wildcards with Excluded Tags to match over extra
attributes, although you might need wildcards within the key attribute's
value, such as <div title="Big *"> to match all your Big titled div tags (title is an HTML attribute that can hold any descriptive text inside the quotes).
Certain HTML tags will lend themselves well to Excluded Tags, such as <div>, <span>, and lists, because they are often used sparingly to define sections of a
page. Of course a closing tag is needed to end the section, so <br> won't work and <p> works only if you remember to end with </p>. You're also welcome to try excluding more common tags like <table>,
just be careful to use an attribute=value pair that is unique and won't
get easily lost in future edits. The attractiveness of the Excluded Tag
is not having to change your pages, but that's also the danger if you
never add a comment to your source and you forget what you told
PicoSearch to skip.
When you enter your Excluded Tags, the account manager will enforce
quotes around attribute values, but PicoSearch will really match without
quotes too (quotes are the strictest HTML but many people forget them
for one-word values). Your account manager's Most Recent Indexing Log
will then say the full tags that are matched per file, to help you
verify that the patterns are working. It will say for example:
[Excluded Tag Span] <div id=menu class=happy>
The order of processing for Excluded Tags is the order in which you
entered them in your account manager. Also, Excluded Tags are applied
only after HTML comments and PicoSearch Skip Tags are stripped. So if you already have a section of your page skipped
because of a comment or prior tag, you don't really need another
Excluded Tag.
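The tag-matching behavior described above (extra attributes are ignored, quotes are optional, and * is a wildcard inside an attribute value) can be sketched in Python like this. It is an illustrative model, not PicoSearch's matcher: the function names are hypothetical, and comparing attribute values case-insensitively is an assumption the text does not confirm.

```python
import re

TAG_RE = re.compile(r"<\s*([A-Za-z][A-Za-z0-9]*)([^>]*)>")
# name="value", name='value', or unquoted name=value
ATTR_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""")

def parse_tag(tag):
    """Split an opening tag into (lowercase name, {attribute: value})."""
    m = TAG_RE.fullmatch(tag.strip())
    if m is None:
        raise ValueError("not an opening tag: " + tag)
    attrs = {}
    for a in ATTR_RE.finditer(m.group(2)):
        value = next(v for v in a.groups()[1:] if v is not None)
        attrs[a.group(1).lower()] = value
    return m.group(1).lower(), attrs

def value_matches(pattern_value, actual):
    """Compare attribute values, treating * in the pattern as a wildcard."""
    regex = ".*".join(re.escape(p) for p in pattern_value.split("*"))
    # Case-insensitive comparison is an assumption of this sketch.
    return re.fullmatch(regex, actual, re.IGNORECASE | re.DOTALL) is not None

def tag_matches(pattern, tag):
    """True if the page tag has the pattern's name and all of its attributes."""
    p_name, p_attrs = parse_tag(pattern)
    t_name, t_attrs = parse_tag(tag)
    if p_name != t_name:
        return False
    return all(key in t_attrs and value_matches(value, t_attrs[key])
               for key, value in p_attrs.items())
```

With this model, the pattern <div id="menu"> matches the logged tag <div id=menu class=happy> from the example above, while the bare pattern <div> matches every div tag.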
- Exclude Text tags (NOINDEX:): Like with Excluded Paths, you have the
option with Excluded Tags to exclude the text but still follow the links
within any matching sections of your documents. This option is
specified by preceding the tag pattern with NOINDEX: or with a tilde ~. This could
be vital if you are trying to keep the text of repeated navigation
menus out of the search, but of course you still need to follow the navigation links to all pages. And excluded Javascript links will only be followed if
you have the global option to "Allow tags that can block script parsing
per file" turned off, in your account manager's Index Modes section.
So for example, the following Excluded Tag pattern could cut your menus
from the search while still following the menu links: NOINDEX:<div id="menu">
- Exclude Links tags (NOFOLLOW:): Like with Excluded Paths, you have the
option with Excluded Tags to search the text but not follow the links within any matching sections of your documents. This option is
specified by putting NOFOLLOW: in front of the tag pattern. Excluded Javascript links will only be followed if
you have the global option to "Allow tags that can block script parsing
per file" turned off, in your account manager's Index Modes section.
So for example, the following Excluded Tag pattern could be used to search the text in lists of products but not follow the links to the product details: NOFOLLOW:<div id="product-details">
- Inverted tags (NOT:): We also mentioned the ability to keep sections
by Excluded Tags, so here's the trick. If you put a NOT: or an exclamation point ! in front of the
Excluded Tag, then it takes on an inverse meaning. Any page that has an inverted
pattern will keep only that tagged section and get rid of everything
else in the HTML body (so titles and metas are maintained). If no inverted
patterns match, the page is not affected. If more than one inverted pattern
matches, these sections get kept in the order of the inverted patterns that you
entered.
Most people won't even play with any inverted pattern at all,
much less more than one. But if you do try these, then remember to enter
them in the account manager in the order that they come on the page, or
else the text can get a little rearranged (which isn't that big a deal
anyway, it just might be occasionally noticed in the concordance). And
if you depend on following links in excluded tag sections, be sure to
use NOINDEX:NOT: or ~! (the two modifiers combine perfectly well, see Exclude Text tags
above).
An example of an inverted pattern would be if you only wanted to search the tables with products, so then it might work to use this:
NOT:<table id="products">
Your Most Recent Indexing Log would then report matches like this:
[Excluded all but Tag Span] <table id=products border=0>