PicoSearch starts from your Entry Points and follows URL links, finding
all the files (documents) that it can, just like any user. It can only
work this way because PicoSearch cannot mind-read your server to know
what files are lying around. PicoSearch works online only, just like a
person visiting your site. Your Entry Points are URLs that can be your
homepages, or webpages of links prepared just for PicoSearch.
If PicoSearch is finding too many or too few files, relax. PicoSearch
is very flexible, and it will serve you well once you determine the
settings you need. There must be a good explanation, and rest assured
you are not the first person to have your problem. We've helped many
people before.
In general, don't be afraid to experiment with PicoSearch's settings to
find just the files you want. You can reindex as often as you like.
The following suggestions will give you ideas to try:
- Check your File and its Content-Type:
Is it a file that your PicoSearch can index? See details on types below.
Is the file correctly formed? For example in HTML,
the <head> and <body> tags must be in the right order for
PicoSearch to index the body text.
The PicoSearch standard is to index HTML and plain text files, as well as other formats.
This means that the HTTP content-type field returned by your site's
server must be "text/plain" or "text/html", which it usually is, or the
correct type for another format. One example problem is a hoster that
leaves the HTTP field blank. This is hard to notice directly, because
your browser may compensate for the missing field. If PicoSearch ends up
rejecting your webpage for this reason, you would have to ask your
hoster to fix the situation (and they should).
Flash Shockwave sites look nice but can be hard for search engines; see
our suggestions. You can index other file types including .pdf, .doc,
.xls, .ps, and .rtf, but most of these are available only to
Professional and Premium Accounts. Some licensed formats like PDF
require that we honor copy protections, so we cannot index PDF files
that have copy protection turned on. For the latest on which file types
are available to free versus paying accounts, see the FAQ
"Can PicoSearch index files besides HTML?"
Also, note that since PicoSearch works over the internet, we cannot
connect to your database except through the links on your webpages. So
PicoSearch will index your database only if you provide links that call
CGI scripts, ASP pages, or other methods that convert the database into
HTML document views.
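The Content-Type check described above can be sketched in Python. This is only an illustration and not PicoSearch's own code: the function names classify_content_type and fetch_content_type are hypothetical, the sketch recognizes only the two text types named above, and it assumes your server answers HEAD requests.

```python
from typing import Optional
from urllib.request import Request, urlopen

# The two text types named in the FAQ; other formats would need their own entries.
TEXT_TYPES = {"text/html", "text/plain"}


def classify_content_type(header_value: Optional[str]) -> str:
    """Return "missing", "text", or "other" for a Content-Type header value."""
    if header_value is None or not header_value.strip():
        return "missing"  # the blank-header case described above
    # Drop parameters such as "; charset=UTF-8" before comparing.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return "text" if media_type in TEXT_TYPES else "other"


def fetch_content_type(url: str) -> Optional[str]:
    """Ask the server for the headers of `url` and return its Content-Type."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")
```

For example, classify_content_type(fetch_content_type("http://www.mysite.com/")) should report "text" for an ordinary HTML homepage. A result of "missing" is the situation to raise with your hoster.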
- Check your Account Page Maximums: Don't forget that a Free Account
gets only 250 pages total for its search engine account, where one URL
document generally equals one PicoSearch page (extra-long HTML/text, or
multi-page non-HTML formats like PDFs, may yield more than 1 PicoSearch
page per document). Professional and Premium Accounts get 3000 and 6000
pages respectively (you may contact us for pro-rated increments above
that).
PicoSearch now supports Guest URLs with individual maximums, and Portal
Lists with a maximum number of pages for each link in a set of links, so
these limits must also be considered. For more information on how these
mini web portal construction tools work, see our FAQs starting with the
announcement "What's New? Include Guest URLs and become a mini Web
Portal!"
- Check the Indexing Progress:
Is PicoSearch done indexing? Did it find a link to the page?
Your most valuable tool will be the indexing progress reporting, seen
either in online indexing (if online indexing doesn't take too long for
you), or in the "Most Recent Indexing Log" of any paid account's
reporting section. All accounts have the "List of Documents" in the
account manager to show exactly which files PicoSearch has found. Look
here to find out what happened to files that are missing or that you
did not expect.
Tip: If PicoSearch is trying to index extra URLs that you don't really
want to search, try the "Excluded Paths" feature in your account
manager to skip over unwanted URLs, and then reindex. Bad URLs may
happen for several reasons: file types that PicoSearch doesn't need to
look at (your server may be saying that non-text files are type
text/html, see document types below), broken URLs that you want to
correct on your pages (in this case PicoSearch is helping you find
them), or scripts like a calendar that generate excess URLs (future
calendar years, for example).
Here are some messages that you may see during indexing:
- Duplicate Page: the text of this url has been seen before.
You can try changing the "Remove all duplicate documents" setting in
your account manager's Index Modes. The Duplicate Page message often
means that the same user-friendly warning is being shown on your site
for different urls. It may be an error like "this page doesn't exist",
or a login page (see Authorizations below).
- Excluded URL: PicoSearch was told to skip this document. See Hidden HTML Tricks below.
- Link Out of Bounds: PicoSearch is reminding you that this URL may be close to your Entry Points but is beyond the current Link Restrictions. See below.
- Forbidden (403): your server doesn't want to let PicoSearch have this page. Check your Authorizations below.
- Not Found (404): your server doesn't have this page; it's a bad link.
- Internal Server Error (500): your server isn't serving the
page. If this is happening intermittently, maybe your server wants
PicoSearch to ask for pages more slowly. Set the "Delay between
document fetches" in your account manager's Index Modes.
- No Data Found: there wasn't any text in the page. Paying accounts
can index PDFs, but a PDF made up entirely of scanned graphics has no
text to index, so you could see this message. See Document types below.
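To see why one repeated warning page can produce many Duplicate Page messages, here is a minimal Python sketch of duplicate detection by text fingerprint. It is an assumption for illustration only: the function find_duplicates and the whitespace and case normalization are hypothetical, and PicoSearch's real duplicate test may work differently.

```python
import hashlib


def find_duplicates(pages):
    """Map each duplicate URL to the first URL that had the same text.

    `pages` is a dict of URL -> extracted page text, in crawl order.
    """
    first_seen = {}  # fingerprint -> first URL with that text
    duplicates = {}  # duplicate URL -> original URL
    for url, text in pages.items():
        # Collapse whitespace and case so trivial differences don't matter.
        normalized = " ".join(text.split()).lower()
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in first_seen:
            duplicates[url] = first_seen[fingerprint]
        else:
            first_seen[fingerprint] = url
    return duplicates
```

If two different URLs both return the same "this page doesn't exist" text, the second one is reported as a duplicate of the first, even though the URLs differ.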
- Check your Entry Points: Your Entry Point URLs start
PicoSearch on its journey through your website. PicoSearch can only
follow links starting from your own webpages. This powerful feature
should be sufficient, and it will help you see your website from your
visitors' point of view.
Entry Points must be fully-qualified URLs, like "http://www.mysite.com/".
If the Entry Point is a directory then it has to end in a slash (/), and
this has to bring up an actual file in a browser (most likely it'll be
index.html). An Entry Point can also be a specific file anywhere in your
site, of course. In that case, be sure that it has no ending slash, and
that it contains the necessary URL links if it's supposed to lead to
more files.
If you have a Free PicoSearch account, then you must be frugal with your
Entry Points, since you only get 3 of them, while the Professional and
Premium accounts can have any number. Entry Points always get indexed,
and they determine all the documents going forward because (1) they
contain the links that will be followed, and (2) the names of the Entry
Points limit the range of the search engine (see Link Restrictions).
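The Entry Point rules above can be checked before you save your settings. The following Python sketch is a hypothetical helper, not part of PicoSearch: the name entry_point_problems is made up, and the "looks like a directory" test is only a guess based on the last path segment having no file extension.

```python
from urllib.parse import urlsplit


def entry_point_problems(url):
    """Return a list of likely problems with an Entry Point URL."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return ["not a fully-qualified URL"]
    problems = []
    last_segment = parts.path.rsplit("/", 1)[-1]
    if parts.path == "":
        problems.append("missing trailing slash after the host name")
    elif last_segment and "." not in last_segment and not parts.query:
        # Heuristic only: a final segment with no extension is probably
        # a directory, which needs a trailing slash.
        problems.append("looks like a directory without a trailing slash")
    return problems
```

An empty list means the URL passes these simple checks. It does not prove that the URL brings up an actual file, so you should still open it in a browser.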
- Check your Directory, Server, or Domain (Link) Restrictions:
Your site may be connected to many other sites through its links. Link
restrictions are for keeping your search engine within just the internet
addresses that you intend.
If you are missing files (or getting too many), check that the links
are literally and exactly within the range of your Entry Points as
prescribed by your link restrictions. "Directory Restriction" means at
or below the directory of your Entry Points, "Server" means anywhere
below the full HTTP address (http://www.dot.com/), and "Domain" means
anything under your domain (dot.com). None of these allow your search
engine to go into other websites automatically - for that, you need to
use a Portal list or the extra entry-points of a Professional account.
So for example, if you've been loose in referring to your links with
and without the "www." in the URLs, you'll either need to use domain
restriction, or stay with directory restriction and add both address
variations as Entry Points. Otherwise you may get the message
[Link Out of Bounds] when you reindex, which means this URL is close
but not close enough to match.
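The three restriction levels can be pictured with a short Python sketch. It is an approximation under stated assumptions, not PicoSearch's actual matching code: the function in_bounds and its mode names are hypothetical, and the domain is derived simply by dropping a leading "www." from the Entry Point's host name.

```python
from urllib.parse import urlsplit


def in_bounds(url, entry_point, mode):
    """Check a link against one Entry Point under a restriction mode.

    mode is "directory", "server", or "domain" (hypothetical names).
    """
    link = urlsplit(url)
    entry = urlsplit(entry_point)
    link_host = (link.hostname or "").lower()
    entry_host = (entry.hostname or "").lower()
    if mode == "domain":
        # Simplification: treat the host minus a leading "www." as the domain.
        domain = entry_host[4:] if entry_host.startswith("www.") else entry_host
        return link_host == domain or link_host.endswith("." + domain)
    if link_host != entry_host:
        return False  # "www.mysite.com" and "mysite.com" are different servers
    if mode == "server":
        return True
    if mode == "directory":
        # Directory of the Entry Point: everything up to the last slash.
        directory = entry.path[: entry.path.rfind("/") + 1] or "/"
        return (link.path or "/").startswith(directory)
    raise ValueError("unknown mode: " + mode)
```

With the Entry Point "http://www.mysite.com/", a link to "http://mysite.com/a.html" fails under "directory" and "server" but passes under "domain", which is the "www." situation described above.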
- Check your Links: PicoSearch will only find the documents
(files) that it can get to by the URL links that start from your Entry
Points. So if web documents aren't linked, then they won't get indexed.
If links are broken or have dubious syntax (like "/../../index.html")
the addresses may not resolve to a document.
Some links may not get indexed if the server for them is too slow or if
it is down frequently. In PicoSearch's indexing output, this may show up
as a document that is tried but never visited and added, as a "not
found" warning, or as a duplicate document that is removed (the
duplicate being the same minimal error message over and over). If a URL
is consistently a problem, you may wish to remove it. If some URLs are
just intermittently not found, you may wish to force PicoSearch to retry
Not Found documents by setting that option in the Index Modes section of
your account manager.
If you are using a hyperlinking style that is much trickier than plain
HTML, PicoSearch may not be able to follow the links. PicoSearch is
primarily an HTML reader, and unfortunately the trend is toward some
very obscure linking styles, including some that are automatically
embedded by certain you-thought-this-would-make-everything-better
webpage editors. If PicoSearch needs help to see your links, then you
should try the ADDSEARCHLINKS tag, which is easy to use.
- Check your Host's Address Resolutions: Some web hosters can
do funny things with the addresses of your webpages. You type in a
simple domain name, and suddenly your browser is really seeing a long
funny name that, if PicoSearch ignores it, isn't so funny anymore. You
may also have mirror sites in your links, or machine load balancing
solutions that play with the server names (so that
http://eniac1.mysite.com and http://eniac2.mysite.com really go to the
same site). To handle these cases, you probably have to add some extra
Entry Points. These are not so much to reach new documents (keep
'duplicate removal' turned on) but just to tell PicoSearch what name
variations to expect. We've even seen a web hoster who generated new
names for a person's links every time we visited the site - names that
would expire in a matter of minutes! It was a mess not just for
PicoSearch but any visitor hoping to set some bookmarks to the pages, so
naturally we advised the webmaster to switch hosters. There are more
fish in the sea, you know (see reviewers like CNET for recommendations on good web hosting).
- Check your Authorizations: PicoSearch is like any visitor
with a browser. If you can't see a webpage in your browser, how is
PicoSearch going to index it? Well, it can't, and the most common
problem here is that people forget that parts of their site are password
protected. Fortunately, PicoSearch can be instructed to index
directories that are protected by the standard HTTP protocol, as well as
most protected by cookies and ASP.
Just give PicoSearch the directory name with a login name and password
in your account manager, and index your site again. Note that all of
this is done over the Internet, of course, so if you have a firewall or
intranet that doesn't let people see you from the outside, then
PicoSearch can't index you either.
- Check your Internet Connection: This may be obvious, but if your
internet connection is down or overloaded, then PicoSearch is going to
have problems reaching you. This point is important to consider if you
are trying to index at a very busy time of day. You might want to try
another time, especially if you are a paid account setting your
scheduled reindexer. And if you are building a mini-portal with a huge
portal list, your grab bag of URLs may include some nasties that are
slowing you down too (an online indexing run could reveal the culprit,
if you have the patience).
A less obvious problem would be if you can connect to your site, but
PicoSearch can't. This may happen if your hoster and server are blocking
all user agents except for the standard internet browsers. This means
that while your Netscape or Explorer is allowed to see your site, the
PicoSearch robot isn't. If you try online indexing and get strange
messages like [Forbidden] for your documents, tell your hoster that they
must allow the robot "PicoSearch/1.0".
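One way to test for user-agent blocking yourself is to request a page while announcing the robot's name instead of a browser's. The Python sketch below assumes the robot sends exactly "PicoSearch/1.0" as its User-Agent, as quoted above; the real header may carry more detail. The helper names build_robot_request and status_as_robot are hypothetical.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

ROBOT_AGENT = "PicoSearch/1.0"  # robot name quoted in the FAQ text


def build_robot_request(url, agent=ROBOT_AGENT):
    """Build a request that identifies itself as the robot, not a browser."""
    return Request(url, headers={"User-Agent": agent})


def status_as_robot(url, agent=ROBOT_AGENT):
    """Return the HTTP status code the server gives to the robot's name."""
    try:
        with urlopen(build_robot_request(url, agent), timeout=10) as response:
            return response.status
    except HTTPError as error:
        return error.code  # e.g. 403 if the server blocks this user agent
```

If status_as_robot returns 403 for a page that loads normally in your browser, the server is probably filtering by user agent, and that is the detail to pass on to your hoster.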
- Check your Hidden HTML Tricks: Then there are the things
people do and forget about, because the end result looks fine in a
browser. PicoSearch is not exactly a browser however. So problems can
include:
- adding tags that cause PicoSearch to skip a webpage, and then
wondering why that webpage and all its linked children are missing (see file skipping controls that PicoSearch is sensitive to).
- relying on client-side JavaScript webpage generation (i.e. the
browser is supposed to make the HTML page at the last possible minute).
Then PicoSearch will only see the JavaScript, and you may have to
invisibly add text back in just for PicoSearch to pick up.
- blocking your whole website with the robots.txt file, which
PicoSearch obeys. If nothing is getting indexed at all, this may be the
cause. In that case, you may have to contact your webhoster or server
team to add PicoSearch to the authorized visitors list.
- using a URL redirection that PicoSearch can't get past, because
maybe it has fancy browser-dependent logic that PicoSearch will not
follow (quick solution: make another Entry Point for the next link).
- using framesets that cascade extensively, resulting in extra
indexed documents for every frame (solution: if the frameset webpages
contain no real text, they won't come up in searches and can be left
alone; if they do come up as distractions, then you should use a
content skipping tag, but cautiously, so as not to miss a link).
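For the robots.txt case in the list above, you can test a copy of your own file with Python's standard urllib.robotparser module. This is a sketch under one assumption: that the robot matches the robots.txt token "PicoSearch" (taken from the "PicoSearch/1.0" name mentioned earlier). The helper name robot_allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def robot_allowed(robots_txt, url, agent="PicoSearch"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A file that keeps every robot out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# The same file with an exception that lets one named robot in.
ALLOW_PICOSEARCH = """\
User-agent: PicoSearch
Disallow:

User-agent: *
Disallow: /
"""
```

With BLOCK_ALL, robot_allowed reports that the homepage is off limits to every robot. With ALLOW_PICOSEARCH, the named robot is let in while all others stay blocked, which is the kind of change to request from your webhoster.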
We also have FAQs on how you might deliberately skip or add certain
documents. If you need further assistance, please contact us and
specify the exact nature of the problem (which documents you think are
missing, and which webpages have those links).
We like to empower our users. For example, we can make a Professional
"Most Recent Indexing Log" available even to a free account that is
having difficulties, so its owner can track offline indexing precisely.