Help with PicoSearch

Why didn't PicoSearch find all of my files (documents)?

PicoSearch starts from your Entry Points and follows URL links, finding all the files (documents) that it can, just like any user. It can only work this way because PicoSearch cannot mind-read your server to know what files are lying around. PicoSearch works online only, just like a person visiting your site. Your Entry Points are URLs that can be your homepages, or webpages of links prepared just for PicoSearch.

If PicoSearch is finding too many or too few files, relax. PicoSearch is very flexible, and it will serve you well once you determine the settings you need. There must be a good explanation, and rest assured you are not the first person to have your problem. We've helped many people before.
 
In general, don't be afraid to experiment with PicoSearch's settings to find just the files you want. You can reindex as often as you like. The following suggestions will give you ideas to try:
  1. Check your File and its Content-Type:
       Is it a file that your PicoSearch can index? See details on types below.
       Is the file correctly formed? For example, in HTML the <head> and <body> tags must be in the right order for PicoSearch to index the body text.
       By default PicoSearch indexes HTML and plain text files, and it can index other formats too. This means that the HTTP Content-Type field returned by your site's server must be "text/plain" or "text/html", which it usually is, or the correct type for another format. A typical problem is a hoster that leaves the HTTP field blank. You may not notice this directly, because your browser compensates, but if PicoSearch rejects your webpage for this reason then you will have to ask your hoster to fix it (and they should).
       Flash Shockwave sites look nice but can be hard for search engines; see our suggestions. You can index other file types including .pdf, .doc, .xls, .ps, and .rtf, but most of these are available only to Professional and Premium Accounts. Some licensed formats like PDF require that we honor copy protection, so we cannot index PDF files that have copy protection turned on. For the latest on which file types are available to free versus paying accounts, see "Can PicoSearch index files besides HTML?".
       Also, note that since PicoSearch works over the internet, we cannot connect to your database except through the links on your webpages. So PicoSearch will index your database only if you provide the link calls to cgi scripts, asp, or other methods that convert the database into HTML document views.
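If you want to check a Content-Type value yourself, here is a minimal Python sketch. It is not PicoSearch's own code. The function names and the INDEXABLE_TYPES set are made up for illustration, and the set covers only the two text types named above, since the other formats depend on your account.

```python
# Assumption: only the two text types named in this FAQ are listed here.
# Other formats (PDF, DOC, ...) depend on your account level.
INDEXABLE_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header):
    """Return the bare media type from a Content-Type value, or "" if blank."""
    if not content_type_header:
        return ""
    # Drop parameters such as "; charset=ISO-8859-1" and normalize case.
    return content_type_header.split(";", 1)[0].strip().lower()


def is_indexable(content_type_header):
    """True if the header names one of the text types listed above."""
    return media_type(content_type_header) in INDEXABLE_TYPES


print(is_indexable("text/html; charset=ISO-8859-1"))  # a normal HTML page
print(is_indexable(""))                               # a blank field from the hoster
```

To get the header for a real page, one option is urllib.request.urlopen(url).headers.get("Content-Type"), which returns None when the server sends no Content-Type at all.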

  3. Check your Account Page Maximums: Don't forget that as a Free Account you get only 250 pages total for your search engine account, where one URL document generally equals one PicoSearch page (extra-long HTML/text, or multi-page non-HTML formats like PDFs, may yield more than 1 PicoSearch page per document). Professional and Premium Accounts get 3000 and 6000 pages respectively (you may contact us for pro-rated increments above that).
        Now that PicoSearch supports Guest URLs with individual maximums and Portal Lists with maximum pages for each link of a set of links, these limits must also be considered. For more information on how these mini web portal construction tools work, see our FAQs starting with the announcement What's New? Include Guest URLs and become a mini Web Portal!

  5. Check the Indexing Progress:
        Is PicoSearch done indexing? Did it find a link to the page?
        Your most valuable tool will be the indexing progress reporting, seen either in online indexing (if online doesn't take too long for you), or the "Most Recent Indexing Log" of any paid account's reporting section. All accounts have the "List of Documents" in the account manager to show exactly which files PicoSearch has found. Look here to see what happened to files that are missing, or to spot extra files that you did not expect.
        Tip: If PicoSearch is trying to index extra urls that you don't really want to search, try the "Excluded Paths" feature in your account manager to skip over unwanted urls, and then reindex. Bad urls may happen because of file types that PicoSearch doesn't need to look at (your server may be saying that non-text files are type text/html, see document types below), or because of broken urls that you want to correct on your pages (in this case PicoSearch is helping you find them), or because of scripts like a calendar that generate excess urls (future calendar years for example).

        Here are some messages that you may see during indexing:
    • Duplicate Page: the text of this url has been seen before. You can try changing the "Remove all duplicate documents" setting in your account manager's Index Modes. The Duplicate Page message often means that the same user-friendly warning is being shown on your site for different urls. It may be an error like "this page doesn't exist", or a login page (see Authorizations below).

    • Excluded URL: PicoSearch was told to skip this document, see Hidden HTML Tricks below.

    • Link Out of Bounds: PicoSearch is reminding you that this URL may be close to your Entry Points but it is beyond the current Link Restrictions, see below.

    • Forbidden (403): your server doesn't want to let PicoSearch have this page. Check your Authorizations below.

    • Not Found (404): your server doesn't have this page, it's a bad link.

    • Internal Server Error (500): your server isn't serving the page. If this is happening intermittently, maybe your server wants PicoSearch to ask for pages more slowly. Set the "Delay between document fetches" in your account manager's Index Modes.

    • No Data Found: there wasn't any text in the page. If you're a paying account then you can index PDFs, but if it's all scanned graphics then you could see this message. See Document types below.
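To see why the same error page shown at several urls ends up as a Duplicate Page, here is a small Python sketch. How PicoSearch really compares page text is not described here, so the whitespace and case normalization and the SHA-256 hash are assumptions, and the function names and sample urls are hypothetical.

```python
import hashlib


def text_fingerprint(text):
    """Hash page text after collapsing whitespace and ignoring case (assumed rule)."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Given {url: text}, map each repeated url to the first url with the same text."""
    seen = {}
    duplicates = {}
    for url, text in pages.items():
        digest = text_fingerprint(text)
        if digest in seen:
            duplicates[url] = seen[digest]
        else:
            seen[digest] = url
    return duplicates


pages = {
    "http://www.mysite.com/": "Welcome to my site",
    "http://www.mysite.com/old.html": "This page doesn't exist",
    "http://www.mysite.com/gone.html": "This  page doesn't exist\n",
}
duplicates = find_duplicates(pages)
print(duplicates)
```

Here the two "doesn't exist" pages share a fingerprint, so the second url is reported as a duplicate of the first.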

  6. Check your Entry Points: Your Entry Point URLs start PicoSearch on its journey through your website. PicoSearch can only follow links starting from your own webpages. This powerful feature should be sufficient, and it will help you see your website from your visitors' point of view.
        Entry Points must be fully-qualified URLs, like "http://www.mysite.com/". If the Entry Point is a directory then it has to end in a slash (/), and it has to bring up an actual file in a browser (most likely index.html). An Entry Point can also be a specific file anywhere in your site, of course. In that case be sure that it has no ending slash, and that it contains the URL links needed to reach any further files.
        If you're a Free PicoSearch account, then you must be frugal with your Entry Points, since you only get 3 of them, while the Professional and Premium accounts can have any number. Entry Points always get indexed, and they determine all the documents going forward because (1) they contain the links that will be followed, and (2) the names of the Entry Points limit the range of the search engine (see Link Restrictions).
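The Entry Point rules above can be sketched as a quick self-check in Python. This is only a guess at a useful check, not what PicoSearch runs: the function name is hypothetical, accepting "https" as well as "http" is an assumption, and so is the rule that a last path segment without a dot is probably a directory.

```python
from urllib.parse import urlsplit


def check_entry_point(url):
    """Return a list of likely problems with an Entry Point URL (empty if none)."""
    parts = urlsplit(url)
    # Assumption: both http and https count as fully-qualified here.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return ["not a fully-qualified URL"]
    problems = []
    last_segment = parts.path.rsplit("/", 1)[-1]
    # Heuristic (assumption): no dot in the last segment suggests a directory.
    if parts.path == "" or (last_segment and "." not in last_segment):
        problems.append("looks like a directory but does not end in a slash")
    return problems


for candidate in ("http://www.mysite.com/",
                  "www.mysite.com",
                  "http://www.mysite.com/docs",
                  "http://www.mysite.com/docs/index.html"):
    print(candidate, check_entry_point(candidate))
```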

  8. Check your Directory, Server, or Domain (Link) Restrictions: Your site may be connected to many other sites through its links. Link restrictions are for keeping your search engine within just the internet addresses that you intend.
        If you are missing files (or getting too many), check that the links are literally and exactly within the range of your Entry Points as prescribed by your link restrictions. "Directory Restriction" means at or below the directory of your Entry Points, "Server" is below the full http address (http://www.dot.com/), and "Domain" is anything under your domain (dot.com). None of these allow your search engine to go into other websites automatically - for that, you need to use a Portal list or the extra entry-points of a Professional account.
        So for example, if you've been loose about writing your links with and without the "www." in the URLs, you'll either need to use domain restriction, or stay with directory restriction and add both address variations as Entry Points. Otherwise you may get the message [Link Out of Bounds] when you reindex, which means this URL is close but not close enough to match.
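Here is one way to picture the three restrictions, as a Python sketch. It is an interpretation of the descriptions above and not PicoSearch's actual matching code: the function name is hypothetical, you pass your domain in by hand, and ports and http/https differences are ignored.

```python
from urllib.parse import urlsplit


def in_bounds(link, entry_point, restriction, domain):
    """Guess whether a link falls inside a restriction (illustrative only)."""
    link_parts = urlsplit(link)
    entry_parts = urlsplit(entry_point)
    link_host = (link_parts.hostname or "").lower()
    entry_host = (entry_parts.hostname or "").lower()
    if restriction == "domain":
        # Anything under the domain, with or without "www." or other prefixes.
        return link_host == domain or link_host.endswith("." + domain)
    if link_host != entry_host:
        return False
    if restriction == "server":
        return True
    if restriction == "directory":
        # At or below the directory that holds the Entry Point.
        entry_dir = entry_parts.path.rsplit("/", 1)[0] + "/"
        return (link_parts.path or "/").startswith(entry_dir)
    raise ValueError("unknown restriction: " + restriction)


entry = "http://www.mysite.com/docs/"
no_www = "http://mysite.com/docs/a.html"
print(in_bounds(no_www, entry, "directory", "mysite.com"))  # the "www." mismatch case
print(in_bounds(no_www, entry, "domain", "mysite.com"))
```

In this sketch the link without "www." fails the directory check and passes the domain check, which is the situation that shows up as [Link Out of Bounds].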

  10. Check your Links: PicoSearch will only find the documents (files) that it can get to by the URL links that start from your Entry Points. So if web documents aren't linked, then they won't get indexed. If links are broken or have dubious syntax (like "/../../index.html") the addresses may not resolve to a document.
        Some links may not get indexed if the server for them is too slow or if it is down frequently. You may see this in PicoSearch's indexing output as the document being tried but then never visited and added, or a "not found" warning, or a duplicate document that is removed (the duplicate document being the same minimal error message over and over). If a URL is consistently a problem for your crawler, you may wish to remove it. If some URLs are just intermittently not found, you may wish to force PicoSearch to retry Not Found documents by setting that option in the Index Modes section of your account manager.
        If you are using a hyperlinking style that is much trickier than plain HTML, PicoSearch may not be able to follow the links. PicoSearch is primarily an HTML reader, and unfortunately the trend is toward some very obscure linking styles, including some that are automatically embedded by certain you-thought-this-would-make-everything-better webpage editors. If PicoSearch needs help to see your links, then you should try the ADDSEARCHLINKS tag, which is easy to use.
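To get a feel for what a plain HTML reader can and cannot see, here is a Python sketch that collects ordinary <a href> links and resolves them against the page address. It uses Python's standard HTML parser as a stand-in, so it only approximates how any real crawler behaves. The class name, function name, and sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of plain <a href="..."> links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into fully-qualified URLs.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


page = """
<a href="about.html">About</a>
<a href="/products/list.html">Products</a>
<span onclick="window.location='hidden.html'">Scripted link</span>
"""
links = extract_links(page, "http://www.mysite.com/docs/index.html")
print(links)
```

The scripted link never shows up in the list, because it is not an HTML link at all. That is the kind of link a reader like this needs extra help to find.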

  12. Check your Host's Address Resolutions: Some web hosters can do funny things with the addresses of your webpages. You type in a simple domain name, and suddenly your browser is really seeing a long funny name that, if PicoSearch ignores it, isn't so funny anymore. You may also have mirror sites in your links, or machine load balancing solutions that play with the server names (so that http://eniac1.mysite.com and http://eniac2.mysite.com really go to the same site). To handle these cases, you probably have to add some extra Entry Points. These are not so much to reach new documents (keep 'duplicate removal' turned on) but just to tell PicoSearch what name variations to expect. We've even seen a web hoster who generated new names for a person's links every time we visited the site - names that would expire in a matter of minutes! It was a mess not just for PicoSearch but any visitor hoping to set some bookmarks to the pages, so naturally we advised the webmaster to switch hosters. There are more fish in the sea, you know (see reviewers like CNET for recommendations on good web hosting).

  14. Check your Authorizations: PicoSearch is like any visitor with a browser. If you can't see a webpage in your browser, how is PicoSearch going to index it? Well it can't, and the most common problem here is that people forget that parts of their site are password protected. Fortunately, PicoSearch can be instructed to index directories that are protected by the standard HTTP protocol, as well as most cookies and ASP. Just give PicoSearch the directory name with a login name and password in your account manager, and index your site again. Note that of course this is all done over the Internet, so if you have a firewall or intranet that doesn't let people see you from the outside, then PicoSearch can't index you either.
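For the curious, here is what a login looks like on the wire, as a Python sketch. It assumes that "the standard HTTP protocol" above means HTTP Basic authentication, which this page does not spell out. The function name is hypothetical, and the sample name and password are the well-known example from the HTTP authentication specification, not real credentials.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic "Authorization" request header."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("Aladdin", "open sesame")
print(header)  # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Any visitor, robot or browser, that sends a valid header like this with its request gets the protected page. Without it, the server answers with a login challenge instead of the document.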

  16. Check your Internet Connection: Well maybe this is obvious, but if your internet connection is down or overloaded, then PicoSearch is going to have problems reaching you. This point is important to consider if you are trying to index at a very busy time of day. You might want to try another time, especially if you are a paid account setting your scheduled reindexer. And if you are building a mini-portal with a huge portal list, your grab bag of URLs may include some nasties that are slowing you down too (online indexing could show up the culprit, if you have the patience).
       A less obvious problem would be if you can connect to your site, but PicoSearch can't. This may happen if your hoster and server are blocking all user agents except for the standard internet browsers. This means that while your Netscape or Explorer is allowed to see your site, the PicoSearch robot can't. If you try online indexing and get strange messages like [Forbidden] for your documents, tell your hoster that they must allow the robot "PicoSearch/1.0".
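If you suspect user-agent blocking, you can imitate the robot's name yourself. The Python sketch below only builds the request. The function name is hypothetical, and the real robot may send other headers that this sketch does not reproduce.

```python
import urllib.request

ROBOT_AGENT = "PicoSearch/1.0"  # the robot name quoted in this FAQ


def build_robot_request(url):
    """Prepare a request that identifies itself with the robot's user agent."""
    return urllib.request.Request(url, headers={"User-Agent": ROBOT_AGENT})


request = build_robot_request("http://www.mysite.com/")
print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen() sends it. If that raises urllib.error.HTTPError with code 403 while your browser loads the same URL fine, the server is probably filtering on the user agent.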

  18. Check your Hidden HTML Tricks: Then there are the things people do and forget about, because the end result looks fine in a browser. PicoSearch is not exactly a browser however. So problems can include:
    • adding tags that cause PicoSearch to skip a webpage, and then wondering why that webpage and all its linked children are missing (see file skipping controls that PicoSearch is sensitive to).
    • relying on client-side JavaScript webpage generation (i.e. the browser is supposed to make the HTML page at the last possible minute). Then PicoSearch will only see the JavaScript, and you may have to invisibly add text back in just for PicoSearch to pick up.
    • if nothing is getting indexed at all, maybe your whole website is off limits according to the robots.txt file, which PicoSearch obeys. In that case, you may have to contact your webhoster or server team to add PicoSearch to the authorized visitors list.
    • using a URL redirection that PicoSearch can't get past, because maybe it has fancy browser-dependent logic that PicoSearch will not follow (quick solution: make another Entry Point for the next link).
    • using framesets that cascade extensively, resulting in extra indexed documents for every frame (solution: either the frameset webpages don't matter because they say nothing anyway so they won't come up in searches, or if they do come up as distractions then you should use a content skipping tag, but cautiously so as to not miss a link).
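The robots.txt case is easy to test offline. This Python sketch feeds two sample robots.txt files to the standard library's parser and asks whether the robot may fetch a page. The user-agent token "PicoSearch" in the second file is an assumption about what the robot matches, and the function name and sample files are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def robot_allowed(robots_txt, url, agent="PicoSearch/1.0"):
    """Ask Python's robots.txt parser whether the agent may fetch the url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Whole site off limits to every robot.
BLOCK_EVERYONE = """\
User-agent: *
Disallow: /
"""

# Same, but with an exception (assumed token) for PicoSearch.
ALLOW_PICOSEARCH = """\
User-agent: PicoSearch
Disallow:

User-agent: *
Disallow: /
"""

url = "http://www.mysite.com/index.html"
print(robot_allowed(BLOCK_EVERYONE, url))
print(robot_allowed(ALLOW_PICOSEARCH, url))
```

With the first file nothing can be indexed. The second file shows the kind of exception your webhoster or server team would add.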


We also have FAQs on how you might deliberately skip or add certain documents. And if you need further assistance, please specify the exact nature of the problem (which documents you think are missing, which webpages have those links) and contact us. We like to empower our users, so for example we can make a Professional "Most Recent Indexing Log" available even to a free account if they are having difficulties, so they can track their own offline indexing precisely.
