Help with PicoSearch

Can PicoSearch index and search files besides HTML? (PDF, DOC, ASP, etc.)

Yes, PicoSearch can search many formats.
 
To begin with, PicoSearch will index any plain text or HTML file. Technically, this means files that come over the http protocol as being of content-type "text/html" or "text/plain". This should cover pages that come from ASP and other scripts.
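
As a rough illustration of the content-type rule above, the following Python sketch checks whether an HTTP Content-Type header declares one of the two basic types. The helper names are hypothetical and this is not PicoSearch's own code; it only shows that the decision depends on the type the server declares, not on the extension in the URL.

```python
# Illustrative only: hypothetical helper names, not PicoSearch's own code.
BASIC_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_basic_indexable(content_type_header: str) -> bool:
    """True when the server declares the response as HTML or plain text."""
    return media_type(content_type_header) in BASIC_TYPES


if __name__ == "__main__":
    # An ASP page served as HTML qualifies regardless of its .asp extension.
    print(is_basic_indexable("text/html; charset=utf-8"))
    print(is_basic_indexable("application/pdf"))
```

A script-generated page therefore qualifies as long as its response header says "text/html" or "text/plain".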
 
Additionally, PicoSearch keeps getting smarter about more file types! Here is a current list of what PicoSearch can index. Remember that your search engine is also subject to a page limit: one URL document (file or webpage) generally counts as one PicoSearch page, although extra-long HTML or text files, and multi-page non-HTML formats such as PDFs, may count as more than one PicoSearch page per document. If you find yourself needing more pages, see our plan rates and services.
 
For All Accounts
(Free as well as Professional and Premium Accounts!)
  • HTML files (.html types including .htm and .shtml, or content-type "text/html")
    PicoSearch will index your HTML files, including any generated by server scripts such as ASP, CGI Perl, etc. This feature is on by default. You can turn on/off the titles, meta-tags, image alt tags, and even the whole page's body to get different searching effects; see the Index Modes section of your Account Manager's indexing topics. (If your HTML files aren't indexing as you expected, consider the FAQ on Finding all of your Pages.)

  • Plain Text files (.txt, or content-type "text/plain")
     PicoSearch will index your plain text files. This feature is on by default, and you can turn it off in the Index Modes section of your Account Manager's indexing topics.

  • XML files (.xml, or content-type "text/xml")
     PicoSearch will index the text (not tags) of your XML files. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
    Note: Delivery of results in XML with DTD is a separate feature that is available for paid accounts; see the FAQ.

  • Shockwave Flash Files (.swf or content-type "application/x-shockwave-flash")
     PicoSearch will follow the links in your Shockwave Flash files, so you can create exciting navigation for your site using Adobe (formerly Macromedia) Flash tools. PicoSearch will also try to extract the text fields of your Shockwave Flash files, which may result in additional pages.
          We do not recommend building your site entirely in Flash, because this often makes it impossible to search, even though PicoSearch uses the latest official Flash software to read your files. For compatibility with search engines, the best use of Flash is to embed Flash elements within HTML files. If you do have a site that is all Flash and the text isn't being found, PicoSearch will search any HTML content found between the <noscript> ... </noscript> tags, so those tags are a good practice for a site that cannot be easily redesigned for greater compatibility.
          Shockwave Flash text indexing can be turned on/off in the Additional Formats section of your Account Manager's indexing topics.

  • MP3 Files (.mp3 or content-type "audio/mpeg")
     PicoSearch will index the song title, artist, album, and other text tags in your MP3 files, provided they are stored in ID3 tag format v1.0 or v1.1. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • MIDI Files (.midi or .mid or content-type "audio/midi")
     PicoSearch will index your MIDI files in two ways. First, the name of the file will be indexed, as "song name: filename.mid". Second, the text events of the MIDI standard will all be indexed. These are the event codes 1-7, used respectively for a general text event, copyright info, track name, track instrument name, lyric, marker, and cue. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
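
To make the MIDI text events in the list above more concrete, here is a minimal Python sketch that decodes a single text event. In the Standard MIDI File format, such an event is the byte 0xFF, then the type code (1-7), then a variable-length size, then the text. The function names are hypothetical, and this is not PicoSearch's parser: a real reader would also walk the track chunks and skip delta-times and non-text events.

```python
from typing import Optional, Tuple

# Meta event type codes 1-7 from the Standard MIDI File format.
TEXT_EVENT_NAMES = {
    0x01: "text",
    0x02: "copyright",
    0x03: "track name",
    0x04: "instrument name",
    0x05: "lyric",
    0x06: "marker",
    0x07: "cue point",
}


def read_varlen(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode a MIDI variable-length quantity; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # high bit clear marks the last byte
            return value, pos


def decode_text_event(data: bytes, pos: int = 0) -> Optional[Tuple[str, str]]:
    """Decode one meta event that starts at pos (the 0xFF status byte).

    Returns (kind, text), or None when the event is not a text event.
    """
    if data[pos] != 0xFF or data[pos + 1] not in TEXT_EVENT_NAMES:
        return None
    kind = TEXT_EVENT_NAMES[data[pos + 1]]
    length, start = read_varlen(data, pos + 2)
    # The MIDI file format does not fix a text encoding; Latin-1 is an assumption.
    return kind, data[start:start + length].decode("latin-1")


if __name__ == "__main__":
    event = bytes([0xFF, 0x03, 0x05]) + b"Intro"
    print(decode_text_event(event))  # ('track name', 'Intro')
```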

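For the MP3 item in the list above, the ID3 v1.0 and v1.1 formats store their text in a fixed 128-byte block at the very end of the file, beginning with the letters "TAG". The Python sketch below reads that block. The function name is hypothetical and the Latin-1 decoding is an assumption; it illustrates the tag layout and is not PicoSearch's own reader.

```python
def parse_id3v1(data: bytes):
    """Parse an ID3v1/v1.1 tag from the last 128 bytes of MP3 data.

    Returns a dict of text fields, or None when no tag is present.
    """
    tag = data[-128:]
    if len(tag) < 128 or tag[:3] != b"TAG":
        return None

    def text(raw: bytes) -> str:
        # Fields are fixed-width and padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    fields = {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
    }
    # ID3v1.1: a zero byte at offset 125 followed by a nonzero track number.
    if tag[125] == 0 and tag[126] != 0:
        fields["comment"] = text(tag[97:125])
        fields["track"] = tag[126]
    return fields
```

Tags written in the newer ID3v2 format sit at the start of the file instead and are not covered by this sketch.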

For Professional and Premium Accounts only
  • Adobe PDF (.pdf or content-type "application/pdf")
     PicoSearch will index the text of your Adobe Acrobat PDF documents. Title and meta properties will be picked up where possible for display in search results, or can be controlled from the parent link.
       If the title property is blank, the file name should be used instead. You may see strange default titles if you are exporting from another application, in which case you need to set the Adobe title property to something better. The Adobe title and keywords of the document properties can also become part of the searchable document; see the options in the Index Modes section of your Account Manager. PDF indexing is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
       Maximum Pages: Each PDF counts as its full number of pages, so large PDFs can stretch account page limits. This policy had to be enforced because too many sites were hosting massive PDFs of hundreds or even thousands of pages each. To help you control your PDFs, there is a page limiter option available in the Additional Formats section of the Account Manager. NOTE: If you are still having difficulty controlling your PDF sizes, please contact us for an alternate average page length counting formula that may be to your benefit.
       Copy Protection: By default, PicoSearch honors the Acrobat security profile with which your files have been saved, and will not index files that you have copy-protected. A separate option lets you include copy-protected PDFs. Thus, the two common reasons why a PDF file yields no content are that it is entirely graphical, or that it is copy-protected and the option to include copy-protected PDFs is off.

  • MS Word (.doc or content-type "application/msword")
    MS Word Office Open XML (.docx "application/vnd.openxmlformats...")
     PicoSearch will index the text of your Microsoft Word documents, including Office 2007. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics. Title and meta properties will be picked up where possible for display in search results (subject becomes meta description, keywords become meta keywords), or can be controlled from the parent link.

  • MS Excel (.xls or content-type "application/msexcel")
    MS Excel Office Open XML (.xlsx "application/vnd.openxmlformats...")
     PicoSearch will index the text of Microsoft Excel spreadsheets, including Office 2007. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics. Title and meta properties will be picked up where possible for display in search results (subject becomes meta description, keywords become meta keywords), or can be controlled from the parent link.

  • MS PowerPoint (.ppt or content-type "application/mspowerpoint")
    MS PowerPoint Office Open XML (.pptx "application/vnd.openxmlformats...")
     PicoSearch will index the text of Microsoft PowerPoint presentations, including Office 2007. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics. Title and meta properties will be picked up where possible for display in search results (subject becomes meta description, keywords become meta keywords), or can be controlled from the parent link.

  • Rich Text Format (.rtf or content-type "text/rtf" or "application/rtf")
     PicoSearch will index the text of Rich Text Format documents, a format commonly used in Microsoft applications. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics. Title and meta properties will be picked up where possible for display in search results, or can be controlled from the parent link.

  • Adobe PostScript (.ps or content-type "application/postscript")
     PicoSearch will index the text of your Adobe PostScript documents. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics. Title and meta properties will be picked up where possible for display in search results, or can be controlled from the parent link.
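
The two lists above can be condensed into a small lookup. The Python sketch below maps a URL's file extension to the account level that indexes it, using only the extensions named in this FAQ. The function and set names are hypothetical, and the extension is only a hint: the content-type the server sends also determines how a document is treated, so a script URL such as .asp can still be indexed as HTML.

```python
import posixpath
from urllib.parse import urlparse

# Extensions taken from the lists above; names here are illustrative only.
ALL_ACCOUNTS = {".html", ".htm", ".shtml", ".txt", ".xml",
                ".swf", ".mp3", ".midi", ".mid"}
PAID_ONLY = {".pdf", ".doc", ".docx", ".xls", ".xlsx",
             ".ppt", ".pptx", ".rtf", ".ps"}


def required_account(url: str) -> str:
    """Guess from the URL's extension which account level indexes the file."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in ALL_ACCOUNTS:
        return "all accounts"
    if ext in PAID_ONLY:
        return "Professional or Premium"
    return "unknown (depends on content-type)"


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/docs/Report.PDF?v=2",
                "http://example.com/catalog.asp"):
        print(url, "->", required_account(url))
```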

Title and Meta Trick: Non-HTML documents may have special application attributes that are not in the body of the document. PicoSearch can usually index PDF titles and metas with no problem, but for other formats it may not pick them up. In that case the title defaults to the URL (see the switch "Show Just File Name if Title Defaults to URL" under Configure Results in the Account Manager), the meta description is the first few lines of the document, and the keywords are those found in the text. But since you can also set titles and metas from the parent link, you can make a reference page of links to your non-HTML documents, with titles and metas as you want them displayed in PicoSearch, and make this page the first Entry Point for full central control.


