Character set becomes an important issue only if it's not working for you. Most world languages are going to be indexed just fine by PicoSearch,
so you won't even have to think about it. The PicoSearch results display
has been translated into many languages; see the Results Language
setting of your account manager, and please feel free to contact us to request a new language.
PicoSearch will search all single-byte character set languages. This includes the non-Asian languages, and some Asian languages as well.
But Asian languages which have hundreds or thousands of glyphs must
use double-byte character sets, and these are supported individually by
PicoSearch only as they are developed, including for the concordance
results. Check the Alternate Character Options section of the Indexing
Topics in your Account Manager for these major choices as they become
available.
Single-byte non-Western: Usually you'll be fine. If you have a
language with a majority of non-western characters, such as Arabic or
Cyrillic or Hebrew, you may find that the Exact Phrase searching mode
works best to prevent extra results; see Any/All/Exact Initializing.
Also, if you intend to search characters outside ISO-Latin1 in PDFs
or other special filtered formats, be sure to test the searching first
to your satisfaction. You can request a trial version of PicoSearch
Professional to do this; just contact us.
UTF-8 Support: UTF-8 Unicode
is increasingly popular with hosters because it can include any language
rather transparently. The cost of this is that UTF-8 is not just
another single-byte character set; plain unaccented Western (ASCII)
characters are a single byte, but accented and non-Latin characters
take two or more bytes. So it may
look good in a browser, but the actual language is less specified, and
language sensitive software like search engines may have more (not
fewer) problems.
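
The varying byte counts are easy to check for yourself. Here is a minimal Python sketch; the helper name `utf8_length` and the sample characters are our own illustration and are not part of PicoSearch.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters chosen for illustration only.
for ch in ["e", "é", "я", "€", "中"]:
    print(f"{ch!r}: {utf8_length(ch)} byte(s) in UTF-8")

# In a single-byte set such as ISO-8859-1 the accented letter is one byte.
print(len("é".encode("iso-8859-1")))
```

The plain letter takes one byte, while the accented, Cyrillic, Euro, and Chinese characters each take more than one.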
For maximum search compatibility, PicoSearch currently supports UTF-8 by
automatic conversion to an equivalent single-byte character set. This
conversion will be transparent to your searchers and only be used during
the search results display, so paying accounts that use UTF-8 on their
website should specify PicoSearch's equivalent set in the template's
http-equiv meta, or simply leave out that meta entirely so the browser
can decide. The default conversion for Western European languages will
be the ISO-8859-1 set, so languages like French, German, Spanish, and
Italian all work perfectly when the search results are displayed, and
the links go back to your site normally. For non-West European
languages that aren't handled by ISO-8859-1, such as Russian, Arabic,
and Hebrew, if you set your account manager's Results Language display
choice before indexing, then UTF-8 will be converted to an appropriate
ISO or Windows set. Your account manager's Alternate Character Options
section will say what character set was decided upon by PicoSearch. Only
the purely double byte languages like Chinese and Japanese won't work
even though UTF-8 made them seem as easy as a one byte language.
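
To picture what such a conversion involves, here is a rough Python sketch. It is not PicoSearch's actual code: the function name `convert_for_display` and the language-to-set table are assumptions made up for this example, and the sets PicoSearch really picks are the ones reported in your account manager.

```python
# Hypothetical mapping from a Results Language choice to a single-byte set.
# These pairings are assumptions for illustration only.
LANGUAGE_TO_CHARSET = {
    "Russian": "windows-1251",
    "Arabic": "iso-8859-6",
    "Hebrew": "iso-8859-8",
}
DEFAULT_CHARSET = "iso-8859-1"  # default for Western European languages


def convert_for_display(utf8_bytes: bytes, results_language: str):
    """Convert UTF-8 bytes to the single-byte set picked for a language.

    Raises UnicodeEncodeError when the text contains characters that the
    target set cannot represent, as happens with Chinese or Japanese.
    """
    charset = LANGUAGE_TO_CHARSET.get(results_language, DEFAULT_CHARSET)
    text = utf8_bytes.decode("utf-8")
    return charset, text.encode(charset)


print(convert_for_display("café".encode("utf-8"), "French"))
print(convert_for_display("привет".encode("utf-8"), "Russian"))

try:
    convert_for_display("中文".encode("utf-8"), "French")
except UnicodeEncodeError as err:
    print("no single-byte equivalent:", err.reason)
```

The French and Russian samples each come out at one byte per character, while the Chinese sample cannot be converted at all.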
Funny characters in the results? If you have some UTF-8
characters mixed into your web pages (and your editor may do it for
accents without telling you), then your pages should be declared as
UTF-8. If you have no declaration or an incorrect declaration,
PicoSearch may conclude that your pages are ISO and you may find some
funny characters in the search results. It doesn't help that browsers
tend to hide this and many other HTML problems, so your site could be
developed for a while before you realize something is actually
incorrect. To fix the funny characters, reindex after doing one of the
following:
- make sure you have the following UTF-8 meta equivalent declaration in the head of your HTML:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
- OR if you stick with ISO for your page then make sure your accented
characters are safely unambiguous, either by being in the same ISO set
or, best of all, by using the fool-proof HTML character entities like
&eacute; for é (the e acute accent).
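
If you want to see where the funny characters come from, the following Python sketch reproduces the effect and shows the character entity alternative. The function names `misread_as_iso` and `to_entities` are invented for this example; only the standard `html.entities` table is used.

```python
from html.entities import codepoint2name


def misread_as_iso(text):
    """Show what UTF-8 text looks like when it is wrongly read as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def to_entities(text):
    """Replace every non-ASCII character with an HTML character entity."""
    parts = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            parts.append(ch)                           # plain ASCII stays as-is
        elif code in codepoint2name:
            parts.append(f"&{codepoint2name[code]};")  # named entity
        else:
            parts.append(f"&#{code};")                 # numeric fallback
    return "".join(parts)


print(misread_as_iso("café"))  # the accent turns into two stray characters
print(to_entities("café"))     # pure ASCII, safe in any character set
```

Because the entity form is plain ASCII, it reads the same no matter which character set the page is declared in.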
A note on the Euro sign €: The Euro sign is historically unusual
since it is both a recently invented character and very common, so most
browsers are forgiving regardless of character set. The best way to
display the Euro sign in your HTML is with the spelled character entity
&euro; no matter what character set you're in. By typing &euro;
you won't have to worry that technically the Euro sign is not in
ISO-8859-1 but in the ISO-8859-15 set, which mostly copies ISO-8859-1, and that a UTF-8 Euro
technically isn't the same as an ISO Euro.
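
The difference shows up clearly at the byte level. This short Python sketch (the helper name `euro_bytes` is ours, for illustration) encodes the Euro sign in several sets:

```python
EURO = "\u20ac"  # the Euro sign


def euro_bytes(charset):
    """Return the bytes for the Euro sign in a set, or None if it has none."""
    try:
        return EURO.encode(charset)
    except UnicodeEncodeError:
        return None


for charset in ["iso-8859-1", "iso-8859-15", "windows-1252", "utf-8"]:
    encoded = euro_bytes(charset)
    if encoded is None:
        print(f"{charset}: no Euro sign in this set")
    else:
        print(f"{charset}: {encoded.hex()}")
```

Each set that has a Euro sign stores it differently, which is why the spelled entity is the safe choice.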
Determining Character Set: To correctly break apart your text
into all the component words by distinguishing letters from punctuation,
PicoSearch needs to decide what your dominant character set is at
indexing time. Character set is bigger than language, so don't worry
that this means you can't be multi-lingual; it just means that you have
to choose the right character set for your languages. Browsers need to
know this information too, so if you're designing a non-English,
non-West European site then you probably already know which character
set to use and where to add it to your web pages. When displaying
search results, PicoSearch will insert the charset for the browser in
the output HTML page of Free accounts. If you have a paying account,
you'll have to put the charset codes that you want in your Customize
Template section, since the HTML page design is under your control.
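
To see why the character set matters for breaking text into words, consider this small Python sketch. It is only an illustration of the general idea, not PicoSearch's indexer: the function `split_words` and the sample word are our own, and the regular expression is a simple stand-in for real word-breaking rules.

```python
import re


def split_words(raw: bytes, charset: str) -> list:
    """Decode bytes with the given set, then collect runs of word characters."""
    return re.findall(r"\w+", raw.decode(charset))


# Russian for "rose", stored as ISO Cyrillic bytes.
raw = "роза".encode("iso-8859-5")

print(split_words(raw, "iso-8859-5"))  # right set: one whole word
print(split_words(raw, "iso-8859-1"))  # wrong set: the word falls apart
```

Read with the wrong set, one of the Cyrillic letters lands on a byte that ISO-8859-1 treats as a symbol, so the single word is split into fragments.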
PicoSearch will pick up the character set of your site's
pages as specified in any one of the following ways. In these examples,
ISO-8859-1 is the most common set, which works for English and West
European languages; it is the default, but it never hurts to
state it explicitly.
- HTTP Equivalents in the HTML head
  - <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /> (most common)
  - <meta http-equiv="charset" content="ISO-8859-1" /> (less common)
  - <meta charset="ISO-8859-1" /> (Internet Explorer only, not recommended)
- Server specified
  In the encoding field of the server's HTTP Content-Type header, which the HTTP equivalent overrides.
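
As a rough sketch of how a program might read these declarations, here is a Python example built on the standard `html.parser` module. The class `CharsetSniffer`, the function `detect_charset`, and the `server_charset` parameter are hypothetical names for this illustration; PicoSearch's own detection may differ in its details.

```python
from html.parser import HTMLParser

DEFAULT_CHARSET = "iso-8859-1"  # the default named in the text


class CharsetSniffer(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attributes = {name: (value or "") for name, value in attrs}
        http_equiv = attributes.get("http-equiv", "").lower()
        content = attributes.get("content", "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            # content="text/html; charset=ISO-8859-1"
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()
        elif http_equiv == "charset" and content:
            # <meta http-equiv="charset" content="ISO-8859-1" />
            self.charset = content.strip()
        elif attributes.get("charset"):
            # <meta charset="ISO-8859-1" />
            self.charset = attributes["charset"].strip().lower()


def detect_charset(html, server_charset=None):
    """Prefer the page's own declaration, then the server's, then the default."""
    sniffer = CharsetSniffer()
    sniffer.feed(html)
    return sniffer.charset or (server_charset or DEFAULT_CHARSET).lower()


page = '<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>'
print(detect_charset(page))
print(detect_charset("<p>no declaration here</p>"))
```

The page's own declaration wins over the server value here, matching the order described above, and a page with no declaration falls back to ISO-8859-1.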
The following list shows the character sets that PicoSearch knows about,
provided that the set is specified for your pages as mentioned above.
If you are using an unknown or unspecified set, ISO-8859-1 will be used
by default. A symptom of PicoSearch not using the right set would be
a search finding a non-Western character individually, because words
were broken apart at letters mistaken for punctuation. PicoSearch will tell you
the character set it used for your index in the Alternate Character
Options section of your Account Manager. If you have any problems,
please just contact us.
- Western European, ISO Latin1 (ISO-8859-1)
most versatile, includes English, Spanish, French, German, Italian, Portuguese, Dutch, Danish, Swedish, Catalan, and more
- Central European, ISO Latin2 (ISO-8859-2)
covers the Slavic languages written in Latin script and more, including Czech, Hungarian, Polish, Romanian, and Croatian
- South European, ISO Latin3 (ISO-8859-3)
special set for Maltese and some others
- North European, ISO Latin4 (ISO-8859-4)
for Estonian, the Baltic languages Lithuanian and Latvian, and Lappish (Sami)
- Cyrillic, ISO (ISO-8859-5)
- Arabic, ISO (ISO-8859-6)
- Greek, ISO (ISO-8859-7)
- Hebrew, ISO (ISO-8859-8)
- Turkish, ISO Latin5 (ISO-8859-9)
- Nordic, ISO Latin6 (ISO-8859-10)
- Thai, ISO Latin/Thai (ISO-8859-11)
- Baltic Rim, ISO Latin7 (ISO-8859-13)
- Celtic, ISO Latin8 (ISO-8859-14)
- "Euro" Western European, ISO Latin9 (ISO-8859-15)
- South-Eastern Europe, ISO Latin10 (ISO-8859-16)
- Central European, Win Latin2 (windows-1250)
- Slavic, Windows Cyrillic (windows-1251)
- Western European, Win Latin1 (windows-1252) compare to ISO Latin1
- Greek, Windows (windows-1253)
- Turkish, Win Latin5 (windows-1254)
- Hebrew, Windows (windows-1255)
- Arabic, Windows (windows-1256)
- Baltic, Windows (windows-1257)
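
If you script against your own pages, the list above and the ISO-8859-1 fallback can be mirrored in a few lines of Python. This is only a sketch: `KNOWN_CHARSETS`, `choose_charset`, and `python_codec_name` are names made up for this example, and UTF-8 is deliberately left out because it is handled by the separate conversion described earlier.

```python
import codecs

# The sets listed above, written in lowercase.
KNOWN_CHARSETS = (
    [f"iso-8859-{n}" for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)]
    + [f"windows-{n}" for n in range(1250, 1258)]
)


def choose_charset(declared):
    """Return the declared set if it is in the list, else the ISO-8859-1 default."""
    if declared and declared.lower() in KNOWN_CHARSETS:
        return declared.lower()
    return "iso-8859-1"


def python_codec_name(charset):
    """Look up Python's own codec name for a set (raises LookupError if unknown)."""
    return codecs.lookup(charset).name


print(choose_charset("Windows-1251"))
print(choose_charset("x-unknown-set"))
print(choose_charset(None))
print(python_codec_name("iso-8859-15"))
```

Python's standard codecs cover all of the listed sets, so the same names can be used to decode test pages locally before you reindex.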