Handling Multiple Encodings

In some environments and for some purposes it may be necessary to process files in several different encodings, including non-Unicode ones.  For example, a fax bureau in Japan might receive broadcast lists either in Unicode (UTF-16 or UTF-8) or in SJIS encoding.  Text and HTML files to be e-mailed may or may not have a BOM.

To deal with this situation, CopiaFacts provides control variables (CP_...) which specify what code pages are to be processed and how they are to be detected.  Currently, the available variables are:

CP_COPIAEDIT: Controls the automatic detection of encoding in files other than CopiaFacts command files, when loaded by COPIAEDIT.  Enabling this feature requires this variable to be defined in FAXFACTS.CFG, and the feature must also be enabled in the COPIAEDIT Options menu.  When enabled, COPIAEDIT also allows the file encoding to be changed before saving, again excluding command files, which are always saved in the encoding specified by $unicode.
CP_EMAILHTMLATTACH: Controls the automatic detection of encoding in e-mail attach files with content type text/html (which may need to be processed for embedded variables).
CP_EMAILBODY: Controls the automatic detection of encoding in e-mail body and e-mail alternate body files.
CP_EMAILPSUBJECT: Controls the automatic detection of encoding in files containing the e-mail subject line (specified on an EMAIL_PSUBJECT variable).
CP_HTMLEXPAND: Controls the automatic detection of encoding in an HTML job document file for which variable content is to be expanded before processing (specified by job options keyword HtmlExpandVars).
CP_LIST: Controls the processing of text-format broadcast lists (not Excel or dBase) in Job Administration and FFBC.  The text to be checked is the entire list file.
CP_GCOVER: Controls the processing of text files to be loaded for use with an ASCII_TEMPLATE GCT or GTT file.  Such files need not, of course, contain only ASCII, nor need the GCT/GTT file be specified only as a cover sheet.  The text to be checked is the entire text file.

Each variable applies to any operation during which the command file in which it is defined is in use.  For example, CP_LIST may be placed in a UJP file, or CP_GCOVER in an FS file.  The check is also exposed as a CheckCP Job Administration DLL function.

It is strongly recommended that you check the detection specified by these variables against typical examples of the files you expect to handle.

The value of the variable is one or more of the following keywords, separated by either white space or commas.  The sequence of the keywords is significant.
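
As an illustration of the value syntax, the following Python sketch splits a value into its keywords, preserving their order.  It is not CopiaFacts code: the function name parse_cp_value and the (name, number) result format are hypothetical, and treating keywords as case-sensitive is an assumption.  The rules it applies come from the keyword descriptions on this page: separators are white space or commas, CONF defaults to 65, and DEF must come last.

```python
import re

# Keywords that take no number, and keywords that may or must carry one.
PLAIN = {"BOM", "ASCII", "UTF8", "HTML", "UTF16"}
NUMBERED = re.compile(r"(CONF|CP|DEF)(\d*)")
DEFAULT_CONFIDENCE = 65  # documented default when CONF has no number


def parse_cp_value(value):
    """Return the keywords of a CP_... value as ordered (name, number) pairs."""
    parsed = []
    for word in re.split(r"[\s,]+", value.strip()):
        if not word:
            continue
        if word in PLAIN:
            parsed.append((word, None))
            continue
        match = NUMBERED.fullmatch(word)
        if match is None:
            raise ValueError(f"unrecognized keyword: {word!r}")
        name, digits = match.groups()
        if name == "CONF":
            number = int(digits) if digits else DEFAULT_CONFIDENCE
        elif name == "DEF":
            if not digits:
                raise ValueError("DEF needs a code page number")
            number = int(digits)
        else:  # CP: a bare CP leaves the choice of code page to the detector
            number = int(digits) if digits else None
        parsed.append((name, number))
    # DEF is only meaningful as the final fallback.
    if any(name == "DEF" for name, _ in parsed[:-1]):
        raise ValueError("DEF must be the last keyword")
    return parsed


print(parse_cp_value("BOM UTF8 DEF1252"))
print(parse_cp_value("BOM,UTF8,UTF16,CP932,CP1252"))
```

Keeping the result as an ordered list, rather than a set or dictionary, reflects the fact that the sequence of the keywords is significant.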

BOM: Causes a supported byte-order mark to be treated as valid and indicative of the encoding of the text, which is not checked further.  Only the first few bytes of the file need to be examined, so this check is very fast if done first.
ASCII: Checks to see if the text is a valid ASCII file (contains no high-bit characters).  Since ASCII files are also detected by the UTF8 check (and can also be processed as utf-8), this check can be omitted if you are also checking for utf-8.  This check reads the whole file, but is simple and quick to perform.
UTF8: Checks to see if the text is a valid UTF-8 file.  Note that a pure ASCII file with no high-bit characters will pass this test (and will be detected as ASCII), since ASCII is a subset of UTF-8.  This check is relatively fast and will reliably exclude character sets that are not UTF-8, since it requires every high-bit character in the file to be part of a valid UTF-8 sequence.
HTML: Checks the first 2048 bytes for the presence of a 'charset=' specifier and compares it against the list supported for $email_charset, plus the values utf-16 and us-ascii.  This check first does a BOM check, if you have not specified one, because it is not unknown for a file to include 'charset=utf-8' but to be saved as a utf-16 file.
UTF16: This is a special check for UTF-16 broadcast lists which are expected to contain telephone numbers (and have not already been selected as having a BOM).  It checks the first 200 characters of the file for sequences of 8 or more digits encoded as UTF-16 (BE or LE) code units 0030 to 0039, with possible embedded spaces, hyphens, parentheses and periods.  More than one such sequence will be taken as indicating a UTF-16 file.  This is a reliable and quick check because digits are not encoded like this (ASCII equivalent with intervening null bytes) in any other encoding.  However, it is not suitable for normal text files or e-mail broadcast lists.  General UTF-16 text can be selected using CP1200.
CONFnnn: Sets the confidence threshold for Windows Code Page Detection on subsequent checks.  If nnn is not specified, it defaults to 65.  The confidence level is not strictly a percentage, because it is a relative measure and may sometimes exceed 100, but in practice it can be thought of as a percentage.
CPnnnnn: Uses Windows Code Page Detection to determine whether the text is likely to be encoded with the specified Microsoft Windows Code Page identifier.  For example, CP932 will check for Japanese SJIS.  If the confidence level reaches the specified threshold, the check is deemed to be true.  If the code page number is not specified, Windows will offer one or more code pages, each with a confidence level.  If the highest confidence level reaches the specified threshold, the corresponding code page is deemed to be that of the text.  You may include multiple CPnnnnn keywords and use CP as the last or only one of them.
DEFnnnnn: Assumes the code page specified by nnnnn if no earlier positive conclusion has been reached.  For example, DEF1252 indicates Windows Latin1, the default in the USA.  If no default is provided, and no detection has been made, the associated operation will be signaled to fail, if possible.  This keyword must come last.
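
The UTF16 telephone-number check described above can be approximated as follows.  This is only a sketch of the idea, not the CopiaFacts implementation: the function name guess_utf16_by_numbers is hypothetical, the result is a Python codec name rather than a code page, and the assumption that separators count only when they fall between digits is an interpretation of "embedded".

```python
import re

# A run of 8 or more ASCII digits, allowing spaces, hyphens, parentheses
# and periods between digits (never at the start or end of the run).
DIGIT_RUN = re.compile(r"[0-9](?:[ ().\-]*[0-9]){7,}")


def guess_utf16_by_numbers(data, max_chars=200):
    """Return 'utf-16-le' or 'utf-16-be' if the start of data holds more
    than one telephone-like digit run in that byte order, else None."""
    head = data[: max_chars * 2]  # two bytes per UTF-16 code unit
    for codec in ("utf-16-le", "utf-16-be"):
        # Undecodable units become U+FFFD, which cannot match a digit.
        text = head.decode(codec, errors="replace")
        if len(DIGIT_RUN.findall(text)) > 1:
            return codec
    return None


sample = "Name,Fax\r\nAlice,(212) 555-0100\r\nBob,212.555.0199\r\n"
print(guess_utf16_by_numbers(sample.encode("utf-16-le")))
print(guess_utf16_by_numbers(sample.encode("ascii")))
```

Decoding with each byte order before matching keeps the digit runs aligned on two-byte boundaries.  Note that in this sketch two numbers separated only by spaces would be counted as a single run, so it suits lists with one number per line or per field.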

As each check is performed, a positive result will end the process and return a code page which will be used to load the file and convert it to CopiaFacts' internal Unicode format.
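
The "first positive result wins" behavior can be pictured with the following sketch, which covers only the BOM, ASCII, UTF8 and DEFnnnnn keywords.  The function names are hypothetical, the set of byte-order marks treated as supported is an assumption, and the returned numbers are the standard Windows code page identifiers (65001 for UTF-8, 1200 and 1201 for UTF-16 LE and BE, 20127 for US-ASCII).

```python
import codecs

# Assumed set of supported byte-order marks and their code pages.
BOMS = [
    (codecs.BOM_UTF8, 65001),
    (codecs.BOM_UTF16_LE, 1200),
    (codecs.BOM_UTF16_BE, 1201),
]


def check_bom(data):
    for bom, code_page in BOMS:
        if data.startswith(bom):
            return code_page
    return None


def check_ascii(data):
    # Valid ASCII means no byte has its high bit set.
    return 20127 if all(b < 0x80 for b in data) else None


def check_utf8(data):
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # A pure ASCII file passes this test and is reported as ASCII.
    return check_ascii(data) or 65001


CHECKS = {"BOM": check_bom, "ASCII": check_ascii, "UTF8": check_utf8}


def detect(data, keywords):
    """Run the checks in the given order; the first positive result wins."""
    for word in keywords:
        if word.startswith("DEF"):
            return int(word[3:])  # fallback when nothing earlier matched
        result = CHECKS[word](data)
        if result is not None:
            return result
    return None  # no detection and no default: the operation should fail


print(detect("héllo".encode("utf-8"), ["BOM", "UTF8", "DEF1252"]))
print(detect("héllo".encode("cp1252"), ["BOM", "UTF8", "DEF1252"]))
```

In the second call the byte E9 is not part of a valid UTF-8 sequence, so the UTF8 check fails and the DEF1252 fallback applies.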

Default:

When no variable is specified, files are assumed to be in system default encoding if they have no BOM, or to be Unicode as specified by the BOM.  Individual text strings are assumed to be in the encoding specified by the command file in which they are included: command files must be in system default encoding or as specified by the supported BOM.
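
For comparison, the default behavior amounts to something like the sketch below.  The function name decode_default and its system_encoding parameter are hypothetical, the set of byte-order marks is an assumption, and Python's locale.getpreferredencoding is used here as a stand-in for the system default encoding.

```python
import codecs
import locale

# Assumed set of supported byte-order marks.
BOM_CODECS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_default(data, system_encoding=None):
    """Decode as Unicode if a BOM is present, else as the system default."""
    for bom, codec in BOM_CODECS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec)  # strip the BOM itself
    encoding = system_encoding or locale.getpreferredencoding(False)
    return data.decode(encoding)


print(decode_default(codecs.BOM_UTF8 + "café".encode("utf-8")))
print(decode_default(b"caf\xe9", system_encoding="cp1252"))
```

A UTF-8 file without a BOM containing high-bit characters would be misread under this default, which is the situation the CP_... variables are intended to address.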

Examples:

"BOM UTF8 DEF1252": Used when you expect only files with a BOM, USA standard encoding files, and UTF-8 files without a BOM.
"BOM UTF8 CP932": Used when you expect UTF-8 with/without BOM, UTF-16 with BOM, SJIS, ASCII.  ANSI and DBCS non-Unicode encodings other than SJIS would be rejected.
BOM,UTF8,UTF16,CP932,CP1252: As the previous example but also accepts UTF-16 without BOM and USA standard encoding.
"CP DEF1252": Relies on Windows Code Page Detection for all encodings, defaults to USA standard encoding.