author | Roland Reichwein <mail@reichwein.it> | 2022-01-03 15:15:09 +0100 |
---|---|---|
committer | Roland Reichwein <mail@reichwein.it> | 2022-01-03 15:15:09 +0100 |
commit | ec9c8e682d615cd2b51ea0fec05273ed4dcad50a (patch) | |
tree | b42c3bc2cfd2b79e89e846b9b4f18ea4efb23d2f /README.txt | |
parent | 92a6e4270e752acc01b0823713919b2446ec7753 (diff) |
Documentation
Diffstat (limited to 'README.txt')
-rw-r--r-- | README.txt | 163 |
1 files changed, 163 insertions, 0 deletions
diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..9544b49
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,163 @@
+Reichwein.IT unicode library
+============================
+
+This software package contains a C++ library for Unicode encoding conversion
+and command line tools which apply those functions in example runtime programs:
+recode and validate.
+
+
+C++ interface (package libunicode-dev)
+--------------------------------------
+
+This library includes multiple encoding specification concepts to choose from:
+While explicit specification of source and destination encodings is possible,
+implicit specification of encoding of Unicode UTF encodings is also implemented
+via the respective C++ types: For char8_t, char16_t and char32_t, the
+respective UTF-8, UTF-16 and UTF-32 encoding is automatically used. In case of
+C++17 where char8_t is not implemented, char is used instead. The same applies
+for the std::basic_string<> specializations std::u8string (or std::string on
+C++17), std::u16string and std::u32string.
+
+The main purpose of this library is conversion (and validation) between Unicode
+encodings. However, Latin-1 (i.e. ISO 8859-1) and Latin-9 (i.e. ISO 8859-15)
+are also implemented for practical reasons. Since the Latin character sets are
+also encoded in char and std::string (at least necessarily on C++17), the Latin
+encodings must be specified explicitly for disambiguation where Unicode is used
+by default otherwise. I.e. UTF-8 is the default for all 8 bit character types,
+UTF-16 is the default for 16 bit character types and UTF-32 is the default for
+32 bit character types.
+
+Besides support for different character and string types from the STL, common
+container types like std::vector, std::deque, std::list and std::array (the
+latter only as source) are supported.
+
+The basic convention for the conversion interface is:
+
+  to = unicode::convert<FromType, ToType>(from);
+
+where FromType and ToType can be one of:
+
+(1) Character type like char, char8_t, char16_t and char32_t
+(2) Container type like std::string, std::list<char>, std::deque<char32_t>
+(3) Explicit encoding like unicode::UTF_8, unicode::UTF_16, unicode::UTF_32,
+    unicode::ISO_8859_1 or unicode::ISO_8859_15
+
+For the validation interface, the same principle applies:
+
+  bool flag = unicode::is_valid_utf<FromType>(from);
+
+There is also a Unicode character validation function which operates on Unicode
+character values directly, i.e. no specific encoding is used but 32 bit (or
+less) values are evaluated for a valid Unicode character:
+
+  bool flag = unicode::is_valid_unicode(character_value);
+
+While this validates a Unicode value in general, it doesn't tell if the
+specified value is actually designated in an actual Unicode version. E.g. as of
+2022, in current Unicode version 14.0, the character 0x1FABA "NEST WITH EGGS"
+is designated, but not 0x1FABB. Both of them would be detected as "valid" by
+unicode::is_valid_unicode(). See also:
+
+https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
+
+
+Examples:
+
+#include <unicode.h>
+...
+
+C++17 conversion of a UTF-8 string to UTF-16:
+
+  std::string utf8_value {u8"äöü"};
+  std::u16string utf16_value{unicode::convert<char, char16_t>(utf8_value)};
+
+C++20 conversion of a UTF-8 string to UTF-16:
+
+  std::u8string utf8_value {u8"äöü"};
+  std::u16string utf16_value{unicode::convert<char8_t, char16_t>(utf8_value)};
+
+The following encodings are implicitly deduced from types:
+  * char resp. char8_t (C++20): UTF-8
+  * char16_t: UTF-16
+  * char32_t: UTF-32
+
+Specification via container types:
+
+  std::deque<char> utf8_value {...};
+  std::list<wchar_t> utf16_value{unicode::convert<std::deque<char>, std::list<wchar_t>>(utf8_value)};
+
+Explicit encoding specification:
+
+  std::string value {"äöü"};
+  std::u32string utf32_value{unicode::convert<unicode::ISO_8859_1, unicode::UTF_32>(value)};
+
+Supported encodings are:
+
+  * unicode::UTF_8
+  * unicode::UTF_16
+  * unicode::UTF_32
+  * unicode::ISO_8859_1
+  * unicode::ISO_8859_15
+
+Supported basic types for source and target characters:
+  * char
+  * char8_t (C++20)
+  * wchar_t (UTF-16 on Windows, UTF-32 on Linux)
+  * char16_t
+  * char32_t
+  * uint8_t, int8_t
+  * uint16_t, int16_t
+  * uint32_t, int32_t
+  * basically, all basic 8-bit, 16-bit and 32-bit types that can encode
+    UTF-8, UTF-16 and UTF-32, respectively.
+
+Supported container types:
+  * All std container types that can be iterated (vector, list, deque, array)
+  * Source and target containers can be different container types
+
+Validation can be done like this:
+
+  bool valid{unicode::is_valid_utf<char16_t>(utf16_value)};
+
+Or via explicit encoding specification:
+
+  bool valid{unicode::is_valid_utf<unicode::UTF_8>(utf8_value)};
+
+
+CLI interface (package unicode-tools)
+-------------------------------------
+
+* unicode-recode
+
+  Usage: recode <from-format> <from-file> <to-format> <to-file>
+  Format:
+    UTF-8        UTF-8
+    UTF-16       UTF-16, native endian
+    UTF-16LE     UTF-16, little endian
+    UTF-16BE     UTF-16, big endian
+    UTF-32       UTF-32, native endian
+    UTF-32LE     UTF-32, little endian
+    UTF-32BE     UTF-32, big endian
+    ISO-8859-1   ISO-8859-1 (Latin-1)
+    ISO-8859-15  ISO-8859-15 (Latin-9)
+  Exit code: 0 if valid, 1 otherwise.
+
+* unicode-validate
+
+  Usage: validate <format> <file>
+  Format:
+    UTF-8        UTF-8
+    UTF-16       UTF-16, big or little endian
+    UTF-16LE     UTF-16, little endian
+    UTF-16BE     UTF-16, big endian
+    UTF-32       UTF-32, big or little endian
+    UTF-32LE     UTF-32, little endian
+    UTF-32BE     UTF-32, big endian
+  Exit code: 0 if valid, 1 otherwise.
+
+
+Contact
+-------
+
+Reichwein IT <mail@reichwein.it>
+
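
The README added by this commit notes that unicode::is_valid_unicode() accepts both 0x1FABA and 0x1FABB, although only the former is designated in Unicode 14.0. The following is a minimal standalone sketch of a range check that would behave this way. It is not taken from the library: the name is_valid_code_point and the exact rule (at most 0x10FFFF and not a UTF-16 surrogate) are assumptions made for illustration, and the library's actual implementation may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, NOT the library's implementation (assumption):
// a value is accepted if it lies in U+0000..U+10FFFF and is not a
// UTF-16 surrogate (U+D800..U+DFFF).
constexpr bool is_valid_code_point(std::uint32_t value)
{
  const bool in_range = value <= 0x10FFFF;
  const bool is_surrogate = value >= 0xD800 && value <= 0xDFFF;
  return in_range && !is_surrogate;
}

// Checks the two values mentioned in the README plus the range limits.
inline void check_readme_examples()
{
  // A pure range check cannot tell designated from undesignated values:
  // both pass, although only U+1FABA is assigned in Unicode 14.0.
  assert(is_valid_code_point(0x1FABA));
  assert(is_valid_code_point(0x1FABB));

  // Surrogates and values beyond U+10FFFF are rejected.
  assert(!is_valid_code_point(0xD800));
  assert(!is_valid_code_point(0x110000));
}
```

Telling designated characters from undesignated ones would need a lookup against the Unicode Character Database (the UnicodeData.txt file linked in the README), which a range check like this does not attempt.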