Unicode encoding, conversion and validation are well understood and well supported in C++ and on Linux. So why provide another Unicode library for C++?
Currently (C++20 being the latest C++ standard), the C++ standard library provides Unicode conversion (std::wstring_convert and the <codecvt> facets), but these facilities have been marked as deprecated since C++17. boost::locale also offers Unicode conversion, but as the name suggests, it is locale dependent, and using it can add dozens of megabytes to a simple executable just for Unicode conversion, a task that should not depend on locales at all.
Therefore, this library provides a C++17- and C++20-conformant way to perform the basic task of converting between UTF-8 (default encoding under Linux), UTF-16 (default encoding under Windows and in Qt strings) and UTF-32 (commonly used in GUI and typesetting libraries such as FreeType).
The command line interface is just a runtime application of the provided library. Other tools that offer the same functionality are available; see below.
Usage: unicode-recode <from-format> <from-file> <to-format> <to-file>

Format:
    UTF-8        UTF-8
    UTF-16       UTF-16, native endian
    UTF-16LE     UTF-16, little endian
    UTF-16BE     UTF-16, big endian
    UTF-32       UTF-32, native endian
    UTF-32LE     UTF-32, little endian
    UTF-32BE     UTF-32, big endian
    ISO-8859-1   ISO-8859-1 (Latin-1)
    ISO-8859-15  ISO-8859-15 (Latin-9)

Exit code: 0 if valid, 1 otherwise.
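The LE and BE format names above differ only in the order in which the bytes of each code unit are written. The following is a minimal sketch of that idea for UTF-16, not the implementation used by unicode-recode; the names ByteOrder and utf16_to_bytes are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical names for illustration only; not part of the library.
enum class ByteOrder { Little, Big };

// Serialize UTF-16 code units to bytes in an explicitly chosen byte order.
std::vector<std::uint8_t> utf16_to_bytes(const std::u16string& units, ByteOrder order)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(units.size() * 2);
    for (char16_t unit : units) {
        const auto low  = static_cast<std::uint8_t>(unit & 0xFF);
        const auto high = static_cast<std::uint8_t>((unit >> 8) & 0xFF);
        if (order == ByteOrder::Little) {
            bytes.push_back(low);   // least significant byte first
            bytes.push_back(high);
        } else {
            bytes.push_back(high);  // most significant byte first
            bytes.push_back(low);
        }
    }
    return bytes;
}
```

For the single code unit 0x00E4 ("ä"), this yields the bytes E4 00 in little endian and 00 E4 in big endian. The "native endian" formats would simply pick whichever order the host machine uses.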
Usage: unicode-validate <format> <file>

Format:
    UTF-8        UTF-8
    UTF-16       UTF-16, big or little endian
    UTF-16LE     UTF-16, little endian
    UTF-16BE     UTF-16, big endian
    UTF-32       UTF-32, big or little endian
    UTF-32LE     UTF-32, little endian
    UTF-32BE     UTF-32, big endian

Exit code: 0 if valid, 1 otherwise.
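To make concrete what "valid" means for UTF-8, here is a minimal sketch of the usual well-formedness checks: correct lead and continuation bytes, no overlong forms, no surrogates, and nothing above U+10FFFF. It is an independent illustration and not the code behind unicode-validate; the function name is_valid_utf8_sketch is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper for illustration; not the library's implementation.
// Returns true if `bytes` is well-formed UTF-8.
bool is_valid_utf8_sketch(std::string_view bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto lead = static_cast<std::uint8_t>(bytes[i]);
        std::size_t extra = 0;         // number of continuation bytes expected
        std::uint32_t code_point = 0;  // decoded value so far
        std::uint32_t minimum = 0;     // smallest code point allowed for this length

        if (lead < 0x80)                { extra = 0; code_point = lead;        minimum = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; code_point = lead & 0x1F; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; code_point = lead & 0x0F; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; code_point = lead & 0x07; minimum = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (bytes.size() - i - 1 < extra) return false;  // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            const auto cont = static_cast<std::uint8_t>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) return false;  // not a continuation byte
            code_point = (code_point << 6) | (cont & 0x3F);
        }

        if (code_point < minimum) return false;                          // overlong form
        if (code_point >= 0xD800 && code_point <= 0xDFFF) return false;  // surrogate
        if (code_point > 0x10FFFF) return false;                         // out of range

        i += extra + 1;
    }
    return true;
}
```

A command line tool built on such a check could map the boolean result to the exit codes listed above, returning 0 for true and 1 for false.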
Example:

    #include <unicode.h>
    ...
    std::string utf8_value {"äöü"};
    std::u16string utf16_value{unicode::convert<char, char16_t>(utf8_value)};

And for C++20:

    std::u8string utf8_value {u8"äöü"};
    std::u16string utf16_value{unicode::convert<char8_t, char16_t>(utf8_value)};

The following encodings are implicitly deduced from the types:

* char or char8_t (C++20): UTF-8
* char16_t: UTF-16
* char32_t: UTF-32

You can specify different container types directly:

    std::deque<char> utf8_value {...};
    std::list<wchar_t> utf16_value{unicode::convert<std::deque<char>, std::list<wchar_t>>(utf8_value)};

Explicit encoding specification is also possible:

    std::u8string value {u8"äöü"};
    std::u16string utf16_value{unicode::convert<unicode::UTF_8, unicode::UTF_16>(value)};

    // value must contain ISO-8859-1 encoded bytes here
    std::string value {"äöü"};
    std::u32string utf32_value{unicode::convert<unicode::ISO_8859_1, unicode::UTF_32>(value)};

Supported encodings are:

* unicode::UTF_8
* unicode::UTF_16
* unicode::UTF_32
* unicode::ISO_8859_1
* unicode::ISO_8859_15

Supported basic types:

* char
* char8_t (C++20)
* wchar_t (UTF-16 on Windows, UTF-32 on Linux)
* char16_t
* char32_t
* uint8_t, int8_t
* uint16_t, int16_t
* uint32_t, int32_t
* in general, all fundamental 8-bit, 16-bit and 32-bit types, which can hold UTF-8, UTF-16 and UTF-32 code units, respectively

Supported container types:

* All std container types that can be iterated (vector, list, deque)
* Source and target containers can be different container types

Validation can be done like this:

    bool valid{unicode::is_valid_utf<char16_t>(utf16_value)};

Or via explicit encoding specification:

    bool valid{unicode::is_valid_utf<unicode::UTF_8>(utf8_value)};
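How an encoding can be deduced from a code unit type, and how source and target containers can differ, may be easier to see in a small standalone sketch. The code below does not use or reproduce this library: Encoding, deduced_encoding and utf32_to_utf16_sketch are hypothetical names. It assumes 8-bit bytes and valid input, and for brevity it covers only the UTF-32 to UTF-16 direction.

```cpp
#include <cstdint>
#include <deque>
#include <list>

// All names here are hypothetical and only illustrate the idea.
enum class Encoding { UTF_8, UTF_16, UTF_32 };

// Deduce the encoding from the width of the code unit type
// (assumes 8-bit bytes, so sizeof 1/2/4 means 8/16/32 bits).
template <typename Unit>
constexpr Encoding deduced_encoding()
{
    static_assert(sizeof(Unit) == 1 || sizeof(Unit) == 2 || sizeof(Unit) == 4,
                  "code unit type must be 8, 16 or 32 bits wide");
    if constexpr (sizeof(Unit) == 1) return Encoding::UTF_8;
    else if constexpr (sizeof(Unit) == 2) return Encoding::UTF_16;
    else return Encoding::UTF_32;
}

// Convert UTF-32 from any iterable container into any container that
// supports push_back. The input is assumed to be valid UTF-32.
template <typename To, typename From>
To utf32_to_utf16_sketch(const From& source)
{
    using ToUnit = typename To::value_type;
    static_assert(deduced_encoding<typename From::value_type>() == Encoding::UTF_32,
                  "source container must hold 32-bit code units");
    static_assert(deduced_encoding<ToUnit>() == Encoding::UTF_16,
                  "target container must hold 16-bit code units");

    To result;
    for (auto unit : source) {
        const auto code_point = static_cast<std::uint32_t>(unit);
        if (code_point < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit
            result.push_back(static_cast<ToUnit>(code_point));
        } else {
            // Supplementary planes: encode as a surrogate pair
            const std::uint32_t offset = code_point - 0x10000;
            result.push_back(static_cast<ToUnit>(0xD800 + (offset >> 10)));
            result.push_back(static_cast<ToUnit>(0xDC00 + (offset & 0x3FF)));
        }
    }
    return result;
}
```

With this sketch, a call such as utf32_to_utf16_sketch<std::list<std::uint16_t>>(input), where input is a std::deque<char32_t>, names only the target container and lets the compiler deduce the source type. Deducing by width alone is a simplification chosen for the sketch; it is what makes wchar_t map to UTF-16 on Windows and UTF-32 on Linux.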
Download is available from https://www.reichwein.it/download
Installation via Debian's APT mechanism is supported for Debian 11, Ubuntu 21.04 and Ubuntu 21.10. Add the respective line from the following choices to /etc/apt/sources.list:
    # For Debian 11:
    deb http://www.reichwein.it/debian/ stable debian11

    # For Ubuntu 21.04:
    deb http://www.reichwein.it/debian/ stable ubuntu2104

    # For Ubuntu 21.10:
    deb http://www.reichwein.it/debian/ stable ubuntu2110
The package reichwein-keyring lets apt verify cryptographic trust in the packages. It can be installed manually from the above sources.
Install unicode-tools (command line interface, CLI) and libunicode-dev (C++ development files) via the operating system's package mechanism:
    # apt-get update
    # apt-get install unicode-tools libunicode-dev
Source code is available at https://www.reichwein.it/download
The git repository can be browsed at https://www.reichwein.it/cgit/unicode.git/ and cloned via:
$ git clone http://reichwein.it/git/unicode
For Debian-like systems, you can use the following APT configuration. Add the respective line from the following choices to /etc/apt/sources.list:
    # For Debian 11:
    deb-src http://www.reichwein.it/debian/ stable debian11

    # For Ubuntu 21.04:
    deb-src http://www.reichwein.it/debian/ stable ubuntu2104

    # For Ubuntu 21.10:
    deb-src http://www.reichwein.it/debian/ stable ubuntu2110