Reichwein.IT Unicode Library
============================

This software package contains a C++ library for Unicode encoding conversion
and validation, plus two command line tools that use the library as example
programs: recode and validate.

Properties
----------

* Supports C++17 and C++20
* Locale independent validation and conversion
* Supports UTF-8, UTF-16, UTF-32, ISO-8859-1 and ISO-8859-15
* Supports Linux and Windows
* Supports current compilers (clang++-11, clang++-13, g++-11, msvc-19.28.29337)


C++ interface (package libunicode-dev)
--------------------------------------

The library offers several ways to specify encodings. Source and destination
encodings can be specified explicitly. Alternatively, the Unicode encodings
are deduced implicitly from the C++ types: for char8_t, char16_t and char32_t,
UTF-8, UTF-16 and UTF-32 are used, respectively. On C++17, where char8_t is
not available, char is used instead. The same applies to the
std::basic_string<> specializations std::u8string (or std::string on C++17),
std::u16string and std::u32string.

The main purpose of this library is conversion (and validation) between Unicode
encodings. However, Latin-1 (i.e. ISO 8859-1) and Latin-9 (i.e. ISO 8859-15)
are also implemented for practical reasons. Since Latin-encoded text is also
stored in char and std::string (necessarily so on C++17), the Latin encodings
must be specified explicitly for disambiguation; otherwise, Unicode is assumed
by default. I.e. UTF-8 is the default for all 8 bit character types, UTF-16 is
the default for 16 bit character types and UTF-32 is the default for 32 bit
character types.
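
The default rule above can be pictured as a compile-time mapping from the width
of a character type to an encoding. The following is only a sketch under the
assumption of 8 bit bytes; the names Encoding and default_encoding() are
hypothetical and not taken from the library:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical names for illustration; not the library's actual declarations.
enum class Encoding { utf8, utf16, utf32 };

// Pick the default Unicode encoding from the width of the character type.
template <typename CharT>
constexpr Encoding default_encoding()
{
    static_assert(std::is_integral_v<CharT>, "character type must be integral");
    static_assert(sizeof(CharT) == 1 || sizeof(CharT) == 2 || sizeof(CharT) == 4,
                  "only 8, 16 and 32 bit character types are covered");
    if constexpr (sizeof(CharT) == 1)
        return Encoding::utf8;
    else if constexpr (sizeof(CharT) == 2)
        return Encoding::utf16;
    else
        return Encoding::utf32;
}

// Compile-time checks of the rule described in the text.
static_assert(default_encoding<char>() == Encoding::utf8);
static_assert(default_encoding<char16_t>() == Encoding::utf16);
static_assert(default_encoding<char32_t>() == Encoding::utf32);
static_assert(default_encoding<std::uint8_t>() == Encoding::utf8);
```

A width-based rule like this cannot tell Latin-1 from UTF-8, which is why the
Latin encodings need the explicit form.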

Besides support for different character and string types from the STL, common
container types like std::vector, std::deque, std::list and std::array (the
latter only as source) are supported.

The basic convention for the conversion interface is:

    to = unicode::convert<FromType, ToType>(from);

where FromType and ToType can be one of:

(1) Character type like char, char8_t, char16_t and char32_t
(2) Container type like std::string, std::list<char>, std::deque<char32_t>
(3) Explicit encoding like unicode::UTF_8, unicode::UTF_16, unicode::UTF_32,
    unicode::ISO_8859_1 or unicode::ISO_8859_15

For the validation interface, the same principle applies:

    bool flag = unicode::is_valid_utf<FromType>(from);

There is also a Unicode character validation function which operates on Unicode
character values directly, i.e. no specific encoding is used but 32 bit (or
less) values are evaluated for a valid Unicode character:

    bool flag = unicode::is_valid_unicode(character_value);

While this validates a Unicode value in general, it does not tell whether the
value is actually assigned in a particular Unicode version. E.g. as of 2022,
in the current Unicode version 14.0, the character 0x1FABA "NEST WITH EGGS"
is assigned, but 0x1FABB is not. Both of them are reported as "valid" by
unicode::is_valid_unicode(). See also:

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
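
To illustrate the difference between "valid" and "assigned", here is a minimal
sketch of a range check on Unicode values. It assumes that validity means
"inside the Unicode code space and not a surrogate"; the function name
is_scalar_value() is hypothetical, and the library's actual implementation may
differ:

```cpp
#include <cstdint>

// Sketch only: a value is a Unicode scalar value if it lies in the code space
// (at most 0x10FFFF) and is not a UTF-16 surrogate (0xD800..0xDFFF).
constexpr bool is_scalar_value(std::uint32_t value)
{
    return value <= 0x10FFFF && !(value >= 0xD800 && value <= 0xDFFF);
}

static_assert(is_scalar_value(0x1FABA));   // assigned in Unicode 14.0
static_assert(is_scalar_value(0x1FABB));   // unassigned in 14.0, still in range
static_assert(!is_scalar_value(0xD800));   // surrogate
static_assert(!is_scalar_value(0x110000)); // beyond the code space
```

A check of this kind needs no character database, which is why it cannot know
about assignments in a specific Unicode version.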


Examples:

#include <unicode.h>
...

C++17 conversion of a UTF-8 string to UTF-16:

  std::string utf8_value {u8"äöü"};
  std::u16string utf16_value{unicode::convert<char, char16_t>(utf8_value)};

C++20 conversion of a UTF-8 string to UTF-16:

  std::u8string utf8_value {u8"äöü"};
  std::u16string utf16_value{unicode::convert<char8_t, char16_t>(utf8_value)};

The following encodings are implicitly deduced from types:
  * char or char8_t (C++20): UTF-8
  * char16_t: UTF-16
  * char32_t: UTF-32

Specification via container types:
  
  std::deque<char> utf8_value {...};
  // wchar_t holds UTF-16 on Windows and UTF-32 on Linux
  std::list<wchar_t> wide_value{unicode::convert<std::deque<char>, std::list<wchar_t>>(utf8_value)};

Explicit encoding specification:

  std::string value {"\xe4\xf6\xfc"}; // "äöü" encoded as ISO-8859-1
  std::u32string utf32_value{unicode::convert<unicode::ISO_8859_1, unicode::UTF_32>(value)};
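
For background on the Latin encodings: ISO 8859-1 bytes map one-to-one to the
Unicode code points of the same value, while ISO 8859-15 replaces eight of them
(among others, 0xA4 becomes the euro sign). The sketch below shows such a
mapping to UTF-32. The helper names are hypothetical and the code is not taken
from the library:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: map one ISO 8859-15 byte to its Unicode code point.
// Only eight positions differ from ISO 8859-1.
constexpr char32_t latin9_to_code_point(std::uint8_t byte)
{
    switch (byte) {
    case 0xA4: return 0x20AC; // EURO SIGN
    case 0xA6: return 0x0160; // LATIN CAPITAL LETTER S WITH CARON
    case 0xA8: return 0x0161; // LATIN SMALL LETTER S WITH CARON
    case 0xB4: return 0x017D; // LATIN CAPITAL LETTER Z WITH CARON
    case 0xB8: return 0x017E; // LATIN SMALL LETTER Z WITH CARON
    case 0xBC: return 0x0152; // LATIN CAPITAL LIGATURE OE
    case 0xBD: return 0x0153; // LATIN SMALL LIGATURE OE
    case 0xBE: return 0x0178; // LATIN CAPITAL LETTER Y WITH DIAERESIS
    default:   return byte;   // identical to ISO 8859-1
    }
}

// Hypothetical helper: convert an ISO 8859-15 byte string to UTF-32.
inline std::u32string latin9_to_utf32(const std::string& input)
{
    std::u32string result;
    result.reserve(input.size());
    for (char c : input)
        result.push_back(latin9_to_code_point(static_cast<std::uint8_t>(c)));
    return result;
}
```

Because both Latin encodings and UTF-8 arrive as plain char data, the caller
has to state which one is meant.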

Supported encodings are:

  * unicode::UTF_8
  * unicode::UTF_16
  * unicode::UTF_32
  * unicode::ISO_8859_1
  * unicode::ISO_8859_15

Supported basic types for source and target characters:
  * char
  * char8_t (C++20)
  * wchar_t (UTF-16 on Windows, UTF-32 on Linux)
  * char16_t
  * char32_t
  * uint8_t, int8_t
  * uint16_t, int16_t
  * uint32_t, int32_t
  * in general, all basic 8-bit, 16-bit and 32-bit integer types, which can
    hold UTF-8, UTF-16 and UTF-32 code units, respectively.

Supported container types:
  * All std container types that can be iterated (vector, list, deque, array;
    std::array only as source)
  * Source and target containers can be different container types
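
As an illustration of container-independent conversion, the following sketch
encodes UTF-32 code points from any iterable source into any target container
that provides push_back(). The function name utf32_to_utf8() and its design are
assumptions made for this example, not the library's implementation:

```cpp
#include <array>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch: read code points from any iterable source and append
// the UTF-8 code units to any target container offering push_back().
template <typename To, typename From>
To utf32_to_utf8(const From& from)
{
    using Unit = typename To::value_type;
    To to;
    for (auto element : from) {
        const auto cp = static_cast<std::uint32_t>(element);
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid Unicode code point");
        if (cp < 0x80) {                       // 1 byte: ASCII
            to.push_back(static_cast<Unit>(cp));
        } else if (cp < 0x800) {               // 2 bytes
            to.push_back(static_cast<Unit>(0xC0 | (cp >> 6)));
            to.push_back(static_cast<Unit>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {             // 3 bytes
            to.push_back(static_cast<Unit>(0xE0 | (cp >> 12)));
            to.push_back(static_cast<Unit>(0x80 | ((cp >> 6) & 0x3F)));
            to.push_back(static_cast<Unit>(0x80 | (cp & 0x3F)));
        } else {                               // 4 bytes
            to.push_back(static_cast<Unit>(0xF0 | (cp >> 18)));
            to.push_back(static_cast<Unit>(0x80 | ((cp >> 12) & 0x3F)));
            to.push_back(static_cast<Unit>(0x80 | ((cp >> 6) & 0x3F)));
            to.push_back(static_cast<Unit>(0x80 | (cp & 0x3F)));
        }
    }
    return to;
}

// Usage: a std::array as source, a std::vector as target.
inline std::vector<std::uint8_t> example()
{
    const std::array<char32_t, 2> source{{0xE4, 0x20AC}};
    return utf32_to_utf8<std::vector<std::uint8_t>>(source);
}
```

In this sketch a fixed-size std::array works as a source but not as a target,
since it has no push_back(), which matches the restriction mentioned for
std::array.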

Validation can be done like this:

  bool valid{unicode::is_valid_utf<char16_t>(utf16_value)};

Or via explicit encoding specification:

  bool valid{unicode::is_valid_utf<unicode::UTF_8>(utf8_value)};
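
What "valid" means depends on the encoding. For UTF-16, for example, a high
surrogate must be followed by a low surrogate, and a low surrogate must not
appear on its own. The sketch below checks exactly this rule; the function name
is_well_formed_utf16() is hypothetical, and the library may check more or
differently:

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of a UTF-16 well-formedness check based on
// surrogate pairing only.
inline bool is_well_formed_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t unit = s[i];
        if (unit >= 0xD800 && unit <= 0xDBFF) {
            // High surrogate: the next unit must be a low surrogate.
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;
            ++i; // skip the low surrogate that completes the pair
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            // Low surrogate without a preceding high surrogate.
            return false;
        }
    }
    return true;
}
```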


CLI interface (package unicode-tools)
-------------------------------------

* unicode-recode

  Usage: recode <from-format> <from-file> <to-format> <to-file>
  Format:
      UTF-8       UTF-8
      UTF-16      UTF-16, native endian
      UTF-16LE    UTF-16, little endian
      UTF-16BE    UTF-16, big endian
      UTF-32      UTF-32, native endian
      UTF-32LE    UTF-32, little endian
      UTF-32BE    UTF-32, big endian
      ISO-8859-1  ISO-8859-1 (Latin-1)
      ISO-8859-15 ISO-8859-15 (Latin-9)
  Exit code: 0 if valid, 1 otherwise.

* unicode-validate

  Usage: validate <format> <file>
  Format:
      UTF-8     UTF-8
      UTF-16    UTF-16, big or little endian
      UTF-16LE  UTF-16, little endian
      UTF-16BE  UTF-16, big endian
      UTF-32    UTF-32, big or little endian
      UTF-32LE  UTF-32, little endian
      UTF-32BE  UTF-32, big endian
  Exit code: 0 if valid, 1 otherwise.
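
The LE and BE format names refer to the byte order in which 16 bit and 32 bit
code units are stored in a file. As a sketch of what this means for UTF-16,
the following hypothetical helper (not part of the tools or the library)
serializes code units to bytes in a chosen byte order:

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ByteOrder { little, big };

// Hypothetical sketch: write each UTF-16 code unit as two bytes,
// low byte first for little endian, high byte first for big endian.
inline std::vector<std::uint8_t> utf16_to_bytes(const std::u16string& s,
                                                ByteOrder order)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 2);
    for (char16_t unit : s) {
        const auto low = static_cast<std::uint8_t>(unit & 0xFF);
        const auto high = static_cast<std::uint8_t>(unit >> 8);
        if (order == ByteOrder::little) {
            bytes.push_back(low);
            bytes.push_back(high);
        } else {
            bytes.push_back(high);
            bytes.push_back(low);
        }
    }
    return bytes;
}
```

For the letter "A" (0x0041), this yields the bytes 0x41 0x00 in little endian
and 0x00 0x41 in big endian order.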


Contact
-------

Reichwein IT <mail@reichwein.it>