Character encodings of text files

A text file can be stored in many different character encodings, and there are many variants even on Windows alone. Text files in different encodings need special care. For example, if we use fstream's getline() to read a UTF-8 text file that contains Chinese, we get gibberish characters, whereas the text is read correctly if the file is in ANSI (the system's default Windows code page).

In this post, given a text file, I will show how to get its character encoding and how to convert it from one character encoding to another.

Get the character encoding of a text file

In many cases we cannot be sure which character encoding a file uses. Below I only show how to get a text file's character encoding when it can be one of 4 basic ones, namely ANSI, Unicode (UTF-16 little endian), Unicode big endian and UTF-8 (with BOM). These are the 4 encodings that Notepad supports. The method is not guaranteed to give the correct answer for anything else, e.g. a UTF-8 file without BOM will be reported as ANSI. See the following code (based on [1] [2]).

#include <windows.h>
#include <cstring>

// Return values:
//  0 - ANSI (no BOM found)
//  1 - Unicode (UTF-16 little endian)
//  2 - Unicode big endian (UTF-16 big endian)
//  3 - UTF-8 (with BOM)
// -1 - the file could not be opened or read
// NOTE: only correct for the normal encodings (i.e. Notepad's 4 types);
// a UTF-8 file without BOM is reported as ANSI.
int get_text_file_encoding(const char *filename)
{
    const BYTE uniTxt[] = {0xFF, 0xFE};       // Unicode file header
    const BYTE endianTxt[] = {0xFE, 0xFF};    // Unicode big endian file header
    const BYTE utf8Txt[] = {0xEF, 0xBB, 0xBF}; // UTF-8 file header (3 bytes)

    // CreateFileA takes a char file name whether or not UNICODE is defined
    HANDLE hFile = CreateFileA(filename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return -1; // there is no handle to close here

    BYTE header[3] = {0, 0, 0};
    DWORD dwBytesRead = 0;
    BOOL bRead = ReadFile(hFile, header, sizeof(header), &dwBytesRead, NULL);
    CloseHandle(hFile);
    if (!bRead)
        return -1;

    // Only compare as many bytes as were actually read
    if (dwBytesRead >= 3 && memcmp(header, utf8Txt, 3) == 0)
        return 3; // UTF-8 file
    if (dwBytesRead >= 2 && memcmp(header, uniTxt, 2) == 0)
        return 1; // Unicode file
    if (dwBytesRead >= 2 && memcmp(header, endianTxt, 2) == 0)
        return 2; // Unicode big endian file
    return 0;     // no BOM: treated as ANSI
}
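
The same header check does not depend on the Windows API. As a minimal sketch, assuming the first bytes of the file are already in memory (or are read with a standard stream in binary mode), it could look like the following. The names TextEncoding, detect_bom and detect_file_bom are hypothetical and are not part of the listing above.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical names for illustration; the values mirror the codes used above.
enum class TextEncoding { Ansi = 0, Utf16LE = 1, Utf16BE = 2, Utf8Bom = 3 };

// Classify a buffer by its leading byte order mark (BOM).
// 'bytes' holds the first bytes of a file that was read in binary mode.
TextEncoding detect_bom(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    const unsigned char b0 = n > 0 ? static_cast<unsigned char>(bytes[0]) : 0;
    const unsigned char b1 = n > 1 ? static_cast<unsigned char>(bytes[1]) : 0;
    const unsigned char b2 = n > 2 ? static_cast<unsigned char>(bytes[2]) : 0;

    if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF)
        return TextEncoding::Utf8Bom;
    if (n >= 2 && b0 == 0xFF && b1 == 0xFE)
        return TextEncoding::Utf16LE;
    if (n >= 2 && b0 == 0xFE && b1 == 0xFF)
        return TextEncoding::Utf16BE;
    // No BOM: this sketch cannot tell ANSI from UTF-8 without BOM.
    return TextEncoding::Ansi;
}

// Read at most 3 bytes from the file and classify them.
// A file that cannot be opened yields an empty buffer, i.e. Ansi.
TextEncoding detect_file_bom(const char* filename)
{
    std::ifstream in(filename, std::ios::binary);
    char header[3] = {0, 0, 0};
    in.read(header, sizeof(header));
    return detect_bom(std::string(header, static_cast<std::size_t>(in.gcount())));
}
```

Keeping the byte comparison separate from the file access makes it easy to check the logic on small in-memory buffers. Like the function above, it still reports a UTF-8 file without BOM as ANSI.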

Convert from one character encoding to another

Following the example at the beginning (reading a UTF-8 text file that contains Chinese with fstream's getline() gives gibberish characters, while an ANSI file is read correctly), only 2 conversions are given here, namely UTF-8 (with BOM) to ANSI and UTF-8 (without BOM) to ANSI. See the following code (based on [3] [4]). Conversions between UTF-8, UTF-16 and UTF-32 are covered in [5].

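#include <windows.h>
#include <cstring>
#include <fstream>
#include <string>
using namespace std;

// Convert a null-terminated UTF-8 string to ANSI (the system default code page).
// The caller owns the returned buffer and must release it with delete[].
char* change_encoding_from_UTF8_to_ANSI(const char* szU8)
{
    // Step 1: UTF-8 -> UTF-16. The first call only asks for the required length.
    int u8Len = static_cast<int>(strlen(szU8));
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, NULL, 0);
    wchar_t* wszString = new wchar_t[wcsLen + 1];
    ::MultiByteToWideChar(CP_UTF8, 0, szU8, u8Len, wszString, wcsLen);
    wszString[wcsLen] = L'\0';

    // Step 2: UTF-16 -> ANSI. Characters that the code page cannot represent
    // are replaced by the system default character (usually '?').
    int ansiLen = ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, NULL, 0, NULL, NULL);
    char* szAnsi = new char[ansiLen + 1];
    ::WideCharToMultiByte(CP_ACP, 0, wszString, wcsLen, szAnsi, ansiLen, NULL, NULL);
    szAnsi[ansiLen] = '\0';

    delete[] wszString; // the intermediate UTF-16 buffer is no longer needed
    return szAnsi;
}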

// read UTF-8 (with BOM) file and convert it to be in ANSI
void change_text_file_encoding_from_UTF8_with_BOM_to_ANSI(const char* filename)
{
    ifstream infile(filename);
    if (!infile)
        return; // nothing to convert if the file cannot be opened

    string strLine;
    string strResult;
    // Loop on getline() itself; testing eof() first appends a spurious extra line.
    while (getline(infile, strLine))
        strResult += strLine + "\n";
    infile.close();

    // The first 3 bytes (EF BB BF) are the UTF-8 header flags, not text,
    // so delete them before converting.
    if (strResult.compare(0, 3, "\xEF\xBB\xBF") == 0)
        strResult.erase(0, 3);

    // &strResult[0] points to the string's null-terminated buffer
    char* changeResult = change_encoding_from_UTF8_to_ANSI(&strResult[0]);
    strResult = changeResult;
    delete[] changeResult; // allocated with new[] by the conversion function

    ofstream outfile(filename); // overwrites the original file
    outfile.write(strResult.c_str(), strResult.length());
    outfile.close();
}

// read UTF-8 (without BOM) file and convert it to be in ANSI
void change_text_file_encoding_from_UTF8_without_BOM_to_ANSI(const char* filename)
{
    ifstream infile(filename);
    if (!infile)
        return; // nothing to convert if the file cannot be opened

    string strLine;
    string strResult;
    // Loop on getline() itself; testing eof() first appends a spurious extra line.
    while (getline(infile, strLine))
        strResult += strLine + "\n";
    infile.close();

    // &strResult[0] points to the string's null-terminated buffer
    char* changeResult = change_encoding_from_UTF8_to_ANSI(&strResult[0]);
    strResult = changeResult;
    delete[] changeResult; // allocated with new[] by the conversion function

    ofstream outfile(filename); // overwrites the original file
    outfile.write(strResult.c_str(), strResult.length());
    outfile.close();
}
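
The two file functions above differ only in whether the 3 header bytes are dropped. As a minimal sketch, that step could be isolated in a small helper that checks for the header before removing it, so that one routine might serve both kinds of file. The name strip_utf8_bom is hypothetical and is not part of the code above.

```cpp
#include <string>

// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) if present.
// Returns true when a BOM was found and removed, false otherwise.
bool strip_utf8_bom(std::string& text)
{
    static const char kBom[] = "\xEF\xBB\xBF";
    // compare() looks at no more than the first 3 characters of 'text',
    // so strings shorter than the BOM are handled safely.
    if (text.compare(0, 3, kBom) == 0)
    {
        text.erase(0, 3);
        return true;
    }
    return false;
}
```

Calling it on the whole file content before the conversion leaves a file without BOM unchanged, and it avoids the out-of-range error that substr(3) raises when the first line is shorter than 3 bytes.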

References

  1. What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text: http://kunststube.net/encoding/