A text file can be stored in many different character encodings, and Windows alone has several variants. Text files in different encodings need special care. For example, if we use fstream's getline() to read a UTF-8 text file that contains Chinese, we get gibberish characters, whereas the same text is read correctly if the file is in ANSI.
In this post, I will show how to detect the character encoding of a given text file and how to convert it from one character encoding to another.
Get the character encoding of a text file
In many cases, we cannot be sure which character encoding a file uses. The method below only works when the file is in one of 4 basic encodings: ANSI, Unicode (UTF-16 little endian), Unicode big endian, and UTF-8 (with BOM). These are the 4 encodings Notepad supports. For a file outside that set the result is not guaranteed to be correct, e.g. a UTF-8 file without BOM is reported as ANSI. See the following code (based on [1][2]).
// 0 - ANSI
// 1 - Unicode
// 2 - Unicode big endian
// 3 - UTF-8
// NOTE: only correct for the normal encodings (i.e. Notepad's 4 types),
// e.g. UTF-8 with BOM; a UTF-8 file without BOM will be considered as ANSI
int get_text_file_encoding(const char *filename)
{
    int nReturn = -1;
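The BOM check behind this approach can be sketched as follows. This is a minimal sketch, not the code from [1][2]: the helper names detect_encoding_from_bom and detect_file_encoding are hypothetical, and it assumes the return codes listed above, with -1 when the file cannot be opened.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>

// Hypothetical helper: classify a buffer by its leading byte order mark.
// Returns 0 (ANSI), 1 (Unicode, i.e. UTF-16 LE), 2 (Unicode big endian),
// or 3 (UTF-8 with BOM). Anything without a known BOM is reported as ANSI,
// so UTF-8 without BOM is reported as ANSI as well.
int detect_encoding_from_bom(const unsigned char* buf, std::size_t len)
{
    assert(buf != nullptr || len == 0);
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return 3; // EF BB BF: UTF-8 BOM
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return 1; // FF FE: UTF-16 little endian BOM
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return 2; // FE FF: UTF-16 big endian BOM
    return 0;     // no BOM found: assume ANSI
}

// Hypothetical wrapper: read at most 3 bytes of the file in binary mode
// and classify them. Returns -1 if the file cannot be opened.
int detect_file_encoding(const char* filename)
{
    assert(filename != nullptr);
    std::ifstream in(filename, std::ios::binary);
    if (!in)
        return -1;
    unsigned char buf[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(buf), sizeof buf);
    const std::size_t n = static_cast<std::size_t>(in.gcount());
    return detect_encoding_from_bom(buf, n);
}
```

Opening the file in binary mode matters here, because text mode may translate bytes before the BOM is inspected. An empty file or one shorter than the BOM falls through to 0 (ANSI).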
As in the opening example, fstream's getline() returns gibberish characters for a UTF-8 text file that contains Chinese but reads an ANSI file correctly. For that reason, only 2 conversions are given below: UTF-8 (with BOM) to ANSI and UTF-8 (without BOM) to ANSI. See the following code (based on [3][4]). For conversions between UTF-8, UTF-16 and UTF-32, see [5].
// read a UTF-8 (with BOM) file and convert it to ANSI
void change_text_file_encoding_from_UTF8_with_BOM_to_ANSI(const char* filename)
{
    ifstream infile;
    string strLine = "";
    string strResult = "";
    infile.open(filename);
    if (infile)
    {
        // the first 3 bytes (ef bb bf) are the UTF-8 header flags (BOM)
        // all the others are single byte ASCII code.
        // should delete these 3 when output
        getline(infile, strLine);
        strResult += strLine.substr(3) + "\n";
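The BOM-stripping step can be sketched on its own as below. This is a minimal sketch under the same assumption as the comment above, namely that everything after the BOM is single-byte ASCII. The helper names strip_utf8_bom and rewrite_without_bom are hypothetical and not taken from [3][4]. For content with non-ASCII characters, dropping the BOM is not enough: the bytes themselves have to be converted, which on Windows is typically done with MultiByteToWideChar followed by WideCharToMultiByte.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: return a copy of text without a leading UTF-8 BOM
// (EF BB BF). Text that does not start with the BOM is returned unchanged.
std::string strip_utf8_bom(const std::string& text)
{
    if (text.size() >= 3 && text.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return text.substr(3);
    return text;
}

// Hypothetical helper: rewrite the file in place without its UTF-8 BOM.
// Assumes the remaining content is ASCII, so no byte conversion is done.
// Returns false if the file cannot be read or written.
bool rewrite_without_bom(const char* filename)
{
    assert(filename != nullptr);
    std::ifstream in(filename, std::ios::binary);
    if (!in)
        return false;
    // Read the whole file into memory before reopening it for writing.
    const std::string content((std::istreambuf_iterator<char>(in)),
                              std::istreambuf_iterator<char>());
    in.close();

    std::ofstream out(filename, std::ios::binary | std::ios::trunc);
    if (!out)
        return false;
    out << strip_utf8_bom(content);
    return static_cast<bool>(out);
}
```

Checking for the BOM before calling substr(3) keeps the helper safe on files that have no BOM or are shorter than 3 bytes, so the same function can be applied to both the with-BOM and without-BOM cases.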