Detect the encoding of a file

Well this is a doozy, and bound to require constant upkeep.

I always start off reading files like this...

using (var sr = new StreamReader(fileName))

Then someone complains that their non-ASCII files weren't read correctly. I ask for example files, perform some tests and end up with this:

using (var sr = new StreamReader(fileName, System.Text.Encoding.UTF8))

It works for a while, then I receive more complaints and more test files. I see that it doesn't work for the new test files. Hmm.

I studiously avoid thinking about "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

I can open the file in Notepad++ and see, under the conveniently named "Encoding" menu, exactly which encoding Notepad++ decides on for a given file.

I've learned that some files have byte order marks that give pretty important clues about the encoding.
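
For what it's worth, the .NET documentation describes StreamReader as doing some of this already: new StreamReader(fileName) assumes UTF-8, and by default the reader checks the first bytes for a UTF-8, UTF-16 or UTF-32 byte order mark before falling back to the encoding it was given. That would explain why passing Encoding.UTF8 explicitly changes so little. Here is a minimal sketch that asks a StreamReader which encoding it settled on (ReaderProbe and DetectedEncoding are names made up for this example):

```csharp
using System.IO;
using System.Text;

public static class ReaderProbe
{
    // Returns the encoding StreamReader ends up using for the file.
    // "fallback" only applies when the file has no BOM that StreamReader recognizes.
    public static Encoding DetectedEncoding(string fileName, Encoding fallback)
    {
        using (var sr = new StreamReader(fileName, fallback, detectEncodingFromByteOrderMarks: true))
        {
            sr.Read(); // detection happens on the first read, so force one
            return sr.CurrentEncoding;
        }
    }
}
```

For a file saved as UTF-16 with a BOM, DetectedEncoding(fileName, Encoding.UTF8) should report UTF-16 rather than the UTF-8 fallback. For a file with no BOM it simply echoes the fallback, which is where the trouble starts.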

Here's my re-creation of the table from Wikipedia: Byte order marks by encoding

| Encoding    | Representation (hexadecimal) | Representation (decimal) | Might look like...                               |
|-------------|------------------------------|--------------------------|--------------------------------------------------|
| UTF-8       | EF BB BF                     | 239 187 191              | ï»¿                                              |
| UTF-16 (BE) | FE FF                        | 254 255                  | þÿ                                               |
| UTF-16 (LE) | FF FE                        | 255 254                  | ÿþ                                               |
| UTF-32 (BE) | 00 00 FE FF                  | 0 0 254 255              | NULNULþÿ (where NUL means the null character)    |
| UTF-32 (LE) | FF FE 00 00                  | 255 254 0 0              | ÿþNULNUL                                         |
| UTF-7       | 2B 2F 76                     | 43 47 118                | +/v                                              |
| UTF-1       | F7 64 4C                     | 247 100 76               | ÷dL                                              |
| UTF-EBCDIC  | DD 73 66 73                  | 221 115 102 115          | Ýsfs                                             |
| SCSU        | 0E FE FF                     | 14 254 255               | ^Nþÿ (where ^N is the shift out character)       |
| BOCU-1      | FB EE 28                     | 251 238 40               | ûî(                                              |
| GB-18030    | 84 31 95 33                  | 132 49 149 51            | „1•3                                             |
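
The rows for the encodings that .NET ships with can be cross-checked from code, since each Encoding object exposes the BOM it would write through GetPreamble(). A small sketch (BomTable and PreambleHex are hypothetical names, not part of any library):

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Formats the BOM (preamble) an encoding would write as spaced hex bytes.
    // Encodings without a BOM, such as ASCII, yield an empty string.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }
}
```

BomTable.PreambleHex(Encoding.UTF8) gives "EF BB BF" and BomTable.PreambleHex(Encoding.Unicode) gives "FF FE", matching the UTF-8 and UTF-16 (LE) rows above.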

Wikipedia points out that for UTF-8, UTF-7, UTF-1, UTF-EBCDIC, SCSU, BOCU-1 and GB-18030 these starting bytes are not literally a "byte order mark": these encodings are byte-oriented, so there is no byte order to indicate. Instead, in those cases the bytes act as a signature that identifies the encoding.

(But this doesn't help if no byte order mark is present.)
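
One common heuristic for BOM-less files is to try decoding the bytes as strict UTF-8 and treat a failure as a sign of some legacy encoding. This is only a sketch under assumptions: BomlessGuess and GuessWithoutBom are made-up names, the "legacy" argument is the caller's own guess about what non-UTF-8 files are likely to be, and the whole file is read into memory.

```csharp
using System.IO;
using System.Text;

public static class BomlessGuess
{
    // Guesses the encoding of a file that has no BOM.
    // "legacy" is an assumption supplied by the caller, not something detected.
    public static Encoding GuessWithoutBom(string fileName, Encoding legacy)
    {
        byte[] bytes = File.ReadAllBytes(fileName);

        // throwOnInvalidBytes: true makes decoding fail on malformed input
        // instead of silently substituting replacement characters.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(bytes);
            return strictUtf8; // every byte sequence was valid UTF-8
        }
        catch (DecoderFallbackException)
        {
            return legacy; // not valid UTF-8, so fall back to the caller's guess
        }
    }
}
```

Pure ASCII files pass the UTF-8 test, which is harmless because ASCII text decodes identically as UTF-8. The heuristic cannot tell one single-byte code page from another, so the fallback remains a guess. Note also that on .NET Core and later, code pages such as Windows-1252 are only available after registering CodePagesEncodingProvider.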

On Stack Overflow I found an answer, with a lot of upvotes, which almost works for me. It relies on the presence of a BOM as listed above and, failing that, returns a default encoding.

The change I made to the fallback is marked with a comment below:

// Requires: using System.IO; using System.Text;

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Falls back to Encoding.Default when no known BOM is found.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
private static Encoding GetEncoding(string filename)
{
	// Read the BOM (bytes that could not be read stay 0)
	var bom = new byte[4];
	int read;
	using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
	{
		read = file.Read(bom, 0, 4);
	}

	// Analyze the BOM. UTF-32LE (FF FE 00 00) must be tested before UTF-16LE (FF FE),
	// because the first two bytes are the same.
	if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7; // marked obsolete in .NET 5+
	if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
	if (read == 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE
	if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
	if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
	if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE
	return Encoding.Default; // **Changed this line**: system ANSI code page on .NET Framework, UTF-8 on .NET Core and later
}

And use it thus:

var encoding = GetEncoding(fileName);
using (var sr = new StreamReader(fileName, encoding)) // previously System.Text.Encoding.UTF8
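
One way to gain some confidence in a BOM detector is a round trip: write a file with a known encoding, then check that the detector reports the same one. The sketch below is self-contained, so it carries its own table-driven detector built on Encoding.GetPreamble(). BomSniffer and FromBom are hypothetical names, and the candidate list covers only the BOMs that .NET writes itself.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class BomSniffer
{
    // Ordered longest BOM first, so UTF-32LE (FF FE 00 00)
    // is tried before UTF-16LE (FF FE).
    private static readonly Encoding[] Candidates =
    {
        Encoding.UTF32,                 // FF FE 00 00
        new UTF32Encoding(true, true),  // 00 00 FE FF
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // FF FE
        Encoding.BigEndianUnicode       // FE FF
    };

    // Returns the first candidate whose BOM matches the start of the file,
    // or "fallback" when none matches.
    public static Encoding FromBom(string fileName, Encoding fallback)
    {
        var head = new byte[4];
        int read;
        using (var file = File.OpenRead(fileName))
        {
            read = file.Read(head, 0, head.Length);
        }

        foreach (var candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (read >= bom.Length && head.Take(bom.Length).SequenceEqual(bom))
                return candidate;
        }
        return fallback;
    }
}
```

A round trip then looks like File.WriteAllText(path, "héllo", Encoding.BigEndianUnicode) followed by BomSniffer.FromBom(path, Encoding.Default), which should hand back the big-endian UTF-16 encoding. Collecting the troublesome files people send in and running them through the same check makes it easier to see what each future change breaks.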

I am certain this will require further changes in future.

Bonus West Wind Version

Note that Rick Strahl has blogged a version of this here: Detecting Text Encoding for StreamReader

His work is always battle tested.
