Filtering by character set: what, why and how
Click here to read more about SpamStopsHere, the e-mail security company that brings you this blog.
What is a character set?
Have you ever gotten e-mail that has a subject line of “??? ????? ???” or is in a language you cannot understand and wondered why someone would send e-mails in gibberish? Well, the e-mail isn’t gibberish; it is encoded in a character set that your computer doesn’t understand or cannot display.
Character sets allow the e-mail client, Web browser, or really any program that displays text, to know which character the author intended amongst the thousands of characters across all the languages of the world.
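To make this concrete, here is a small Python sketch (our own illustration, not part of any product) showing how the same bytes read correctly under one character set and turn into gibberish or question marks under another. The sample word and the choice of koi8-r are assumptions picked for the demonstration.

```python
# The same bytes mean different things under different character sets.
text = "Привет"                      # "Hello" in Russian
raw = text.encode("koi8-r")          # the bytes as they might travel in an e-mail

# Decoded with the character set the author used, the text comes back intact.
as_koi8r = raw.decode("koi8-r")

# Decoded with the wrong character set, every byte maps to an unrelated character.
as_latin1 = raw.decode("latin-1")

# A character set with no mapping for these letters can only substitute "?".
as_ascii = text.encode("ascii", errors="replace").decode("ascii")

print(as_koi8r)
print(as_latin1)
print(as_ascii)
```

The last line is the familiar run of question marks: the program had no way to represent the characters, so it replaced each one.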
Why filter based on them?
If you’ve ever used a voicemail system, you’ve probably run into the “Press one for English, press two for Klingon” menu. Filtering on character set is no different from selecting the language you speak. Why receive e-mails in Russian if you’ve never taken Russian 101? As a business, if you don’t have any customers or colleagues who speak a specific language and your employees don’t anticipate receiving e-mails in it, why even deliver those messages to the inbox? They can’t be understood, and they only waste resources: your servers process and store them, and your users spend time opening them.
How do I filter based on character sets?
This will differ based on your particular infrastructure, but all mail servers and most e-mail clients I’ve run across have ways to filter or sort e-mail. You can simply look for messages that declare a character set you don’t want and then either move them to a folder for review or reject them outright. We recommend rejecting the e-mail so that the sender knows that the message did not arrive.
If you happen to have an e-mail gateway or a hosted anti-spam filter like SpamStopsHere, you can create the rules there, taking even more load off your mail server itself.
One thing to note: this is NOT a spam filter. It is simply a policy that you are putting in place, saying that you do not want to receive messages claiming to be in a character set you do not understand. This is no different from some mail server admins blocking .zip files from entering their network to prevent virus outbreaks. Not all .zips have viruses, and not all viruses are in .zips, but it’s common enough that many admins will block them regardless.
The common character sets you may want to look for, and the ones we offer pre-built, are:
Russian/Cyrillic (Codes: windows-1251, iso-8859-5, ibm866, koi8-r)
Chinese (Codes: GB2312, GB2313, hz-gb-2312, big-5, ISO-2022-CN)
Korean (Codes: euc-kr, iso-2022-kr, ks_c_5601)
Hebrew (Codes: windows-1255, iso-8859-8)
Japanese (Codes: iso-2022-jp, x-euc, Shift_JIS)
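If you’re creating rules for this, note that the character set can be declared in two places: in the Content-Type header (for example, charset="koi8-r") and inside an encoded Subject line (for example, a subject that starts with =?koi8-r?B?). You’ll want to watch for it in both places.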
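As a rough illustration of such a rule, here is a minimal Python sketch using only the standard library. It is not how SpamStopsHere or any particular mail server implements the check; the names BLOCKED_PREFIXES, declared_charsets and is_blocked are hypothetical, and the blocklist is simply copied from the list above. Real mail may spell some names differently (for example "big5" rather than "big-5"), so adjust the entries to what you actually see.

```python
import re
from email import message_from_string

# Hypothetical blocklist: the character set names from the list above, lowercased.
# Matching is by prefix, so "ks_c_5601" also catches "ks_c_5601-1987".
BLOCKED_PREFIXES = (
    "windows-1251", "iso-8859-5", "ibm866", "koi8-r",        # Russian/Cyrillic
    "gb2312", "gb2313", "hz-gb-2312", "big-5", "iso-2022-cn",  # Chinese
    "euc-kr", "iso-2022-kr", "ks_c_5601",                    # Korean
    "windows-1255", "iso-8859-8",                            # Hebrew
    "iso-2022-jp", "x-euc", "shift_jis",                     # Japanese
)

# Encoded Subject text looks like =?charset?B?payload?= or =?charset?Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[bBqQ]\?[^?\s]*\?=")


def declared_charsets(raw_message):
    """Collect every character set the message claims to use."""
    msg = message_from_string(raw_message)
    found = set()
    # Place 1: the charset parameter of each part's Content-Type header.
    for part in msg.walk():
        charset = part.get_content_charset()  # lowercased, or None if absent
        if charset:
            found.add(charset)
    # Place 2: encoded words in the Subject header.
    subject = str(msg.get("Subject", ""))
    for match in ENCODED_WORD.finditer(subject):
        found.add(match.group(1).lower())
    return found


def is_blocked(raw_message):
    """True if any declared character set starts with a blocked name."""
    return any(cs.startswith(BLOCKED_PREFIXES)
               for cs in declared_charsets(raw_message))


# A made-up message: plain ASCII body, but a Subject encoded in koi8-r.
sample = (
    "From: someone@example.com\n"
    "Subject: =?koi8-r?B?8NLJ18XU?=\n"
    'Content-Type: text/plain; charset="us-ascii"\n'
    "\n"
    "Hello\n"
)
print(sorted(declared_charsets(sample)))
print(is_blocked(sample))
```

The sample message shows why both places matter: its Content-Type looks harmless, and only the Subject reveals the unwanted character set. What you do with a match (move it to a review folder or reject it) is a policy decision for your own mail server or gateway.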
For more information on character sets, see RFC 2045, the Unicode homepage or the Wikipedia article on character encoding.