Lions, Tigers and Unicode, Oh My!

Brian Pontarelli

Recently, I was working with a customer that had a URL slip through CleanSpeak’s URL filter. The URL looked something like this:

LameCasinoSite。com

The trick this user employed to get around our URL filter was using the Unicode character “。” (code point U+3002, encoded in UTF-8 as the bytes 0xE3 0x80 0x82). This character looks like a period but wasn’t in the list of valid URL separators that CleanSpeak handles.
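For anyone who wants to check those numbers, here is a small Python sketch that inspects the character. It uses only the standard library; the variable names are mine and have nothing to do with CleanSpeak’s code.

```python
import unicodedata

# The lookalike character from the URL above, written as an escape
# so it is unambiguous in source code.
ch = "\u3002"

code_point = ord(ch)              # integer value of the code point
utf8_bytes = ch.encode("utf-8")   # the bytes that travel over the wire
name = unicodedata.name(ch)       # official Unicode character name

print(f"U+{code_point:04X}")      # U+3002
print(utf8_bytes.hex())           # e38082
print(name)                       # IDEOGRAPHIC FULL STOP
print(ch == ".")                  # False: it is not an ASCII period
```

The last line is the whole problem in miniature: the character looks like a period to a human, but a filter comparing against "." never sees a match.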

My initial thought was to simply add the character to the list. That required me to look up the Unicode code point for it first. I then realized that there were a ton of other characters that also looked like periods. To handle this properly, I’d need to add all of them to the list. I also noticed that there were numerous other characters someone could use to trick the URL filter, such as arrows, pictures and symbols.
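To make the "add it to the list" idea concrete, here is a rough Python sketch of what that approach might look like. Everything in it is an assumption for illustration: the `DOT_LOOKALIKES` set, the `normalize_dots` and `find_domains` helpers, and the deliberately simplistic domain pattern are hypothetical and are not how CleanSpeak’s filter is implemented. The set is also far from complete, which is exactly the weakness described above.

```python
import re

# Hypothetical, non-exhaustive set of characters that can pass for a period.
# The first three are the dot equivalents that IDNA (RFC 3490) recognizes;
# the rest are further lookalikes, and many more exist.
DOT_LOOKALIKES = {
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uff0e",  # FULLWIDTH FULL STOP
    "\uff61",  # HALFWIDTH IDEOGRAPHIC FULL STOP
    "\u2024",  # ONE DOT LEADER
    "\ufe52",  # SMALL FULL STOP
}

# Translation table mapping each lookalike to an ASCII period.
_TRANSLATION = {ord(c): "." for c in DOT_LOOKALIKES}

# Toy pattern for illustration only; a real URL filter is far more involved.
DOMAIN_PATTERN = re.compile(r"\b[a-z0-9-]+\.(?:com|net|org)\b", re.IGNORECASE)


def normalize_dots(text: str) -> str:
    """Replace every known dot lookalike with an ASCII period."""
    return text.translate(_TRANSLATION)


def find_domains(text: str) -> list[str]:
    """Return domain-like strings found after normalizing lookalike dots."""
    return DOMAIN_PATTERN.findall(normalize_dots(text))


message = "visit LameCasinoSite\u3002com now"
print(DOMAIN_PATTERN.findall(message))  # [] because the raw text has no ASCII period
print(find_domains(message))            # ['LameCasinoSite.com']
```

This catches the original example, but only because U+3002 happens to be in the set. Any lookalike that is missing slips straight through, so the list has to grow every time someone finds a new one.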
