Python 2.5 Win32
The unquote function in Python's standard urllib module performs percent-decoding. According to the RFC, the hex digits of a percent-encoded octet are case insensitive, which means %CE, %ce, %cE and %Ce are equivalent. urllib.unquote, however, fails to decode the mixed-case forms:
urllib.unquote('%cE') # -> '%cE'
urllib.unquote('%Ce') # -> '%Ce'
The correct result in both cases should be '\xce'. Why doesn't unquote tolerate mixed case? I had a look at the implementation.
_hextochr = dict(('%02x' % i, chr(i)) for i in range(256))
_hextochr.update(('%02X' % i, chr(i)) for i in range(256))

def unquote(s):
    """unquote('abc%20def') -> 'abc def'."""
    res = s.split('%')
    for i in xrange(1, len(res)):
        item = res[i]
        try:
            res[i] = _hextochr[item[:2]] + item[2:]
        except KeyError:
            res[i] = '%' + item
        except UnicodeDecodeError:
            res[i] = unichr(int(item[:2], 16)) + item[2:]
    return "".join(res)
The _hextochr dictionary contains all-lowercase keys and all-uppercase keys but no mixed-case keys, so a lookup such as _hextochr['cE'] raises KeyError and the escape is put back undecoded.
A workaround is to normalize the case of the hex digits before the lookup:
res[i] = _hextochr[item[:2].upper()] + item[2:]
UPDATE 2006-11-17 23:10:14 GMT+0800
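Another issue in the same module: the always_safe sequence that urllib.quote never escapes is defined as
always_safe = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
               'abcdefghijklmnopqrstuvwxyz'
               '0123456789' '_.-')
But according to RFC 2396 all unreserved characters are safe, and '~' is an unreserved character, so the always_safe sequence should also include '~'.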
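To see what the missing '~' means in practice, here is a rough sketch of a quoting function with the safe set passed in as a parameter. It is not urllib.quote: quote_sketch, ALWAYS_SAFE and the always_safe parameter are names invented for this illustration, and it only handles single-byte characters.

```python
# The safe set as listed above, copied under a hypothetical name.
ALWAYS_SAFE = ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'
               'abcdefghijklmnopqrstuvwxyz'
               '0123456789' '_.-')

def quote_sketch(s, safe='/', always_safe=ALWAYS_SAFE):
    """Percent-encode every character not in always_safe + safe."""
    keep = always_safe + safe
    # Assumes each character has an ordinal below 256.
    return ''.join([c if c in keep else '%%%02X' % ord(c) for c in s])

# With the current safe set the tilde is escaped.
escaped = quote_sketch('/~user/a b')

# With '~' added to the safe set it passes through untouched.
kept = quote_sketch('/~user/a b', always_safe=ALWAYS_SAFE + '~')
```

escaped comes out as '/%7Euser/a%20b' and kept as '/~user/a%20b'. Both forms decode to the same path, but the second is the one the RFC's unreserved set suggests.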