I have a script that validates proper naming conventions in a large asset repository.
I generate a list of all file paths without problems, but as I parse the file names, I get errors such as this:
# Error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 81: ordinal not in range(128) #
I can skip over these by wrapping the parser in try/except and adding the offending file paths to a list,
but I can’t print that list of offending file paths because the same error crashes my script.
So if I have a list of file paths that I KNOW will cause this, how do I print them?
I haven’t actually solved this particular issue, but my guess is:
If you don’t know specifically how the data is encoded (likely UTF-8, since 0xc3 is a common UTF-8 lead byte for accented Latin characters, but maybe not), you can’t reliably turn it into Unicode, which means you can’t reliably print the actual value. Basically you have a byte string that you want to turn into text, but the default codec (ASCII) isn’t the correct one. You may be able to convert your values with the unicode() constructor by replacing the undecodable bytes with a placeholder character, or by removing them from the string entirely. This might print you a list of strings that you could manually use to track down the offending files:
for path in listOfOffendingFilepaths:
    result = unicode(path, errors='replace')  # or unicode(path, errors='ignore') to remove the bytes
    print result
If, for example, ‘X’ is an invalid byte, the string ‘c:/path/tXoX/file.obj’ would become u’c:/path/t\ufffdo\ufffd/file.obj’ with errors='replace', or u’c:/path/to/file.obj’ with errors='ignore'. Either way the result is mangled and someone would need to interpret it manually.
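To make the difference between the two error handlers concrete, here is a small sketch. The path is made up for illustration, and I’m using the bytes .decode() method rather than unicode() only so the same lines run under both Python 2 and Python 3; the behaviour of 'replace' and 'ignore' should be the same either way.

```python
# Hypothetical offending path: 0xc3 0xa9 is the UTF-8 encoding of "e" with an acute accent.
raw_path = b'c:/assets/caf\xc3\xa9/file.obj'

# 'replace' swaps each undecodable byte for U+FFFD, the Unicode replacement character.
replaced = raw_path.decode('ascii', 'replace')

# 'ignore' silently drops each undecodable byte.
ignored = raw_path.decode('ascii', 'ignore')

# The 'ignore' result is pure ASCII, so printing it cannot raise another encoding error.
print(ignored)

# Count the placeholders instead of printing them, to stay safe on an ASCII-only console.
print(replaced.count(u'\ufffd'))
```

Note that the ASCII codec treats every byte above 0x7f separately, so one accented character stored as two UTF-8 bytes turns into two placeholders (or two dropped bytes).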
Alternatively, you could attempt to decode the values by guessing the encoding, like:
for path in listOfOffendingFilepaths:
    result = unicode(path, encoding='utf_8', errors='strict')  # errors='strict' throws the 'UnicodeDecodeError' if it can't decode the value
    print result
In that case, if the encoding is indeed UTF-8, the actual value of the file path can be printed, provided the console you print to can display those characters.
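If you want to combine both ideas, something along these lines might work: try a short list of guessed encodings strictly, and only fall back to the lossy 'replace' decoding when every guess fails. The function names, the sample paths, and the candidate list ('utf-8', then 'cp1252') are my own assumptions, not anything from your script, so adjust them to whatever your asset pipeline is likely to produce. Again written with .decode() so it runs on Python 2 and 3.

```python
def decode_path(raw, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding_used); encoding_used is None if every guess failed."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc  # strict decoding is the default
        except UnicodeDecodeError:
            continue
    # No guess worked: fall back to a lossy decode so there is still something to show.
    return raw.decode('ascii', 'replace'), None


def printable(text):
    """Escape non-ASCII characters so the text prints on an ASCII-only console."""
    return text.encode('ascii', 'backslashreplace').decode('ascii')


# Hypothetical offending paths: the first is valid UTF-8, the second is not.
offending = [
    b'c:/assets/caf\xc3\xa9/file.obj',
    b'c:/assets/na\xefve/file.obj',
]

for raw in offending:
    text, enc = decode_path(raw)
    print('%s: %s' % (enc, printable(text)))
```

The printable() step is there because decoding successfully doesn’t guarantee that print will succeed: if the console itself is ASCII-only, printing the decoded text can raise a UnicodeEncodeError, so escaping the non-ASCII characters keeps the report from crashing a second time.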
Note that I haven’t actually tested any of the code, I’m just working off of the Python docs for Unicode support, but it may be a place to start.