I have a folder that contains multiple pdf files. I am trying to create code that will help me sanitize these filenames using a whitelist. Can anyone help? This is the code I have thus far:
import string
import unicodedata
import os
valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
char_limit = 255
os.chdir('dir')
def clean_filename(filename, whitelist=valid_filename_chars, replace=' '):
for r in replace:
filename = filename.replace(r,'_')
cleaned_filename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore').decode()
cleaned_filename = ''.join(c for c in cleaned_filename if c in whitelist)
if len(cleaned_filename)>char_limit:
print("Warning, filename truncated because it was over {}. Filenames may no longer be unique".format(char_limit))
return cleaned_filename[:char_limit]
clean_filename(filename)
So you want spaces to become underscores and everything else not in your whitelist to go away?
A lot of people would do this with a regex, which is clunky in some ways but good for these kinds of things. However it’s often good to write the code for practice. In this case I guess your ascii normalization line will eliminate things like accents or cedillas or whatever, though I’m not sure you’ll always get back what you’re expecting thanks to the magic of unicode. But if you do that first, you now only have to account for a blacklist of unwanted ascii punctuation, which should be shorter than all of a-zA-Z0-9 + your literals.
To loop over the files in a folder, you would use os.listdir(), and to rename the files if necessary you’d use os.rename() . It’s a bit of extra work, but less trouble in the long run, not to use the current working directory. But to do something in any folder it’d look something likes this
def rename_folder_contents(folder):
for filename in os.listdir(folder):
new_name = clean_filename(filename)
if new_name != filename:
full_old_path = os.path.join(folder, filename)
full_new_path = os.path.join(folder, new_name)
os.rename(full_old_path, full_new_path)