automatically rename files after photorec
Posted: 20 Dec 2018, 20:09
My colleague used photorec to recover several thousand administrative files, some of them dating back to 1990.
Of course, almost all of them had filenames like f26490920.pdf, their creation dates were lost, and they were very difficult to use. So I wrote a small Python program that tries to find each file's date and guess an intelligible name for it. If you want to try it, I'd be happy if it could help someone else:
https://github.com/sanette/rename_by_content
It works by detecting the file type and metadata, and finally extracting the text content (my colleague had many scanned documents; for those, OCR is performed via tesseract). All files are copied into a new folder with year/month directories, so it is safe to try: the original files are not modified.
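To illustrate the copy step described above, here is a minimal sketch. It is not the actual code from rename_by_content: the function name copy_into_dated_folder and its parameters are hypothetical, and it assumes the date and the new name have already been guessed.

```python
import shutil
from datetime import date
from pathlib import Path


def copy_into_dated_folder(src, dest_root, file_date, new_stem):
    """Copy src to dest_root/YYYY/MM/new_stem.ext and return the new path.

    Hypothetical helper for illustration only; the original file is
    never moved or modified.
    """
    src = Path(src)
    target_dir = Path(dest_root) / f"{file_date.year:04d}" / f"{file_date.month:02d}"
    target_dir.mkdir(parents=True, exist_ok=True)

    # Keep the original extension; add a numeric suffix on name clashes.
    target = target_dir / f"{new_stem}{src.suffix}"
    counter = 1
    while target.exists():
        target = target_dir / f"{new_stem}_{counter}{src.suffix}"
        counter += 1

    shutil.copy2(src, target)  # copy2 also tries to preserve timestamps
    return target
```

For example, copy_into_dated_folder("f26490920.pdf", "sorted", date(1994, 2, 3), "facture") would create sorted/1994/02/facture.pdf and leave f26490920.pdf where it was.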
Supported file formats are: pdf, ai, doc, tar, zip, txt, mbox, ods, xls, xlsx, docx, docm, html, rtf, odt, png, jpg, gif, bmp, tif, ppt, pptx, odg
Of course, feedback is welcome.
PS: date recognition is tailored to the French format, but if anyone is interested in other languages, it should be straightforward to adapt. Just tell me.
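To give an idea of what French date recognition can look like, here is a rough sketch. It is an assumption on my part, not the code used in rename_by_content: the names FRENCH_MONTHS and find_french_date are made up for this example, and it only handles dates like "3 février 1994" or "15/06/1990" (day first).

```python
import re
from datetime import date

# Month names with and without accents (OCR output often drops accents).
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "fevrier": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8, "aout": 8,
    "septembre": 9, "octobre": 10, "novembre": 11,
    "décembre": 12, "decembre": 12,
}

_TEXT_DATE = re.compile(
    r"\b(1er|\d{1,2})\s+(" + "|".join(FRENCH_MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)
_NUMERIC_DATE = re.compile(r"\b(\d{1,2})[/.-](\d{1,2})[/.-](\d{4})\b")


def find_french_date(text):
    """Return the first French-style date found in text, or None."""
    match = _TEXT_DATE.search(text)
    if match:
        day = 1 if match.group(1).lower() == "1er" else int(match.group(1))
        month = FRENCH_MONTHS[match.group(2).lower()]
        year = int(match.group(3))
    else:
        match = _NUMERIC_DATE.search(text)
        if not match:
            return None
        # French numeric dates are day/month/year.
        day, month, year = (int(g) for g in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Impossible date such as 31/02/1990.
        return None
```

Adapting this to another language would mostly mean swapping the month table and, for numeric dates, the day/month order.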