cgsecurity.org

Posted: **20 Dec 2018, 20:09**

My colleague used photorec to recover several thousand administrative files, some of them dating back to 1990.
Of course, almost all of them had filenames like f26490920.pdf and their creation dates were lost, so they were very difficult to use. So I wrote a small Python program that tries to find each file's date and to guess an intelligible name for it. If you want to try it, I'd be happy if it could help someone else:

https://github.com/sanette/rename_by_content

It works by detecting the file type and metadata, and finally extracting the text content (my colleague had many scanned documents; for those, OCR is performed via tesseract). All files are copied into a new folder with year/month directories, so it is safe to try: the original files are not modified.
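To illustrate the "copy, never modify" idea, here is a minimal Python 3 sketch. It is not taken from rename_by_content; the function name `copy_by_date`, its parameters, and the exact `year/month` folder naming are assumptions made for the example. Only the `Unknown_year` folder name comes from the log shown later in this thread.

```python
import shutil
from pathlib import Path


def copy_by_date(src, out_root, when=None, new_stem=None):
    """Copy src into out_root/<year>/<month>/ and return the new path.

    The source file is only read, never moved or changed.
    Files without a known date go to out_root/Unknown_year/.
    """
    src = Path(src)
    if when is None:
        target_dir = Path(out_root) / "Unknown_year"
    else:
        # Hypothetical layout: four-digit year, two-digit month.
        target_dir = Path(out_root) / f"{when.year:04d}" / f"{when.month:02d}"
    target_dir.mkdir(parents=True, exist_ok=True)

    stem = new_stem or src.stem
    target = target_dir / (stem + src.suffix)
    n = 1
    # Avoid overwriting an earlier copy that received the same name.
    while target.exists():
        target = target_dir / f"{stem}_{n}{src.suffix}"
        n += 1

    shutil.copy2(src, target)  # copy2 also keeps the timestamps of the copy
    return target
```

For example, `copy_by_date("f26490920.pdf", "sorted", date(1990, 5, 17), "report")` (with `date` imported from `datetime`) would write `sorted/1990/05/report.pdf` and leave `f26490920.pdf` where it was.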

Supported file formats are: pdf, ai, doc, tar, zip, txt, mbox, ods, xls, xlsx, docx, docm, html, rtf, odt, png, jpg, gif, bmp, tif, ppt, pptx, odg

Of course, feedback is welcome.

PS: date recognition is tailored for the French format, but if anyone is interested in other languages, it should be straightforward to adapt. Just tell me.
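As a rough idea of what language-specific date recognition can look like, here is a small Python 3 sketch. It is not the code used in rename_by_content; the names `FRENCH_MONTHS`, `DATE_RE` and `find_french_date` are made up for the example, and it only handles dates written as day, month name, four-digit year.

```python
import re
from datetime import date

# Month names with and without accents; swap this table for another language.
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "fevrier": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8, "aout": 8,
    "septembre": 9, "octobre": 10, "novembre": 11,
    "décembre": 12, "decembre": 12,
}

# Matches e.g. "3 janvier 2019" or "1er août 1990".
DATE_RE = re.compile(
    r"\b(\d{1,2}|1er)\s+(" + "|".join(FRENCH_MONTHS) + r")\s+(\d{4})\b",
    re.IGNORECASE,
)


def find_french_date(text):
    """Return the first valid date written in French in text, or None."""
    for match in DATE_RE.finditer(text):
        day_s, month_s, year_s = match.groups()
        day = 1 if day_s.lower() == "1er" else int(day_s)
        try:
            return date(int(year_s), FRENCH_MONTHS[month_s.lower()], day)
        except ValueError:
            continue  # impossible date such as "31 février"; keep looking
    return None
```

Adapting such a sketch to another language would mostly mean replacing the month table and, where the word order differs, the regular expression.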

Posted: **02 Jan 2019, 11:43**

Hello,

I am testing your script on Ubuntu 18.04.1, in a home-made Bento Openbox version. The first attempt failed, with the files on an external hard drive formatted as ext4. The error messages mentioned something about permissions, so I formatted a hard drive as NTFS, copied all the files there, and I am about to start again. The hard drive is 1.8 TB and the data take 258 GB, so that should do. The hard drive is plugged into a dock.

Posted: **02 Jan 2019, 12:26**

Hello,

I ran into an error because a unicode decoder was not found, but I found on the internet that I could install it using pip.

https://github.com/reallistic/BitcasaFi ... /issues/29

Your script does not seem to come with a recursive option? The first time, it was not able to descend into the subdirectories created by photorec

(recup_dir.1 recup_dir.20 recup_dir.31 recup_dir.42 recup_dir.53
recup_dir.10 recup_dir.21 recup_dir.32 recup_dir.43 recup_dir.54
recup_dir.11 recup_dir.22 recup_dir.33 recup_dir.44 recup_dir.55
recup_dir.12 recup_dir.23 recup_dir.34 recup_dir.45 recup_dir.56
recup_dir.13 recup_dir.24 recup_dir.35 recup_dir.46 recup_dir.57
recup_dir.14 recup_dir.25 recup_dir.36 recup_dir.47 recup_dir.58
recup_dir.15 recup_dir.26 recup_dir.37 recup_dir.48 recup_dir.6
recup_dir.16 recup_dir.27 recup_dir.38 recup_dir.49 recup_dir.7
recup_dir.17 recup_dir.28 recup_dir.39 recup_dir.5 recup_dir.8
recup_dir.18 recup_dir.29 recup_dir.4 recup_dir.50 recup_dir.9
recup_dir.19 recup_dir.3 recup_dir.40 recup_dir.51
recup_dir.2 recup_dir.30 recup_dir.41 recup_dir.52)

That makes 258 GB.
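For reference, descending into all the recup_dir.* directories is easy to do in Python with `os.walk`. The sketch below is in Python 3 and only illustrates the idea; `iter_recovered_files` is a hypothetical helper, not a function or option of rename_by_content.

```python
import os


def iter_recovered_files(root):
    """Yield the path of every file under root, including subdirectories."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the walk order predictable
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Calling `list(iter_recovered_files("."))` from the folder that holds recup_dir.1, recup_dir.2 and so on would list every recovered file in one pass.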

Then I tried again on just one directory, copying it into a new test directory.

Here is the content of that new test directory:

$ ls -l
total 85
-rwxrwxrwx 1 fluffy1 fluffy1 12066 déc. 28 16:01 exiftool.py
-rwxrwxrwx 1 fluffy1 fluffy1 12410 janv. 2 11:54 exiftool.pyc
-rwxrwxrwx 1 fluffy1 fluffy1 534 janv. 2 12:08 log-renamebycontent.txt
-rwxrwxrwx 1 fluffy1 fluffy1 12506 déc. 28 16:06 os.path
drwxrwxrwx 1 fluffy1 fluffy1 0 janv. 2 12:03 recup_dir.1
drwxrwxrwx 1 fluffy1 fluffy1 0 janv. 2 12:06 recup_dir.1-2
-rwxrwxrwx 1 fluffy1 fluffy1 38181 déc. 28 16:01 rename_by_content.py
$

I had copied your script there too, to simplify the command line, then I invoked:

Code:

python ./rename_by_content.py --log log-renamebycontent.txt --output recup_dir.1-2/ recup_dir.1/*

The log file it created now contains this:

-------------------------------- Summary of renamed files: --------------------------------
[recup_dir.1/f0018360_pid_0.m2ts] was copied to [recup_dir.1-2/Unknown_year/f0018360_pid_0.mts] ()
[recup_dir.1/f0507438.m2ts] was copied to [recup_dir.1-2/Unknown_year/f0507438.mts] ()
[recup_dir.1/f4521516.m2ts] was copied to [recup_dir.1-2/Unknown_year/f4521516.mts] ()
[recup_dir.1/report.xml] was copied to [recup_dir.1-2/Unknown_year/report.xml] ()
------------------- Done. Copied 4 of 4 files to recup_dir.1-2/ ---------------------

The recup_dir.1-2 directory contains this:

$ ls -lR
.:
total 0
drwxrwxrwx 1 fluffy1 fluffy1 0 janv. 2 12:08 Unknown_year

./Unknown_year:
total 2263648
-rwxrwxrwx 1 fluffy1 fluffy1 249468928 janv. 1 19:15 f0018360_pid_0.mts
-rwxrwxrwx 1 fluffy1 fluffy1 2050582528 janv. 1 19:17 f0507438.mts
-rwxrwxrwx 1 fluffy1 fluffy1 17903616 janv. 1 19:17 f4521516.mts
-rwxrwxrwx 1 fluffy1 fluffy1 14389 janv. 1 19:17 report.xml

The source directory contains the same files.

$ ls -l
total 2263648
-rwxrwxrwx 1 fluffy1 fluffy1 249468928 janv. 1 19:15 f0018360_pid_0.m2ts
-rwxrwxrwx 1 fluffy1 fluffy1 2050582528 janv. 1 19:17 f0507438.m2ts
-rwxrwxrwx 1 fluffy1 fluffy1 17903616 janv. 1 19:17 f4521516.m2ts
-rwxrwxrwx 1 fluffy1 fluffy1 14389 janv. 1 19:17 report.xml

I might want to try with another directory, containing other types of files, I guess.

Just please tell me, did I do something wrong?

Thanks.

Posted: **02 Jan 2019, 14:29**

Hello,

This time, the Python script created new per-year subdirectories. Here is the log for this first (half-)successful iteration.
http://pastebin.fr/55468

Only a few mp3 files got human-readable names; as far as I can tell, the other files/extensions have only been sorted by year.

Are any more improvements possible?

Posted: **04 Jan 2019, 14:46**

Hello,
As no answer has come to my last three posts, I continued as I thought best. Now I have a few questions about this Python script, which is already an improvement even if it is still far from perfectly turning gibberish filenames into understandable ones.

The program can't work on all file formats: could that be improved?

The program overwrites the log, so I would like to ask whether some kind of append option could be added, so that the log file keeps growing, or whether a new, automatically numbered log file could be created each time the command is run again?
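Both behaviours asked for here are simple to express in Python 3. The sketch below is only an illustration of the request; `append_log` and `next_log_path` are hypothetical helpers and do not exist in rename_by_content.

```python
from pathlib import Path


def append_log(path, lines):
    """Add lines to the end of the log instead of overwriting it."""
    # Mode "a" creates the file if needed and always writes at the end.
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line + "\n")


def next_log_path(path):
    """Return path if it is unused, else the first free 'name-N.ext' variant."""
    path = Path(path)
    if not path.exists():
        return path
    n = 1
    while True:
        candidate = path.with_name(f"{path.stem}-{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```

With a log named log-renamebycontent.txt already present, `next_log_path("log-renamebycontent.txt")` would return log-renamebycontent-1.txt, then log-renamebycontent-2.txt on the following run, and so on.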

If you would like to continue the discussion in French, that is also possible.

Thank you for sharing!

Posted: **04 Jan 2019, 14:58**

Hello again,

Just so you know, the source directory contains more MB than the destination directory once the work is finished.

[fluffy1@shebang:/media/fluffy1/0BEB84160FCC8CA3]
$ du -csm DATA-Photorec
263383 DATA-Photorec
263383 total

$ du -csm _RECUP_DIR.2/
259478 _RECUP_DIR.2/
259478 total

$ bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
263383 - 259478
3905
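By default `du` reports the disk space allocated to the files rather than their byte sizes, so a gap like this one does not by itself say whether files are missing. One way to look closer is to compare the number of files and the total of their sizes in bytes on both sides. The Python 3 sketch below does that; `tree_stats` is a hypothetical helper written for this example.

```python
import os


def tree_stats(root):
    """Return (file_count, total_bytes) for all regular files under root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symbolic links so that nothing is counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                count += 1
                total += os.path.getsize(full)
    return count, total
```

Comparing `tree_stats("DATA-Photorec")` with `tree_stats("_RECUP_DIR.2")` would show whether the two trees differ in file count, in total bytes, or in both.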

Best regards,
Mélodie

automatically rename files after photorec