Recovering deleted git repository

gituser
Posts: 8
Joined: 14 Oct 2021, 11:53

Recovering deleted git repository

#1 Post by gituser »

It is rather easy to restore object hashes and run fsck on a recovered repository to find the latest commit, etc.

Git stores objects compressed with zlib, so each file starts with the magic bytes 0x78 0x01. I don't see these files among those recovered by PhotoRec. What would you recommend?
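
For reference, a minimal sketch of where those two bytes come from, assuming the standard zlib framing (RFC 1950): the low nibble of the first byte selects deflate, and the two bytes read as a big-endian number must be a multiple of 31. The second byte therefore varies with the compression level. The helper name looks_like_zlib_header is only for illustration.

```python
import zlib

def looks_like_zlib_header(data: bytes) -> bool:
    """Return True if the first two bytes form a valid zlib (RFC 1950) header."""
    if len(data) < 2:
        return False
    cmf, flg = data[0], data[1]
    # Low nibble 8 = deflate; the 16-bit header must be divisible by 31.
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0

# A git loose object is "<type> <size>\0<content>" compressed with zlib.
payload = b"hello\n"
loose_object = zlib.compress(b"blob %d\x00" % len(payload) + payload, 1)
print(loose_object[:2].hex(), looks_like_zlib_header(loose_object))
```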

recuperation
Posts: 2720
Joined: 04 Jan 2019, 09:48
Location: Hannover, Deutschland (Germany, Allemagne)

Re: Recovering deleted git repository

#2 Post by recuperation »

gituser wrote: 14 Oct 2021, 12:00 It is rather easy to restore object hashes and run fsck on a recovered repository to find the latest commit, etc.

Git stores objects compressed with zlib, so each file starts with the magic bytes 0x78 0x01. I don't see these files among those recovered by PhotoRec.
Works as designed: zlib is not a supported file format.
See here:
https://www.cgsecurity.org/wiki/File_Fo ... y_PhotoRec

Build your own custom signature as described in the TestDisk manual.

gituser
Posts: 8
Joined: 14 Oct 2021, 11:53

Re: Recovering deleted git repository

#3 Post by gituser »

Thanks for the quick reply!

gituser
Posts: 8
Joined: 14 Oct 2021, 11:53

Re: Recovering deleted git repository

#4 Post by gituser »

Hi there!
To my great surprise I was able to recover the deleted repository. I want to share my story and maybe contribute to the PhotoRec codebase. Here is what I did:

I made a photorec.sig file with these lines:

Code:

zlib 0 0x7801
zlib 0 0x785E
zlib 0 0x789C
zlib 0 0x78DA

Then I ran PhotoRec and selected only the first file type, "custom Own custom signatures", in the "[File Opt]" menu.
It found lots of zlib streams (with some garbage at the end) and saved them. I then processed them with a separate Python script that walks through the files, tries to decompress each one, calculates its hash and places it inside an empty .git repository. Not all files recovered by PhotoRec are git objects, and only the beginning of each file is the actual object: the smallest file is one cluster in size, and only its beginning is the zlib stream of a git object.
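
The trailing garbage does not get in the way of decompression. A minimal sketch, assuming Python's standard zlib module: a decompress object stops at the end-of-stream marker, sets eof, and exposes everything after it as unused_data. The function name split_stream and the 4096-byte cluster size are hypothetical, chosen only for the demonstration.

```python
import zlib

def split_stream(raw: bytes):
    """Decompress the zlib stream at the start of raw.

    Returns (decompressed, compressed_length), or None if raw does not
    begin with a complete, valid zlib stream.
    """
    dco = zlib.decompressobj()
    try:
        data = dco.decompress(raw)
    except zlib.error:
        return None
    if not dco.eof:  # the stream was cut short, e.g. by fragmentation
        return None
    return data, len(raw) - len(dco.unused_data)

# Simulate a recovered file: one stream padded with junk up to one cluster.
stream = zlib.compress(b"blob 6\x00hello\n")
recovered = stream + b"\xAA" * (4096 - len(stream))
data, used = split_stream(recovered)
print(data, used == len(stream))
```

The returned length tells how many bytes of the recovered file really belong to the object; the rest of the cluster can be discarded.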

Here is the script I used:

Code:

import glob
import os
import sys
import zlib
from hashlib import sha1

# srcdir: PhotoRec output directory; dstdir: directory containing an empty .git repository
srcdir, dstdir = sys.argv[1], sys.argv[2]

# A git object starts with "<type> <size>\0"
OBJECT_TYPES = (b"blob ", b"tree ", b"commit ", b"tag ")

def check_file(filename):
    if not os.path.isfile(filename):
        return

    print(filename, end='')
    try:
        with open(filename, "rb") as f:
            compressed_contents = f.read()
        # decompressobj stops at the end of the zlib stream and ignores trailing garbage
        dco = zlib.decompressobj()
        decompressed_contents = dco.decompress(compressed_contents)
    except (OSError, zlib.error):
        print()
        return

    if not decompressed_contents.startswith(OBJECT_TYPES):
        print()
        return

    # The object name is the SHA-1 of the decompressed contents (header included)
    hash_value = sha1(decompressed_contents).hexdigest()
    print(" ", hash_value)
    try:
        objdir = os.path.join(dstdir, ".git", "objects", hash_value[:2])
        objname = os.path.join(objdir, hash_value[2:])
        os.makedirs(objdir, exist_ok=True)
        with open(objname, "wb") as d:
            d.write(zlib.compress(decompressed_contents))
    except OSError:
        print("can't save object")

for filename in glob.iglob(os.path.join(srcdir, '**', '*'), recursive=True):
    check_file(filename)

At this point I had a repository with lots of dangling and unreferenced trees, blobs and commits from all the deleted repositories. I manually ran "git fsck", then "git log" on every dangling commit to find the most recent one, and pointed the master branch reference at it. Then I ran "git prune" to remove the remaining unreferenced objects, followed by "git reset --hard".
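
The manual search for the newest commit could probably be scripted. A minimal sketch, assuming the loose objects were written under .git/objects as above and that the committer timestamp is a good enough proxy for "most recent"; newest_commits and its parameters are hypothetical names, not part of git.

```python
import os
import re
import zlib

# Matches the header line "committer Name <email> <unix time> <timezone>".
COMMITTER_RE = re.compile(rb"^committer .*> (\d+) [+-]\d{4}$", re.MULTILINE)

def newest_commits(objects_dir, limit=5):
    """Return up to `limit` (timestamp, hash) pairs of loose commits, newest first."""
    found = []
    for prefix in os.listdir(objects_dir):
        subdir = os.path.join(objects_dir, prefix)
        if len(prefix) != 2 or not os.path.isdir(subdir):
            continue  # skip "pack", "info" and stray files
        for name in os.listdir(subdir):
            with open(os.path.join(subdir, name), "rb") as f:
                try:
                    obj = zlib.decompress(f.read())
                except zlib.error:
                    continue
            if not obj.startswith(b"commit "):
                continue
            match = COMMITTER_RE.search(obj)
            if match:
                found.append((int(match.group(1)), prefix + name))
    return sorted(found, reverse=True)[:limit]
```

Each candidate can then be inspected with "git log <hash>" before the branch reference is pointed at it.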


Now back to the PhotoRec code contribution. I hope the developer of this great tool visits the forum and can discuss it with me here. I've checked the code and it seems that PhotoRec could automate what I did above. Depending on how smart we want to be, there are multiple options:

1. Just do what my script does: decompress the zlib stream, check the object type and length in the header, and rename the recovered object file according to its hash value, leaving the rest to the user. In this case the user has to write a script to postprocess the recovered files and sort them into the objects folder of a git repository created separately.

2. Create a separate .git folder inside the destination recovery directory and populate its objects subdirectory with the objects found. The user would then only need to do the final step: manually run git commands and sort out the dangling commits.

3. After the whole recovery process is done (I'm not sure this is possible with the current PhotoRec implementation), go to the repository folder and also create branch references pointing to the top commits only. With enough luck we could even end up with a repository that checks out successfully. Git does not like missing objects, of course, but the user can copy objects from a backup, guess them, recover them separately, etc.

4*. We could also recover git pack files, which I didn't do; they have a different, distinct file format. We could also go through all the recovered text files, check which ones point to the commit hashes we found, and place them inside the repository under the reflog folder, etc.
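
Option 3 comes down to finding commits that no other recovered commit names as a parent. A minimal sketch, assuming the decompressed commit objects are already held in a dict keyed by hash; find_tips, ref_files and the "recovered-N" branch names are invented for illustration and are not PhotoRec or git API.

```python
import re

PARENT_RE = re.compile(rb"^parent ([0-9a-f]{40})$", re.MULTILINE)

def find_tips(commits):
    """Return hashes of commits that are not a parent of any other commit.

    `commits` maps a hex hash to the decompressed commit object.
    """
    parents = set()
    for obj in commits.values():
        # Only look at the header, which ends at the first blank line.
        header = obj.split(b"\n\n", 1)[0]
        parents.update(p.decode() for p in PARENT_RE.findall(header))
    return sorted(h for h in commits if h not in parents)

def ref_files(tips):
    """Map a ref path (relative to .git) to its file content for each tip."""
    return {f"refs/heads/recovered-{i}": f"{h}\n" for i, h in enumerate(tips)}
```

Writing one branch per tip keeps every recovered history reachable, so nothing is lost to "git prune" before the user has decided which branch is the real one.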

recuperation
Posts: 2720
Joined: 04 Jan 2019, 09:48
Location: Hannover, Deutschland (Germany, Allemagne)

Re: Recovering deleted git repository

#5 Post by recuperation »

Thank you for making your solution available for everybody!

The garbage at the end of some files results from a fragmented disk. Because PhotoRec is not aware of any metadata, it does not know where a zlib file ends. It keeps assigning the following data to the current zlib file until it finds another valid zlib signature, because you excluded all other file types:
Then I've run photorec and selected only the first "custom Own custom signatures" file type inside the "[File Opt]" menu.
You could allow all other file types as well. The zlib files may then shrink, because PhotoRec may find other signatures, finalizing the current zlib file and starting to recover a different file type.

gituser
Posts: 8
Joined: 14 Oct 2021, 11:53

Re: Recovering deleted git repository

#6 Post by gituser »

Small files are also bigger than the stream and have garbage at the end of the cluster. A zlib stream has an end-of-stream marker, and a git object has its length written at the beginning of the decompressed stream, so this is not a problem. A git object's filename (or rather its place inside the repository) can also be recovered by hashing its decompressed content. It is easy to add a new file format to PhotoRec and extract the file length and filename the same way PhotoRec processes other files. But is it possible to change PhotoRec's behavior so that it builds the whole repository folder structure instead of just recovering a bunch of separate files? Is any other complex folder structure recovery already implemented in PhotoRec?
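
Both checks mentioned here can be combined. A minimal sketch, assuming the loose object layout "<type> <size>\0<content>" and SHA-1 object names; parse_loose_object is a hypothetical helper, not part of PhotoRec or git.

```python
import hashlib
import zlib

VALID_TYPES = {b"blob", b"tree", b"commit", b"tag"}

def parse_loose_object(raw: bytes):
    """Return (type, hash, relative_path) for a git object at the start of raw, else None."""
    dco = zlib.decompressobj()
    try:
        data = dco.decompress(raw)
    except zlib.error:
        return None
    header, sep, body = data.partition(b"\x00")
    kind, _, size = header.partition(b" ")
    # Require a complete stream, a known type and a numeric size field.
    if not (dco.eof and sep and kind in VALID_TYPES and size.isdigit()):
        return None
    # The declared length must match what was actually decompressed.
    if int(size) != len(body):
        return None
    digest = hashlib.sha1(data).hexdigest()
    return kind.decode(), digest, f"objects/{digest[:2]}/{digest[2:]}"

garbage = b"\x00" * 100
print(parse_loose_object(zlib.compress(b"blob 6\x00hello\n") + garbage))
```

The length check rejects zlib streams that merely happen to start with a word like "blob", which the prefix test alone would let through.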

recuperation
Posts: 2720
Joined: 04 Jan 2019, 09:48
Location: Hannover, Deutschland (Germany, Allemagne)

Re: Recovering deleted git repository

#7 Post by recuperation »

gituser wrote: 17 Oct 2021, 14:30 But is it possible to change PhotoRec's behavior so that it builds the whole repository folder structure instead of just recovering a bunch of separate files?
PhotoRec is an armageddon tool that does not use metadata such as folder information. To restore folder information, try some other, commercial software.
Is any other complex folder structure recovery already implemented in PhotoRec?
Neither complex nor simple folder structures are implemented. PhotoRec works as designed. That's why you need other software if you want to do that.

gituser
Posts: 8
Joined: 14 Oct 2021, 11:53

Re: Recovering deleted git repository

#8 Post by gituser »

recuperation wrote: 18 Oct 2021, 07:39 To restore folder information, try some other, commercial software.
I am not asking for support. I have already solved the issue for myself. Now I am trying to figure out the best way to integrate this automation into PhotoRec, and I will make and submit a patch. I can write code that extracts the metadata and the position of the file inside the repository folder structure; it follows directly from the object hash.

PhotoRec creates lots of "recup_dir.XXX" directories to hold recovered files. I want to create a separate ".git" folder somewhere among those "recup_dir.XXX" folders and automatically populate it with recovered git object files, recreating the desired repository folder structure. Can a format extractor control the destination folder for a recovered file? From a quick look at the code it seems that this level of processing of recovered files is not quite possible, or I am missing something. I see that a file format processor can change a filename, but I am not sure whether it can set a directory name. Writing a new format extractor that only produces separate git object files is better than nothing but not very convenient: the user would need to sort them afterwards, and we would need to invent a file name extension for those recovered object files, since git objects originally have no extension in their names.

Are you familiar with the PhotoRec source code? At what point is the destination directory for a recovered file determined? If you can't answer and the author of the tool doesn't visit the forum, I will read the code more closely later.

recuperation
Posts: 2720
Joined: 04 Jan 2019, 09:48
Location: Hannover, Deutschland (Germany, Allemagne)

Re: Recovering deleted git repository

#9 Post by recuperation »

gituser wrote: 18 Oct 2021, 13:44
recuperation wrote: 18 Oct 2021, 07:39 To restore folder information, try some other, commercial software.
I am not asking for support. I have already solved the issue for myself. Now I am trying to figure out the best way to integrate this automation into PhotoRec, and I will make and submit a patch.
And how should I know if you don't say so right from the start? Your posting reads like those of the dozens of other users who want PhotoRec to be modified for their personal needs.

I can write code that extracts the metadata and the position of the file inside the repository folder structure; it follows directly from the object hash.

PhotoRec creates lots of "recup_dir.XXX" directories to hold recovered files. I want to create a separate ".git" folder somewhere among those "recup_dir.XXX" folders and automatically populate it with recovered git object files, recreating the desired repository folder structure. Can a format extractor control the destination folder for a recovered file? From a quick look at the code it seems that this level of processing of recovered files is not quite possible, or I am missing something. I see that a file format processor can change a filename, but I am not sure whether it can set a directory name. Writing a new format extractor that only produces separate git object files is better than nothing but not very convenient: the user would need to sort them afterwards, and we would need to invent a file name extension for those recovered object files, since git objects originally have no extension in their names.

Are you familiar with the PhotoRec source code?
No.
At what point is the destination directory for a recovered file determined? If you can't answer and the author of the tool doesn't visit the forum, I will read the code more closely later.
The author of the tool is Christophe Grenier. If you want to submit a patch, I would contact him before writing code. If you check the website https://www.cgsecurity.org you will certainly find ways of contacting him. You will understand that I refrain from posting his email address. He is not required to visit the forum, and even if he did, that would not guarantee that he reads your posting.

gituser
Posts: 8
Joined: 14 Oct 2021, 11:53

Re: Recovering deleted git repository

#10 Post by gituser »

recuperation wrote: 18 Oct 2021, 15:31 And how should I know if you don't say so right from the start? Your posting reads like those of the dozens of other users who want PhotoRec to be modified for their personal needs.
I wrote this in the first line of my comment:
gituser wrote: 15 Oct 2021, 05:19 I was able to recover the deleted repository. I want to share my story and maybe contribute to the PhotoRec codebase.
OK. I will check the code more closely later and mail him.
