Hi there!
To my great surprise, I was able to recover the deleted repository. I want to share my story and maybe contribute to the photorec codebase. Here is what I did:
I made a photorec.sig file with these lines:
Code: Select all
zlib 0 0x7801
zlib 0 0x785E
zlib 0 0x789C
zlib 0 0x78DA
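For anyone wondering why exactly these four values: as far as I understand the zlib format (RFC 1950), a stream starts with two bytes, CMF and FLG, and read together as a big-endian number they must be a multiple of 31. With CMF = 0x78 (deflate, 32 KiB window) and no preset dictionary, that leaves exactly one FLG value per compression-level hint. A small sketch to check this; the function name is just something I made up for illustration:

```python
def is_zlib_header(cmf: int, flg: int) -> bool:
    """Return True if the two bytes form a valid zlib (RFC 1950) header."""
    compression_method = cmf & 0x0F      # 8 means deflate
    window_bits = (cmf >> 4) + 8         # 0x7 in the high nibble -> 15 bits (32 KiB)
    return (
        compression_method == 8
        and window_bits <= 15
        and (cmf * 256 + flg) % 31 == 0  # FCHECK makes the pair a multiple of 31
    )

# The four signatures from photorec.sig, one per compression-level hint.
signatures = [0x7801, 0x785E, 0x789C, 0x78DA]
for sig in signatures:
    cmf, flg = sig >> 8, sig & 0xFF
    print(f"0x{sig:04X}", is_zlib_header(cmf, flg))
```

Two bytes is a very weak signature, so photorec will also match plenty of data that is not a zlib stream at all. That is why the decompression attempt afterwards is needed.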
Then I ran photorec and, in the "[File Opt]" menu, selected only the first file type, "custom Own custom signatures".
It found lots of zlib streams (with some garbage at the end) and saved them. I then processed them with a separate Python script that walks through the files, tries to decompress each one, calculates its hash, and places it inside an empty .git repository. Not all of the files recovered by photorec are git objects, and in those that are, only the beginning is the actual object: the smallest file is one cluster in size, and only its beginning is the zlib stream of a git object.
Here is the script I've used:
Code: Select all
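import glob
import os
import zlib
from hashlib import sha1

# Placeholder paths (not given in the original post): adjust before running.
srcdir = "path/to/photorec/output"   # directory with the files saved by photorec
dstdir = "path/to/empty/repository"  # directory that already contains an empty .git

def check_file(filename):
    if not os.path.isfile(filename):
        return
    print(filename, end='')
    try:
        with open(filename, "rb") as f:
            compressed_contents = f.read()
        # decompressobj() stops at the end of the zlib stream, so trailing garbage is ignored
        dco = zlib.decompressobj()
        decompressed_contents = dco.decompress(compressed_contents)
        # A git object starts with its type: "tree", "blob", "commit" or "tag"
        if not decompressed_contents.startswith((b"tree", b"blob", b"commit", b"tag")):
            print()
            return
        # The object name is the SHA-1 of the uncompressed object, header included
        hash_value = sha1(decompressed_contents).hexdigest()
        print(" ", hash_value)
        try:
            objdir = os.path.join(dstdir, f".git/objects/{hash_value[:2]}")
            objname = os.path.join(objdir, f"{hash_value[2:]}")
            os.makedirs(objdir, exist_ok=True)
            with open(objname, "wb") as d:
                object_data = zlib.compress(decompressed_contents)
                d.write(object_data)
        except OSError:
            print("can't save object")
    except Exception:
        # Not a zlib stream (or unreadable file): just finish the output line
        print()

for filename in glob.iglob(os.path.join(srcdir, '**/**'), recursive=True):
    check_file(filename)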
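A stricter variant of the check in the script could also confirm that the zlib stream is complete and that the length stored in the object header ("type size" followed by a zero byte) matches the payload. This is only a sketch; parse_git_object and GIT_TYPES are names I made up, not anything from git or photorec:

```python
import hashlib
import zlib

GIT_TYPES = (b"blob", b"tree", b"commit", b"tag")

def parse_git_object(raw: bytes):
    """Return (type, sha1_hex, payload) if raw starts with a complete
    loose git object, otherwise None. Trailing garbage is ignored."""
    dco = zlib.decompressobj()
    try:
        data = dco.decompress(raw)
    except zlib.error:
        return None
    if not dco.eof:              # the stream was cut off before its end
        return None
    header, sep, body = data.partition(b"\x00")
    if not sep:
        return None
    parts = header.split(b" ")
    if len(parts) != 2 or parts[0] not in GIT_TYPES or not parts[1].isdigit():
        return None
    if int(parts[1]) != len(body):   # declared length must match the payload
        return None
    return parts[0].decode(), hashlib.sha1(data).hexdigest(), body

# Simulate a carved file: one blob object followed by cluster garbage.
content = b"hello\n"
store = b"blob %d\x00" % len(content) + content
carved = zlib.compress(store) + b"\xff" * 512
print(parse_git_object(carved))
```

The length check should filter out most false positives where random data happens to decompress into something that starts with "tree" or "blob".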
At this point I had a repository with lots of dangling and unreferenced trees/blobs/commits from all the deleted repositories. I manually ran "git fsck", then "git log" on every dangling commit to find the most recent one, and switched the master branch reference to it. Then I ran "git prune" to remove the remaining unreferenced objects, and "git reset --hard".
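The manual search for the newest dangling commit could probably be scripted as well. Here is a rough sketch that reads the loose objects directly, collects every commit, and keeps only those that no other commit names as a parent, newest first. The function names are hypothetical, and it assumes all objects are loose (no pack files) and that the commit headers are well formed:

```python
import os
import re
import zlib

def read_loose_objects(objects_dir):
    """Yield (sha1_hex, type, body) for every loose object under objects_dir."""
    for sub in sorted(os.listdir(objects_dir)):
        subdir = os.path.join(objects_dir, sub)
        if len(sub) != 2 or not os.path.isdir(subdir):
            continue  # skip "pack", "info" and anything else
        for name in sorted(os.listdir(subdir)):
            with open(os.path.join(subdir, name), "rb") as f:
                data = zlib.decompress(f.read())
            header, _, body = data.partition(b"\x00")
            yield sub + name, header.split(b" ")[0].decode(), body

def find_tip_commits(objects_dir):
    """Return [(committer_timestamp, sha1_hex)] for commits that are not the
    parent of any other commit found, newest first."""
    commits, parents = {}, set()
    for sha, obj_type, body in read_loose_objects(objects_dir):
        if obj_type != "commit":
            continue
        # Header lines come before the first blank line; the message follows.
        head = body.split(b"\n\n", 1)[0].decode("utf-8", "replace")
        parents.update(re.findall(r"^parent ([0-9a-f]{40})$", head, re.M))
        m = re.search(r"^committer .* (\d+) [+-]\d{4}$", head, re.M)
        commits[sha] = int(m.group(1)) if m else 0
    tips = [(ts, sha) for sha, ts in commits.items() if sha not in parents]
    return sorted(tips, reverse=True)
```

Calling find_tip_commits on the .git/objects folder of the rebuilt repository gives a list of candidate branch heads. Since the objects come from several deleted repositories, there will normally be more than one tip, so picking the right one is still up to the user.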
Now back to the photorec code contribution. I hope the developer of this great tool visits the forum and can discuss it with me here. I've checked the code and it seems that photorec could automate what I did above. Depending on how smart we want to be, there are multiple options:
1. Just do what my script does: decompress the zlib stream, check the object type and length in the header, and rename the recovered object file according to its hash value, leaving the rest to the user. In this case the user has to be able to write a script that postprocesses the recovered files and sorts them into the objects folder of a git repository created separately.
2. Make a separate .git folder inside the destination recovery directory and populate its objects subdirectory with the objects found. The user would then only need to do the final step: manually run git commands and sort out the dangling commits.
3. After the whole recovery process is done (I'm not sure whether this is possible with the current photorec implementation), go to the repository folder and also create branch references pointing to the top commits only. With enough luck we could even get a repository that can be checked out successfully. Git does not like it when some objects are missing, of course, but the user can copy objects from a backup, guess them, recover them separately, etc.
4*. We could also recover git pack files, which I didn't do; they have a distinct file format of their own. We could also go through all the recovered text files, check which ones point to the commit hashes we found, and place them inside the repository under the reflog folder, etc.
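Regarding pack files in option 4: as far as I know, a pack starts with a 12-byte header, the 4-byte signature "PACK" followed by a version number (2 or 3) and the object count, both as 4-byte big-endian integers. A minimal sketch of recognizing that header (parse_pack_header is a made-up name, and the sample bytes are invented for the demo):

```python
import struct

def parse_pack_header(data: bytes):
    """Return (version, object_count) if data starts with a git pack
    header, otherwise None."""
    if len(data) < 12:
        return None
    magic, version, count = struct.unpack(">4sII", data[:12])
    if magic != b"PACK" or version not in (2, 3):
        return None
    return version, count

# Invented sample: a version 2 header announcing 3 objects.
sample = b"PACK" + struct.pack(">II", 2, 3) + b"...object data..."
print(parse_pack_header(sample))  # (2, 3)
```

The header alone does not say how long the pack is, so finding the end of a carved pack would take more work: the objects would have to be walked one by one up to the checksum trailer.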