Various issues with PhotoRec / suggestions for improvement

BitterColdSoul
Posts: 50
Joined: 07 Jun 2020, 20:38
Location: France

Various issues with PhotoRec / suggestions for improvement

#1 Post by BitterColdSoul »

Running PhotoRec on a formatted 3TB HDD, previously scanned with R-Studio (which retrieved a good portion of the original filesystem, so the files it recovered were complete, except for those which had been partially or totally overwritten), I discovered several issues. (I was using PhotoRec 7.1 WIP at the time; I have yet to test version 7.2.)

– Sometimes the scan becomes extremely slow, and the counter showing the sector currently being read keeps oscillating between two ranges of values (in this case, between 3200000000 and 3800000000 approximately). From what I can understand, when a file is encountered which looks fragmented, PhotoRec tries to find the complementary fragment, sometimes parsing dozens or even hundreds of gigabytes, which seems absurd from an efficiency standpoint. It would be much quicker to first extract the identified files for which all clusters are located sequentially, then mark those areas as already processed in the “photorec.ses” session backup file, and only once all those files have been extracted, attempt to reconstruct fragmented files. (Actually, there should be an option to disable attempts to reconstruct fragmented files altogether, as those are unlikely to be successful anyway.) Or if that is not the cause of those oscillations (since I noticed that some files identified as “broken” were indeed recovered at the very end of the process), then it's even more of a mystery.
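
To make the idea concrete, here is a rough sketch in C of the two-pass scheduling I have in mind (the helper functions are hypothetical placeholders, not PhotoRec's actual internals):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t first_sector; uint64_t n_sectors; } range_t;

/* Hypothetical hooks, named for illustration only: */
bool carve_contiguous(uint64_t sector, range_t *out); /* cheap, strictly sequential   */
bool carve_fragmented(uint64_t sector, range_t *out); /* the expensive fragment hunt  */
void mark_done(range_t r);                            /* record range in photorec.ses */
bool is_done(uint64_t sector);                        /* sector already claimed?      */

void scan(uint64_t total_sectors, bool try_fragmented)
{
    range_t r;
    /* Pass 1: one linear sweep, contiguous files only. */
    for (uint64_t s = 0; s < total_sectors; s++)
        if (!is_done(s) && carve_contiguous(s, &r))
            mark_done(r);
    /* Pass 2: only now pay the cost of fragment hunting, and only in the
     * areas that pass 1 left unclaimed; skippable via the option
     * suggested above. */
    if (try_fragmented)
        for (uint64_t s = 0; s < total_sectors; s++)
            if (!is_done(s) && carve_fragmented(s, &r))
                mark_done(r);
}
```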

– In this recovery I got hundreds of “MP3” files with sizes ranging from 2 to 12KB, obviously invalid. When comparing in WinHex some files (video files in particular) identified as “broken” with their counterparts recovered completely by R-Studio (matched with DoubleKiller, a duplicate file finder, set to check the first few KB only), I found that the file recovered by PhotoRec – for instance a 1.1GB WMV file – was interrupted by what (in the valid file) merely looked like an MP3 header signature (“ÿû” in ASCII, or “FF FB” in hexadecimal); then a chunk was missing which happened to match one of these fake 2-12KB MP3 files, then the rest of the file was arbitrarily merged at the position where the missing chunk should have been; and at the end the file continued beyond its original limit, so that the total size was approximately identical despite the missing chunk (probably because the WMV header contains information about the total size of the file, which PhotoRec uses to determine the expected size of the extracted file). Of course the file extracted by PhotoRec was unreadable beyond this cut, and yet that file could have been easily recovered in its entirety since all its sectors were located sequentially (which I verified with R-Studio's hexadecimal viewer, in the “Sectors” tab). If another fake MP3 signature is found, the same process happens again – and so I've had some video files riddled with small holes, with a simultaneous extraction of dozens of fake MP3 files...
My current knowledge stops here; I don't know exactly what should be changed so that in such a case the file gets correctly recovered. Clearly, identifying MP3 files from only two bytes at the beginning of a cluster is insufficient and bound to generate many false positives (one of these fake MP3s began with “ÿóÔ” in ASCII, or “FF F3 D4” in hexadecimal, which doesn't correspond to any MP3 header I've ever seen). In such a case, the algorithm should be modified so that, when one file is interrupted by another of implausible size (2-12KB is implausible for an MP3 file: that's less than 1s at a 128kb/s bitrate), the extraction of the former file continues sequentially, discarding the obviously erroneous file as an artifact (or optionally extracting it as well for verification purposes, with a name indicating that it's most likely an invalid file, actually belonging to the larger file extracted right before). Or, for files which store their size in their header, PhotoRec should, right after parsing the header, verify whether the last sector (based on that information) corresponds to what would be expected (I'm not sure how it should be analyzed, since very few file types have an actual “end of file” marker, JPG files being among the rare ones that do... but I noticed empirically that the end of a video file usually has a lower “density” of data compared with a random segment in the middle, and there's generally an empty space right after the end of the file, though not always, for instance if the file has overwritten another there can be random data in “slack space”), in which case the file should be treated as entirely sequential, even if a hypothetical header of another type is identified during the extraction. Or, for MP3 files, it must be possible to check the presence of specific fields corresponding to the bitrate or the sampling rate.
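
To illustrate the kind of check I mean, here is a minimal, self-contained C sketch based on standard MPEG audio framing (not PhotoRec's code): it validates the full 4-byte frame header and, more importantly, requires a second frame header exactly where the first frame's computed length says it should be, something a random chunk of video data will almost never satisfy. (Interestingly, the “FF F3 D4” artifact above does parse as a syntactically valid MPEG2 Layer III header, which is exactly why a single header is not sufficient evidence.)

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Layer III bitrates in kbit/s (index 0 = "free", 15 = invalid). */
static const int kBitrateMpeg1[16] = {0,32,40,48,56,64,80,96,112,128,160,192,224,256,320,0};
static const int kBitrateMpeg2[16] = {0,8,16,24,32,40,48,56,64,80,96,112,128,144,160,0};
/* Sample rates indexed by version bits (0=MPEG2.5, 2=MPEG2, 3=MPEG1). */
static const int kSampleRate[4][3] = {{11025,12000,8000},{0,0,0},
                                      {22050,24000,16000},{44100,48000,32000}};

/* Returns the frame length in bytes for a plausible Layer III header, else 0. */
static int frame_len(const uint8_t h[4])
{
    if (h[0] != 0xFF || (h[1] & 0xE0) != 0xE0) return 0;  /* 11-bit sync    */
    unsigned ver = (h[1] >> 3) & 3;                       /* 1 = reserved   */
    unsigned layer = (h[1] >> 1) & 3;                     /* 1 = Layer III  */
    unsigned bri = (h[2] >> 4) & 15, sri = (h[2] >> 2) & 3, pad = (h[2] >> 1) & 1;
    if (ver == 1 || layer != 1 || bri == 0 || bri == 15 || sri == 3) return 0;
    int br = 1000 * (ver == 3 ? kBitrateMpeg1[bri] : kBitrateMpeg2[bri]);
    int sr = kSampleRate[ver][sri];
    return (ver == 3 ? 144 : 72) * br / sr + (int)pad;
}

/* A lone sync word is weak evidence; also require the next frame header. */
static bool plausible_mp3(const uint8_t *buf, size_t avail)
{
    int len = frame_len(buf);
    return len > 0 && (size_t)len + 4 <= avail && frame_len(buf + len) > 0;
}

int main(void)
{
    uint8_t buf[1024] = {0};
    const uint8_t hdr[4] = {0xFF, 0xFB, 0x90, 0x00}; /* MPEG1 L3, 128kb/s, 44.1kHz -> 417-byte frames */
    memcpy(buf, hdr, 4);
    memcpy(buf + 417, hdr, 4);                       /* second frame where expected   */
    printf("real stream: %d\n", plausible_mp3(buf, sizeof buf)); /* 1 */
    buf[417] = 0x00;                                 /* random data follows: artifact */
    printf("artifact:    %d\n", plausible_mp3(buf, sizeof buf)); /* 0 */
    return 0;
}
```
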
The same also happens frequently with fake JPG signatures (“ÿØÿ” in ASCII, or “FF D8 FF” in hexadecimal). In this case the valid file seems to be simply cut short, instead of cut and merged with what comes after as with fake MP3 signatures (which is actually preferable, because those arbitrary micro-cuts are otherwise very difficult to detect if the user doesn't have the corresponding valid files to compare with, which is generally the case when resorting to PhotoRec which, like the A-Team, works wonders when all else fails). It seems to me that some simple criteria would allow excluding most of those false positives (again, the density or entropy of data in a valid header is significantly lower than that of a chunk at a random position inside a video file, and a valid header should contain specific fields indicating the resolution and other values, which would be absent from an artifact). This is especially surprising since, judging by its name, PhotoRec was primarily conceived to recover picture files, and should therefore be particularly well optimized to distinguish valid picture files from artifacts. Those results also seem to contradict the information provided in the official description of the software, copied in the Wikipedia article; it says that a JPEG file is identified by 3 possible signatures: “FF D8 FF E0”, “FF D8 FF E1”, “FF D8 FF FE”, yet I've had erroneous .jpg files recovered by PhotoRec beginning with “FF D8 FF” followed by seemingly any byte, which significantly reduces the statistical specificity of the identification.
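
Again as an illustration, here is a sketch of the kind of sanity check I'm suggesting for JPG (standard JPEG marker structure, not PhotoRec's actual code): walk the segment chain from the SOI marker and only accept the candidate if a SOFn segment with plausible dimensions is reached:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool plausible_jpeg(const uint8_t *p, size_t n)
{
    if (n < 4 || p[0] != 0xFF || p[1] != 0xD8 || p[2] != 0xFF)
        return false;
    size_t i = 2;
    while (i + 4 <= n) {
        if (p[i] != 0xFF) return false;          /* segments must chain    */
        uint8_t m = p[i + 1];
        if (m == 0xFF) { i++; continue; }        /* fill byte              */
        if (m == 0xD9) return false;             /* EOI before any SOF     */
        if (m == 0x01 || (m >= 0xD0 && m <= 0xD7)) { i += 2; continue; } /* standalone markers */
        size_t len = ((size_t)p[i + 2] << 8) | p[i + 3];
        if (len < 2) return false;
        if (m >= 0xC0 && m <= 0xCF && m != 0xC4 && m != 0xC8 && m != 0xCC) {
            /* SOFn payload: length(2), precision(1), height(2), width(2) */
            if (i + 9 > n) return false;
            unsigned h = (p[i + 5] << 8) | p[i + 6];
            unsigned w = (p[i + 7] << 8) | p[i + 8];
            return h > 0 && w > 0 && h < 30000 && w < 30000;
        }
        i += 2 + len;
    }
    return false;   /* ran out of data before any SOF: treat as artifact */
}

int main(void)
{
    /* SOI + minimal APP0 + SOF0 declaring a 1024x768 image */
    const uint8_t good[] = {0xFF,0xD8, 0xFF,0xE0,0x00,0x04,'J','F',
                            0xFF,0xC0,0x00,0x11,0x08,0x03,0x00,0x04,0x00,0x03,
                            0,0,0,0,0,0,0,0,0};
    /* "FF D8 FF" followed by arbitrary bytes, like the artifacts above */
    const uint8_t fake[] = {0xFF,0xD8,0xFF,0x37,0x12,0x9A};
    printf("good: %d\n", plausible_jpeg(good, sizeof good));  /* 1 */
    printf("fake: %d\n", plausible_jpeg(fake, sizeof fake));  /* 0 */
    return 0;
}
```
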
I have also seen valid files interrupted by fake ZIP files (signature “50 4B 03 04”) or fake MPG files (signature “00 00 01 BA”).
It can also happen that a valid file is interrupted by the presence of an actual file of another type embedded within it in uncompressed form; for instance a PDF file containing JPG pictures is extracted with “holes” (compared with the original file) corresponding to those JPG pictures, which are extracted separately by PhotoRec (interestingly, the resulting PDF file is still readable, but all the pictures that were in the original file are missing).
In the case of ISO images the problem is compounded, as these can contain all types of files in uncompressed form, identifiable by their signatures, so even when a valid file is detected (for instance an MPG video inside a game ISO), the extraction of the ISO image stops and the resulting file is invalid, even though it was stored sequentially and could have been recovered entirely.
Once again, the default behaviour, for all files which store their size in their header, should be to extract sectors sequentially until the expected size is reached. As it is now, the only way to circumvent this kind of issue would be to run a complete analysis for each file type of interest, unchecking all the other file types in the options, which of course is not a practical solution, especially for a large-capacity HDD containing many different types of files, as it makes the whole process excessively long and painstaking.
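
A sketch of that default behaviour, with a hypothetical declared_size() standing in for the per-format header parser:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-format parser: returns the total file size declared
 * in the header (e.g. an ASF/WMV size field), or 0 if the format does
 * not carry one. Illustration only, not PhotoRec's API. */
uint64_t declared_size(const uint8_t *header, size_t n);

/* Carve exactly the declared number of bytes sequentially from 'start',
 * deliberately ignoring any signatures occurring inside the span. */
int carve_by_declared_size(FILE *img, long start, FILE *out)
{
    uint8_t buf[4096];
    if (fseek(img, start, SEEK_SET) != 0) return -1;
    size_t got = fread(buf, 1, sizeof buf, img);
    uint64_t remaining = declared_size(buf, got);
    if (remaining == 0) return -1;   /* no size field: fall back to signature carving */
    if (fseek(img, start, SEEK_SET) != 0) return -1;
    while (remaining > 0) {
        size_t chunk = remaining < sizeof buf ? (size_t)remaining : sizeof buf;
        if (fread(buf, 1, chunk, img) != chunk) return -1;  /* device ended early */
        if (fwrite(buf, 1, chunk, out) != chunk) return -1;
        remaining -= chunk;
    }
    return 0;
}
```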

These defects are really annoying for a program which has existed for more than 15 years and is recognized internationally as a reference in its category.

Other minor issues and suggestions for usability improvements:

– Beyond 2TB, the names of the extracted files no longer match their location / first sector number, most likely because of a 32-bit limit in the calculations, so a value of 2^32 = 4294967296 must be added to get the correct value. Because of this issue, it's possible to have two extracted files with the same name: for instance an f4194304.jpg file located at sector 4194304 (i.e. 2GB from the beginning of the volume), and an f4194304.jpg file located at sector 4194304 + 4294967296 = 4299161600 (i.e. 2050GB from the beginning of the volume), which should be named f4299161600.jpg to reflect its actual first sector number.
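
This is exactly the symptom of a 64-bit sector number being squeezed through 32-bit arithmetic somewhere; a trivial illustration:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t sector = 4194304ULL + 4294967296ULL;  /* a file beyond the 2TB mark  */
    uint32_t truncated = (uint32_t)sector;         /* 32-bit arithmetic somewhere */
    printf("f%" PRIu32 ".jpg\n", truncated);       /* f4194304.jpg    (collides)  */
    printf("f%" PRIu64 ".jpg\n", sector);          /* f4299161600.jpg (correct)   */
    return 0;
}
```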

– For “broken” files as well as JPG thumbnails, the naming scheme “b1234567.ext” or “t1234567.ext” is not practical: for thumbnails, it makes it impossible to have them appear (when files are sorted by name) next to the full-size JPG files which normally contain them, which start at the same sector and should have the same number in their name; and for “broken” files it complicates the identification of errors such as those described above, where an erroneously detected file signature interrupted the recovery of a file which is otherwise valid and contiguously stored on the analysed device. So it would be better to change the naming scheme to something like “f1234567.ext” (“broken”) and “f1234567[t].ext” (thumbnails).

– Regarding thumbnails, there should be an option to disable their extraction: these files are almost always contained in full-size JPG files which are correctly recovered by PhotoRec, so they have little use by themselves, they clutter the recovery folders, and their extraction seems to cause significant slow-downs in the analysis.

– The list of file types is getting very long and thus impractical; it would be good to organise it into categories, with a hierarchy similar to that of R-Studio, for instance:
Archive (7z, rar, zip...)
Development files (c, lib, res...)
Disk images (vhd, vdi...)
Document (doc, pdf...)
Document database (mdb, mdf, dbf...)
Document spreadsheet (xls, xlsx...)
Executable, Library, DLL (exe, dll, sys...)
Font (otf, ttf...)
Graphics, Picture (jpg, gif, png...)
Internet related files (html, dbx, mht, pst, msf, wab...)
Multimedia Audio (flac, mp3, ogg, wav, wma...)
Multimedia Video (avi, mp4, mkv, mpg, wmv...)
Other files

– Why is the photorec.ses file deleted right at the end of the process? It can be useful for verifying which areas were left unidentified at the end of the analysis, for future reference...

– PhotoRec should be able to detect the presence of a file with the same name and size already present in the extraction folder (but in a previously created sub-folder), and in that case not extract it again (for situations where one has to restart the recovery from the beginning or from an earlier location, for instance after modifying the list of activated file types).
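
A minimal sketch of that check (POSIX dirent/stat, looking into PhotoRec's recup_dir.N output sub-folders; purely illustrative):

```c
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Before writing a newly carved file, check whether a file with the same
 * name and size already exists in a previously created recup_dir.N
 * sub-folder of the destination directory. */
static bool already_recovered(const char *root, const char *name, off_t size)
{
    DIR *d = opendir(root);
    if (d == NULL) return false;
    struct dirent *e;
    bool found = false;
    while (!found && (e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, "recup_dir.", 10) != 0) continue;
        char path[4096];
        snprintf(path, sizeof path, "%s/%s/%s", root, e->d_name, name);
        struct stat st;
        if (stat(path, &st) == 0 && st.st_size == size)
            found = true;   /* same name and size: skip re-extraction */
    }
    closedir(d);
    return found;
}
```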

Regarding TestDisk:

– On the page http://www.cgsecurity.org/wiki/Advanced ... MFT_Repair, it is mentioned that TestDisk can repair the MFT on an NTFS partition by comparing it to its mirror, but in reality the MFT “mirror” is only a very partial copy, containing only the first 4 MFT records (corresponding to 4 crucial system files: $MFT = the MFT itself, $MFTMirr = the MFT “mirror”, $LogFile and $Volume), which are necessary for the operating system to access the partition, but cannot help with the recovery of personal files, unless some corruption affected very specifically the first few sectors of the MFT and didn't similarly affect the MFT “mirror”, which must be very rare. (Case in point: I've had a situation where the sole partition of a 2TB HDD had suddenly become unreadable; it turned out that MFT records 2 to 4 had been wiped, and wiped in exactly the same way in $MFTMirr, so TestDisk could not repair anything, and I had to fix it manually, after some in-depth research into the structure of MFT records in general, and those three in particular.)
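
For reference, a sketch of the only comparison the mirror actually enables, assuming the $MFT and $MFTMirr byte offsets have already been read from the NTFS boot sector (1024-byte FILE records being the common default):

```c
#include <stdio.h>
#include <string.h>

#define MFT_RECORD_SIZE  1024
#define MIRRORED_RECORDS 4     /* $MFT, $MFTMirr, $LogFile, $Volume */

/* Compare the first records of $MFT against $MFTMirr; returns how many
 * differ (and could thus be repaired from the intact copy), or -1 on a
 * read error. */
int compare_mft_with_mirror(FILE *dev, long mft_off, long mirr_off)
{
    unsigned char a[MFT_RECORD_SIZE], b[MFT_RECORD_SIZE];
    int repairable = 0;
    for (int i = 0; i < MIRRORED_RECORDS; i++) {
        if (fseek(dev, mft_off + (long)i * MFT_RECORD_SIZE, SEEK_SET) != 0
            || fread(a, 1, sizeof a, dev) != sizeof a)
            return -1;
        if (fseek(dev, mirr_off + (long)i * MFT_RECORD_SIZE, SEEK_SET) != 0
            || fread(b, 1, sizeof b, dev) != sizeof b)
            return -1;
        if (memcmp(a, b, sizeof a) != 0) {
            printf("record %d differs: one of the two copies may be intact\n", i);
            repairable++;
        } else if (memcmp(a, "FILE", 4) != 0) {
            /* Both copies lack the "FILE" magic: wiped identically, the
             * case described above where the mirror cannot help. */
            printf("record %d wiped in both copies\n", i);
        }
    }
    return repairable;
}
```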





recuperation
Posts: 2720
Joined: 04 Jan 2019, 09:48
Location: Hannover, Deutschland (Germany, Allemagne)

Re: Various issues with PhotoRec / suggestions for improvement

#2 Post by recuperation »

BitterColdSoul wrote: 08 Jun 2020, 02:59
> Sometimes the scan becomes extremely slow [...] when a file is encountered which looks fragmented, PhotoRec tries to find the complementary fragment, sometimes parsing dozens or even hundreds of gigabytes

I doubt it but can't tell you for sure.

> It would be much quicker to first extract the identified files for which all clusters are located sequentially [...] (Actually, there should be an option to disable attempts to reconstruct fragmented files altogether, as those are unlikely to be successful anyway.)

You did not understand my explanation. PhotoRec cannot recognize fragmentation, therefore such a switch is useless.

> – In this recovery I got hundreds of “MP3” files with sizes ranging from 2 to 12KB, obviously invalid. [...] I've had some video files riddled with small holes, with a simultaneous extraction of dozens of fake MP3 files...

You cannot distinguish legitimate file content that merely looks like a signature from a genuine signature at the start of a file.

> Or, for files which store their size in their header, PhotoRec should, right after parsing the header, verify whether the last sector (based on that information) corresponds to what would be expected (I'm not sure how it should be analyzed, since very few file types have an actual “end of file” marker, [...])

So this proposal is essentially useless.
> – Beyond 2TB, the names of the extracted files no longer match their location / first sector number, most likely because of a 32-bit limit in the calculations [...] it's possible to have two extracted files with the same name

It's not the sector number; read the documentation, section "11.11 PhotoRec: file name and date", to understand why there could be two files labeled identically.

> – The list of file types is getting very long and thus impractical; it would be good to organise it into categories, with a hierarchy similar to that of R-Studio [...]

This is completely pointless for a fingerprint reader.
You ignore the fact that, from a fingerprinting viewpoint, you can't distinguish a ZIP file from the later (zipped!) Office file formats.

> – On the page http://www.cgsecurity.org/wiki/Advanced ... MFT_Repair, it is mentioned that TestDisk can repair the MFT on an NTFS partition by comparing it to its mirror, but in reality the MFT “mirror” is only a very partial copy [...]

This is a valid point. The term "backup" needs to be replaced by "its partial, minimal backup".

As the source code for TestDisk and PhotoRec is freely accessible, you can program your ideas and test them, but you might want to dive deeper into the matter first, to avoid wasting time on ideas that cannot be realized.
Last edited by recuperation on 09 Jun 2020, 17:56, edited 1 time in total.

recuperation
Posts: 2720
Joined: 04 Jan 2019, 09:48
Location: Hannover, Deutschland (Germany, Allemagne)

Re: Various issues with PhotoRec / suggestions for improvement

#3 Post by recuperation »

.

BitterColdSoul
Posts: 50
Joined: 07 Jun 2020, 20:38
Location: France

Re: Various issues with PhotoRec / suggestions for improvement

#4 Post by BitterColdSoul »

I'm only reading the feedback now. Sorry if some statements were a bit harsh in my original post, but I maintain that this was all valid criticism based on thorough tests, made with enough knowledge, I think, to distinguish wishful thinking from potentially implementable improvements.
> You did not understand my explanation. PhotoRec cannot recognize fragmentation, therefore such a switch is useless.
Section 11.9: « During pass 1 and later, files are recovered including some fragmented files. »
Apparently, from my understanding, if a file ends before its expected size is reached (based on metadata), PhotoRec tries to find the next matching fragment -- and generally fails, since the next fragment could be located anywhere (sometimes even before the beginning of the file, making it almost impossible to find programmatically, unless the algorithm is aware of the specific structure of a given file type and knows what to expect, which must be tremendously more complex, and is certainly beyond the scope of a freeware).
> You cannot distinguish legitimate file content that merely looks like a signature from a genuine signature at the start of a file. So this proposal is essentially useless.
I provided specific scenarios where something happens that should not happen: a perfectly valid and contiguous file being recovered as an invalid file because dozens of false MP3 signatures (or a single false JPG signature) were randomly found within its data -- and in most cases the user has no way of knowing it (if the filesystem is damaged to the point that file signature search is the only possible recovery method).
Again, if the fingerprint is small (like three bytes), it is statistically bound to produce many false positives, unless there is some kind of integrity check to prevent that from happening.
I proposed potential solutions to circumvent that issue, which would seem relatively simple to implement (for instance, reject any JPG file that doesn't have dimensions information where it should be, or any MP3 file that doesn't have bitrate / sampling rate information, that sort of thing). At the very least, there should be a clear warning in the manual that this sort of thing can happen, and that if only a specific file type is needed, it's best to only check this file type and uncheck all others in "File options".
> It's not the sector number; read the documentation, section "11.11 PhotoRec: file name and date", to understand why there could be two files labeled identically.
Section 11.11 says: « The number is calculated by using the file location minus the partition offset divided by the block size. For some filesystems like NTFS, exFAT, ext2/3/4, this number may be identical to the original cluster/block number. »
Yet if I open in WinHex an NTFS volume on which I did a recovery with PhotoRec, and type a recovered file's number as a "sector number", WinHex displays the start of that file. Unless the file is located beyond 2TB, as I explained, in which case it does display the correct first sector if I add 2^32 = 4294967296 to the file's number. It would be an insane coincidence if the explanation were totally different.
And the sentence above does not explain why there could be two files labeled identically -- either that or I'm really low on blood sugar right now.
> This is completely pointless for a fingerprint reader. You ignore the fact that, from a fingerprinting viewpoint, you can't distinguish a ZIP file from the later (zipped!) Office file formats.
I'm wondering if you read what I actually wrote, or what you assumed I meant based on common requests from complete n00bs.
There is a list of file types, for the user to check / uncheck. That list is now very long. It might be more practical to have it organized by categories, that's all. If for instance a user specifically wants to recover AVI, MP4 and ZIP files, those lines are located quite far apart, and it's necessary to parse the whole list to check them. If there were, say, an "Archive files" group and an "Audio / video files" group, the selection would be more straightforward, without bothering with the other file types which are irrelevant in this situation; indeed many file types are irrelevant to most users, who don't even know what they are, and to avoid false positives as much as possible (see above) it's better to only check the file types that are expected on a given storage volume. A user who knows enough about file types to tinker with "File options" certainly knows that, yes, of course, any file can be put into a ZIP / RAR / 7Z archive, and can decide whether those types should be included based on what they remember about what was on the analysed storage volume, or what the friend / relative / client remembers.
> As the source code for TestDisk and PhotoRec is freely accessible, you can program your ideas and test them, but you might want to dive deeper into the matter first, to avoid wasting time on ideas that cannot be realized.
I'm not a programmer (see how I struggle to get a small script working!) but I think I have enough practical experience to see what could be improved, for the benefit of anyone using PhotoRec in the future and trusting it to recover as much useful data as possible in nearly hopeless situations.
