Deduplication can be a helpful technique when doing forensic investigations. The wealth of data that can be found on a computer or phone can get overwhelming and we often need a way to reduce the data we’re looking at.
Previously, I wrote a blog about false positives and how it’s important to embrace false positives in your forensic tools because if your tools aren’t getting any false positives, they’re likely missing important data.
I believe deduping can have a similar effect, since it can help ensure that important data doesn’t get missed. By looking for something in only one spot, we risk missing it if it’s no longer present or is malformed. The trade-off is that you typically end up dealing with larger datasets; that is the price of being thorough.
As I wrote in the blog above, I would rather sift through a hundred false positives than let one false negative slip through. I believe the same argument applies to duplicates—more is better when it ensures important evidence isn’t missed. Once the data is collected, we, as examiners, can determine what is relevant and what is not. Forensic tools can help us analyze that data, but it is important to note that there is no universally agreed-upon definition of deduplication in forensics, and that’s a good thing.
Deduplication in Your Analysis
Most examiners would agree that deduplication can be helpful, but I doubt most would agree on the best way to deduplicate data. This is simply because each investigation is different and each examiner has different goals when reducing the data they need to review; otherwise, each investigation would take six months to complete.
So, what are the most common ways to deduplicate data in forensic investigations? A few come to mind:
Hashing – Hashing has been a common technique in forensic analysis to identify known files or data since long before I started in forensics. It will continue to be used because it is relatively fast and can be automated. This is a common method to deduplicate data as well, especially when analyzing pictures or videos and understanding if two files are an exact match.
The downside to using hashing as a deduplication method in forensic analysis is that two copies of the same file or data could each be important to the investigation. A picture sitting on a user’s desktop has a different investigative value than the same picture found in the Windows install folder or in the user’s browser cache. All of them could be important, or only one of them, but if you deduplicate the data based on the hash alone, these factors don’t get accounted for.
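One way to get the speed of hash-based deduplication without throwing away location context is to group by hash rather than discard duplicates. A minimal sketch in Python (the file list here is purely illustrative):

```python
import hashlib
from collections import defaultdict

def sha256_of(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def group_by_hash(files):
    """Group (path, content) pairs by content hash.

    Instead of dropping duplicates outright, keep every path that shares
    a hash so the examiner can still weigh where each copy lived.
    """
    groups = defaultdict(list)
    for path, content in files:
        groups[sha256_of(content)].append(path)
    return groups

# Hypothetical evidence: the same bytes in two very different locations.
files = [
    ("C:/Users/alice/Desktop/photo.jpg", b"\xff\xd8fakejpegdata"),
    ("C:/Windows/Web/photo.jpg",         b"\xff\xd8fakejpegdata"),
    ("C:/Users/alice/notes.txt",         b"just some text"),
]
for digest, paths in group_by_hash(files).items():
    if len(paths) > 1:
        print(f"{digest[:12]}... found at {len(paths)} locations: {paths}")
```

The examiner still reviews one representative per hash, but the full list of locations survives for context.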
Looking at File Attributes – Another way to identify duplicate data is by the attributes of a file, such as its name or size. This can be helpful when the content of the file may not matter but its attributes do. It can also simply be a different way to filter or sort existing data to find outliers, which can often be quickly evaluated against large datasets to identify non-standard activity. For example, if a user logs into a system at 9:00AM and logs off at 5:00PM every Monday to Friday and then suddenly that same user logs in at 3:00AM on a Saturday, it can be quickly flagged and investigated as an outlier that doesn’t match expected behavior.
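The login-time example above can be sketched as a simple attribute filter. This is a toy rule, not any tool’s actual logic; the user, timestamps, and the 9:00–17:00 weekday window are all assumptions for illustration:

```python
from datetime import datetime

def is_outlier(ts: datetime) -> bool:
    """Flag events outside an assumed weekday 09:00-17:00 pattern."""
    weekend = ts.weekday() >= 5          # Saturday=5, Sunday=6
    off_hours = not (9 <= ts.hour < 17)  # outside 09:00-17:00
    return weekend or off_hours

# Hypothetical login events: (user, timestamp).
logins = [
    ("jdoe", datetime(2023, 5, 1, 9, 0)),   # Monday 09:00 - expected
    ("jdoe", datetime(2023, 5, 2, 9, 2)),   # Tuesday 09:02 - expected
    ("jdoe", datetime(2023, 5, 6, 3, 0)),   # Saturday 03:00 - outlier
]
outliers = [(u, t) for u, t in logins if is_outlier(t)]
print(outliers)  # only the 03:00 Saturday login remains
```

The point is that sorting or filtering on attributes lets a small number of anomalies surface from a large dataset without examining every record.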
Seeing Source/Location Information – Finally, you can also deduplicate data based on where it was found. This could be done by file path or, for greater accuracy, the source/location of the data (or a combination of the two). The source/location could be the offset from a given file, the cluster offset for a logical partition or volume, or the sector offset of a physical disk. The measurement you use isn’t really relevant as long as it’s consistent.
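Location-based deduplication could be sketched as keying each artifact on where it was recovered rather than on what it contains. The field names and evidence file below are illustrative, not any tool’s actual schema:

```python
def dedupe_by_location(artifacts):
    """Keep the first artifact seen at each (evidence, offset) pair.

    The key is the recovery location, so two different files are never
    merged, but the same bytes found twice at one spot are reported once.
    """
    seen = set()
    unique = []
    for art in artifacts:
        key = (art["evidence"], art["offset"])
        if key not in seen:
            seen.add(key)
            unique.append(art)
    return unique

# Hypothetical carving hits against a single disk image.
hits = [
    {"evidence": "disk1.E01", "offset": 0x4A000, "type": "jpg"},
    {"evidence": "disk1.E01", "offset": 0x4A000, "type": "jpg"},  # same spot, found twice
    {"evidence": "disk1.E01", "offset": 0x9C200, "type": "jpg"},
]
print(len(dedupe_by_location(hits)))  # 2
```

Whether the offset is measured from a file, a volume, or a physical disk doesn’t matter, as long as every artifact in the comparison uses the same measurement.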
Carving and Deduplication
Deduping based on the source isn’t very helpful if you’re only looking at allocated files, because every file will have a different source and nothing will appear duplicated. However, once you introduce carving into your analysis, it becomes helpful in situations where you may be looking at the same data more than once.
Every tool handles carving a little differently, both in the signatures it uses and in how carving is deployed. Some tools will only carve unallocated space, which is simple to implement because there’s rarely overlap between allocated and unallocated content. The downside is that you’ll miss a lot of valuable data and potential evidence: only carving unallocated space will skip common evidentiary locations such as file slack or RAM slack, signature mismatches in allocated files (e.g. a JPG renamed as a ZIP), and many other scenarios you may want to search.
Many tools allow the examiner to carve everything, regardless of whether it’s allocated or unallocated, or whether it resides in a recognized file system (there is a lot of value in carving raw data when you’re not sure what it is or what it may contain). This is how AXIOM uses carving: it uncovers files or fragments of data vital to an investigation and can be quite helpful in finding hidden data. However, this comes at the cost of performance and an increase in the number of results to be examined. To manage this increased amount of data, deduplication can be applied either manually by the examiner or in some automated fashion built into the tool itself.
How Your Forensic Tools Use Deduplication
By choosing to parse allocated files from the file system as well as carving data across the entire image, you risk having duplicate data returned by your forensic tool. The most common case is a tool reporting both the allocated file and the same file carved from a matching signature. If you choose to only carve unallocated space, this is less of a concern, but there are plenty of times when carving allocated data can uncover valuable evidence (carving data from file slack, deleted SQLite records, or a picture embedded within a document come to mind as three common examples).
It gets more complicated when there are many versions of the same file in the same or slightly different locations. For example, most JPG pictures contain one to three thumbnails (or more) within the picture, so if you carve within the JPG, you’ll get multiple thumbnails carved as well as the original JPG file. This gets multiplied for each copy of the file residing elsewhere on the disk: volume shadow copies, the thumbcache, the recycle bin, data moving from resident to non-resident, slack, and so on. The challenge is: do you display them all, or only some? You can’t dedupe these based on hash, since the thumbnails have different hash values than the original, and deduping on the source location may or may not work depending on how each was recovered (AXIOM dedupes thumbnails inside parsed pictures that have already been found, but each tool may handle this differently).
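One way a tool might suppress carved thumbnails is a containment check: if a carved hit’s byte range falls strictly inside a file that was already parsed, it is likely an embedded object such as an EXIF thumbnail. This is a simplified sketch under that assumption, not AXIOM’s actual logic:

```python
def contained_in_parsed(carved, parsed_files):
    """Return the name of the parsed file whose byte range contains the
    carved hit, or None. Offsets are absolute positions in one image."""
    c_start, c_end = carved
    for name, (p_start, p_end) in parsed_files.items():
        # Strictly inside the parent (not identical to it) suggests an
        # embedded object such as a thumbnail, rather than a re-carve
        # of the whole file.
        if p_start <= c_start and c_end <= p_end and (c_start, c_end) != (p_start, p_end):
            return name
    return None

# Hypothetical offsets: one parsed picture, two carved hits.
parsed = {"photo.jpg": (1_000_000, 1_250_000)}
thumb = (1_000_512, 1_010_000)   # carved thumbnail inside photo.jpg
stray = (5_000_000, 5_004_096)   # carved hit out in unallocated space
print(contained_in_parsed(thumb, parsed))  # photo.jpg
print(contained_in_parsed(stray, parsed))  # None
```

A real implementation would also have to handle fragmented files, whose byte ranges are not contiguous.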
Another example is deleted files. Finding an $MFT record or inode for a deleted file is quite common, and most tools will report it in the file system explorer as deleted without the need to carve for it. However, whether the content of that file is still available or has been overwritten by another file is important and not as easily determined. Most tools will only check whether the first cluster has been overwritten, and carving for a given signature, such as a JPG picture, may also recover that same file’s content from unallocated space. The source locations for each of these would be different (one is the $MFT record, while the other is unallocated space), but they are in fact the same file. Deduping based on hashing would likely fail here as well, since it’s nearly impossible to carve a file that exactly matches the original unless very specific headers and footers identify the start and end of the file.
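The “first cluster overwritten” check mentioned above amounts to looking up the deleted file’s first cluster in the volume’s allocation bitmap. A minimal sketch, assuming an NTFS-style bitmap with one bit per cluster, least-significant bit first (the tiny bitmap below is made up for illustration):

```python
def first_cluster_reallocated(first_cluster: int, bitmap: bytes) -> bool:
    """Check whether a deleted file's first cluster is marked in-use in a
    volume allocation bitmap (one bit per cluster, LSB-first, as in the
    NTFS $Bitmap file). If it is, the original content was likely
    overwritten by another file."""
    byte_index, bit_index = divmod(first_cluster, 8)
    return bool(bitmap[byte_index] & (1 << bit_index))

# Hypothetical 16-cluster volume: clusters 0-3 and 9 are allocated.
bitmap = bytes([0b00001111, 0b00000010])
print(first_cluster_reallocated(9, bitmap))  # True  - content likely gone
print(first_cluster_reallocated(5, bitmap))  # False - content may survive
```

Note this only answers the question for the first cluster; the rest of a fragmented file could still be overwritten even when this check passes.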
This blog captures only a few examples to illustrate why you could get so many duplicate pictures or files and why it’s important to recognize that you may not always want to deduplicate everything. There are many different ways to deduplicate your data, but the three listed above are quite common and each one has its pros and cons depending on your situation or investigation. You could deduplicate data based on any criteria you wish, as long as you as the examiner understand how that impacts the data that gets presented to you (just like any other filter).
As always, if you have any questions or comments, please feel free to contact me: email@example.com