Digests with Constant Memory Usage
Many SaaS companies let their users upload files into their products. Once the files are uploaded, they usually need to be processed in some way. During processing, it's a ubiquitous requirement to compute hash values of the files' content using some sort of digest algorithm (e.g., MD5, SHA-1, SHA-2, …). All digest algorithms take an input of arbitrary length and return a so-called hash value of fixed length (e.g., 256 bits). The hash value is computed from the input in an irreversible way, meaning you can apply the algorithm to calculate the hash value, but you cannot go in the other direction (i.e., you cannot recover the original input from the hash value).
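To make the fixed-length property concrete, here is a small illustration using Ruby's standard digest library (the same one the examples below use): no matter how long the input is, SHA-256 always returns a 256-bit digest, i.e., 64 hex characters.

```ruby
require 'digest'

# Inputs of very different lengths both yield a
# 256-bit (64 hex character) SHA-256 digest.
short = Digest::SHA256.hexdigest('a')
long  = Digest::SHA256.hexdigest('a' * 1_000_000)

puts short.length # => 64
puts long.length  # => 64
```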
Hash values are designed to be collision-resistant: for any two distinct inputs, the probability of computing the same hash value is very low. But if the two inputs are identical, the calculated hash values are guaranteed to be equal as well. Digests have many applications; to give you one example, they are commonly used for duplicate detection. You can quickly check whether a given file has been uploaded before by comparing the hash value of the new file to the hash values of all the old files, which you would presumably keep in your database.
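A minimal sketch of such a duplicate check could look like the following; note that the method name and the in-memory Set are my own simplifications — in a real application the known hashes would live in your database.

```ruby
require 'digest'
require 'set'

# Hypothetical helper: remembers the digests of everything
# seen so far and reports whether content is a duplicate.
def duplicate?(content, known_hashes)
  digest = Digest::SHA256.hexdigest(content)
  return true if known_hashes.include?(digest)
  known_hashes.add(digest)
  false
end

known_hashes = Set.new
puts duplicate?('hello world', known_hashes) # => false (first upload)
puts duplicate?('hello world', known_hashes) # => true  (same content again)
```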
The point of this article is that you have to be careful when computing the hash value of user-provided files because generally, you can’t make any assumptions about the size of these files. Most resources on the Internet suggest something like this:
require 'digest'

path_to_file = '/tmp/file'
file_content = File.read(path_to_file)
sha2 = Digest::SHA2.new.update(file_content)
puts sha2.hexdigest # Result: SHA2 of the file's content
This code has a non-obvious problem:
File.read(path_to_file) loads the file's entire content into memory. This is no problem if the uploaded file is small enough to fit into memory, but for large files it can lead to out-of-memory errors. Remember, we can't make any assumptions about the uploaded file's size. We don't know whether the file is in the kilobyte or terabyte range. Kilobytes would be no problem, but terabytes would almost certainly crash your server.
The only solution is, of course, to avoid reading the whole file at once. Instead, we have to read the file incrementally and feed it to the digest algorithm in smaller chunks. Fortunately, Ruby's Digest#update is designed for exactly this use case. Observe:
irb(main):001:0> require 'digest'
=> true
irb(main):002:0> sha2 = Digest::SHA2.new
=> #<Digest::SHA2:256 e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>
irb(main):003:0> sha2.update('first piece').hexdigest
=> "13562d94055d9f7cb35a5ed89a6e750463074fae10f2a37f1d3e07ee4595c657"
irb(main):004:0> sha2.update('second piece').hexdigest
=> "42b5016a302b7947c1273552f8c1062034f6770bc572eb4db6e5131e766b1dad"
As you can see, the hash value changes after each call of Digest#update, so it's no problem to feed the input to the digest algorithm incrementally. Putting all of this together:
require 'digest'

path_to_file = '/tmp/file'
sha2 = Digest::SHA2.new
DIGEST_BUFFER_SIZE = 64 * 1024 # 64 KB

File.open(path_to_file) do |f|
  while buffer = f.read(DIGEST_BUFFER_SIZE)
    sha2.update(buffer)
  end
end

puts sha2.hexdigest # Result: SHA2 of the file's content
With this code, we finally have a safe way to calculate hash values for user-provided files. It never reads more than 64 KB at once and thus keeps memory consumption under control.
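As a quick sanity check, feeding the input in chunks yields exactly the same digest as hashing everything in one call, because the digest state simply absorbs bytes in order. Ruby's standard library also offers a convenience class method, Digest::SHA2.file, which streams the file internally and is therefore equally safe for large files. A small sketch (the Tempfile is just a stand-in for a real uploaded file):

```ruby
require 'digest'
require 'tempfile'

# Chunked updates produce the same digest as a single update.
content  = 'first piecesecond piece'
one_shot = Digest::SHA2.new.update(content).hexdigest
chunked  = Digest::SHA2.new.update('first piece').update('second piece').hexdigest
puts one_shot == chunked # => true

# Digest::SHA2.file streams the file's content internally.
Tempfile.create('digest-demo') do |f|
  f.write(content)
  f.flush
  puts Digest::SHA2.file(f.path).hexdigest == one_shot # => true
end
```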
You would be surprised how often we, as a web development consultancy, see errors like this in the wild.
Software: MRI Ruby 2.6.2-p247.