
In the previous article in this series, I showed you how to use the Storage Client Library to perform many of the operations needed to manage files in blob storage, such as upload, download, copy, delete, list, and rename. The CloudBlockBlob.UploadFile method works fine as-is, but it can be tuned for special cases such as very slow internet access.

When I worked for a startup, one of the things our desktop product did was upload a bunch of images and an MP3 file to Azure blob storage. The MP3 could be as large as 20 MB. Many of our customers lived in areas with broadband upload speeds of 1.0 Mbps on a good day. When we tested UploadFile on a 20 MB file over a connection that slow, we found the upload would time out and eventually fail. It just couldn't send up enough bytes and get a handshake back quickly enough to be successful.

In order to make our product work for all customers, we changed the upload to send the file up in blocks. The customer could set the block size. If the customer had pretty good internet speed (5 Mbps or higher), they might set the block size as high as 1 MB. If they had pretty bad internet speed (1 Mbps or lower), they could set the block size as low as 256 KB. That's small enough for a block to be uploaded and the handshake completed before the connection times out, at which point it can start on the next block.

In this article, I'm going to discuss two ways to upload a file in blocks. One way is to use the properties that control how the UploadFile method on the CloudBlockBlob object behaves. The other way is to programmatically break the file into blocks and upload them one by one, then ask Azure to reassemble them.

Let’s start with using the built-in functions for uploading a file. I messed around with this a bit back in 2010-2011, but the properties as used back then are obsolete, and/or have been moved to different objects of the Storage Client Library since then. Bing-ing “SingleBlobUploadThresholdInBytes” only returned 8 articles. (Think about that. What have you searched for lately that only returned 8 results?) Most of the articles were from 2010-2011; the others were from MSDN, which offered a useful explanation like this: “This is the threshold in bytes for a single blob upload”. Wow, incredibly helpful.

I managed to track down someone on the Azure Storage team at Microsoft to help me understand this, so at the time of this writing, I think only three people in the world know how to use this correctly – me, the guy at Microsoft who owns it, and one of the other Azure MVPs. So after you read this, you will be part of a very elite group.

There are three properties directly involved.

SingleBlobUploadThresholdInBytes

This is the threshold in bytes for a single blob upload. (Haha! Kidding!) This setting determines whether the blob will be uploaded in one shot (Put Blob) or multiple requests (Put Block). It does not determine the block size. It basically says “if the file is smaller than this size, upload it as one block. If the file size is larger than this value, break it into blocks and upload it.”

The minimum value for this is 1 MB (1024 * 1024), which means you cannot use this to chunk files that are smaller than 1 MB. ParallelOperationThreadCount must be equal to 1 (more on that below). Also, this works with the Upload* APIs (such as UploadFile), but not with blob streams. If you use OpenWrite to get a stream and write to it, the data will always be uploaded behind the scenes using Put Block calls.

This property is found in the BlobRequestOptions class. To use it, create a BlobRequestOptions object and then assign it to the CloudBlobClient’s DefaultRequestOptions property.
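
To give you an idea, here's a minimal sketch; the account variable is assumed to be a CloudStorageAccount you've already created:

    // Create the request options and make them the client's defaults.
    CloudBlobClient client = account.CreateCloudBlobClient();
    client.DefaultRequestOptions = new BlobRequestOptions()
    {
        SingleBlobUploadThresholdInBytes = 1024 * 1024 // 1 MB, the minimum allowed
    };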

StreamWriteSizeInBytes

This sets the size of the blocks to use when the upload is broken into blocks (Put Block calls) because the file is larger than the value of SingleBlobUploadThresholdInBytes.

By default, this is 4 MB (4 * 1024 * 1024).

This is a property on the CloudBlockBlob object or CloudPageBlob object, whichever you are using. You can use this when streaming files up to Azure as well (like when you're using UploadFromStream instead of UploadFile).

ParallelOperationThreadCount

This specifies how many parallel Put Block or Put Page operations should be pending at a time.

If this is set to anything but 1, SingleBlobUploadThresholdInBytes will be ignored. After all, if you ask the file to be sent up in multiple threads, there’s no way to do that but to send it up in blocks, right?

This is a property of the BlobRequestOptions object.

All together now

So for example, if you use these values:

  • ParallelOperationThreadCount = 1
  • StreamWriteSizeInBytes = 256 * 1024 // 256 KB
  • SingleBlobUploadThresholdInBytes = 1024 * 1024 // 1 MB

and call blob.UploadFile: if the file is smaller than 1 MB, it will be uploaded with a single Put Blob. If the file is larger than 1 MB, it will be split into 256 KB blocks, and the blocks will be sent up as multiple requests.

You might also consider changing the default Retry Policy. If you're chunking the file because you think the client will have problems uploading it over a poor internet connection, you might want to set it to retry only once, or not at all. Otherwise it may time out, wait X seconds, time out again, and so on, when it is never going to succeed. For this reason, I'm only having it retry once in the code below.

Uploading a file using the .NET Storage SDK
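
Here's a sketch of that code against the classic Microsoft.WindowsAzure.Storage SDK; the connection string, container name, and file path are placeholders for your own values.

    using System;
    using System.IO;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;
    using Microsoft.WindowsAzure.Storage.RetryPolicies;

    // Placeholders -- substitute your own values.
    string connectionString = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...";
    string filePath = @"C:\temp\bigfile.mp3";

    CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);

    // Set up the request options first...
    BlobRequestOptions options = new BlobRequestOptions()
    {
        SingleBlobUploadThresholdInBytes = 1024 * 1024, // 1 MB: chunk anything bigger
        ParallelOperationThreadCount = 1,               // must be 1 for the threshold to apply
        // Retry once after 3 seconds; on a bad connection, endless retries
        // would just keep timing out.
        RetryPolicy = new LinearRetry(TimeSpan.FromSeconds(3), 1)
    };

    // ...then make them the client's defaults.
    CloudBlobClient client = account.CreateCloudBlobClient();
    client.DefaultRequestOptions = options;

    CloudBlobContainer container = client.GetContainerReference("uploads");
    container.CreateIfNotExists();
    CloudBlockBlob blob = container.GetBlockBlobReference(Path.GetFileName(filePath));
    blob.StreamWriteSizeInBytes = 256 * 1024; // 256 KB blocks

    // Depending on your SDK version, this method may be UploadFile(path),
    // UploadFromFile(path), or UploadFromFile(path, FileMode.Open).
    blob.UploadFromFile(filePath, FileMode.Open);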

In the code above, you can see that I create a BlobRequestOptions object and set SingleBlobUploadThresholdInBytes, ParallelOperationThreadCount, and RetryPolicy on it. Then, after instantiating the CloudBlobClient, I set its DefaultRequestOptions to my BlobRequestOptions object. After getting a reference to the blob, I set the StreamWriteSizeInBytes. Then I upload the file.

If I turn Fiddler on and use the code above to upload a 5 MB file, I see multiple requests, one for each block. These calls are made consecutively because they are all running in a single thread (ParallelOperationThreadCount = 1).


Figure 1: Fiddler View

And if I look at any one line, I can see the size of the request. For all but the last two, the request size is the same as StreamWriteSizeInBytes; the last two send up the remainder of the file.


Figure 2: Fiddler Details

Upload a file in blocks programmatically

If you can set a couple of properties and upload a file in blocks easily, why would you want to do it programmatically? The case that immediately comes to mind is if you have files that are smaller than 1 MB and you want to send them up in 256 KB blocks. The minimum value for SingleBlobUploadThresholdInBytes is 1 MB, so you cannot use the method above.

Another case is if you want to let the user pause the upload process, then come back later and restart it. I’ll talk about this after the code for uploading a file in blocks.

To programmatically upload a file in blocks, you first open a file stream for the file. Then you repeatedly read a block of the file, set a block ID, calculate the MD5 hash of the block, and write the block to blob storage, keeping a list of the block IDs as you go. When you're done, you call PutBlockList and pass it the list of block IDs. Azure will put the blocks together in the order specified in the list, and then commit them. If you get the block list out of order, or you don't put all of the blocks before committing the list, your file will be corrupted.

The block IDs must all be the same length for all of the blocks, or your upload/commit will fail. I usually just number them from 1 to whatever, using a block ID that is formatted as a 7-character string, so for 1, I'll get "0000001". Note that block IDs have to be base64 strings.
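
For example, here's a quick sketch of building one ID (blockNumber is a hypothetical counter from your read loop):

    // blockNumber 1 becomes "0000001", which is then base64-encoded.
    string blockId = Convert.ToBase64String(
        System.Text.Encoding.UTF8.GetBytes(blockNumber.ToString("0000000")));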

Here’s the code for uploading a file in blocks. I’ve put comments in to explain what’s going on.
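
The listing below is a sketch of that approach, assuming blob is a CloudBlockBlob reference and filePath is the local file, set up as in the earlier example:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;
    using Microsoft.WindowsAzure.Storage.Blob;

    int blockSize = 256 * 1024;          // 256 KB per block
    var blockIDs = new List<string>();   // Azure commits the blocks in this list's order

    using (FileStream fileStream = File.OpenRead(filePath))
    using (MD5 md5 = MD5.Create())
    {
        byte[] buffer = new byte[blockSize];
        long bytesLeft = fileStream.Length;
        int blockNumber = 0;
        while (bytesLeft > 0)
        {
            // The last block is usually smaller than blockSize.
            int bytesToRead = (int)Math.Min(blockSize, bytesLeft);
            int bytesRead = fileStream.Read(buffer, 0, bytesToRead);

            // Block IDs must be base64 strings, all the same length.
            string blockId = Convert.ToBase64String(
                Encoding.UTF8.GetBytes(blockNumber.ToString("0000000")));
            blockIDs.Add(blockId);

            // Hash just this block so Azure can verify it on arrival.
            string blockHash = Convert.ToBase64String(md5.ComputeHash(buffer, 0, bytesRead));

            // Put Block: upload this one block.
            using (var blockData = new MemoryStream(buffer, 0, bytesRead))
            {
                blob.PutBlock(blockId, blockData, blockHash);
            }

            bytesLeft -= bytesRead;
            blockNumber++;
        }
    }

    // Put Block List: tell Azure the order to assemble the blocks in, and commit.
    blob.PutBlockList(blockIDs);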

You can actually split the file up and upload it in multiple parallel threads. For my use case (a customer with insufficient internet speed), that wouldn't make sense. If he can't upload chunks bigger than 256 KB, then he can't upload 2 or 3 or 4 of those at the same time. But if you have decent upload speed, you could definitely upload multiple blocks in parallel.

What if you want to give the customer the ability to start an upload, stop it, and resume it later? Say the customer is uploading a file with your application, and he hits pause and goes off to do something else for a while. When he hits pause, you just stop uploading the file. When he comes back and asks to resume the upload, call to get a list of the uncommitted blocks that have been uploaded, and put each blockListItem.Name in a List<string>. Then start reading the file from the beginning. Read each block in and create the block ID the same way you created it before. Add it to the list of block IDs that you are going to use to commit all the blocks at the end. Then see if the block ID is in the list of uncommitted blocks. If it is, remove it from the list of uncommitted blocks; you've found it and won't find it again, so why bother leaving it in the search list? If the block ID is not in the list of uncommitted blocks, call PutBlock to write the block to Blob Storage.

After reading the whole file and putting all of the missing blocks, call PutBlockList with the list of block IDs to commit the file.
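
Here's a sketch of that resume logic, assuming blob, filePath, and blockSize are set up the same way as in the upload listing above:

    // Get the names of the uncommitted blocks already sitting in blob storage.
    var uncommitted = new List<string>();
    foreach (ListBlockItem blockListItem in
             blob.DownloadBlockList(BlockListingFilter.Uncommitted))
    {
        uncommitted.Add(blockListItem.Name);
    }

    var blockIDs = new List<string>();
    using (FileStream fileStream = File.OpenRead(filePath))
    using (MD5 md5 = MD5.Create())
    {
        byte[] buffer = new byte[blockSize];
        long bytesLeft = fileStream.Length;
        int blockNumber = 0;
        while (bytesLeft > 0)
        {
            int bytesToRead = (int)Math.Min(blockSize, bytesLeft);
            int bytesRead = fileStream.Read(buffer, 0, bytesToRead);

            // Create the block ID exactly the same way the first attempt did.
            string blockId = Convert.ToBase64String(
                Encoding.UTF8.GetBytes(blockNumber.ToString("0000000")));
            blockIDs.Add(blockId);

            if (uncommitted.Contains(blockId))
            {
                // Already uploaded last time; we won't see it again,
                // so shrink the search list.
                uncommitted.Remove(blockId);
            }
            else
            {
                // Not there yet; upload this block.
                string blockHash = Convert.ToBase64String(md5.ComputeHash(buffer, 0, bytesRead));
                using (var blockData = new MemoryStream(buffer, 0, bytesRead))
                {
                    blob.PutBlock(blockId, blockData, blockHash);
                }
            }

            bytesLeft -= bytesRead;
            blockNumber++;
        }
    }

    // Commit everything: the blocks found and the blocks just uploaded.
    blob.PutBlockList(blockIDs);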

This is pretty close to the same code as above, except it calls to get the list of uncommitted blocks, and checks whether each block has already been uploaded before writing it.

Instead of requesting the list of uncommitted blocks from blob storage, you could keep track of the list on your own and store it somewhere on the customer's computer. I'd rather query blob storage; it feels safer somehow, because the list can't be accessed by the customer. (It is, after all, his computer.)

Another consideration is whether the file the customer is uploading can be changed between the time he starts the upload and the time it finishes. When I used this upload method, I was taking a bunch of images and an MP3 file, creating a zip file with a unique name, and uploading the zip file. The customer could find the zip file on the computer and mess with it, but that was extremely unlikely. Also, if the customer created another zip file, it would be queued after the first one, and would start uploading after the first upload finished.

You can upload some blocks, wait a couple of days, upload some more blocks, wait another couple of days, etc. Uncommitted blocks will be cleared automatically after a week unless you add more blocks to the same blob or commit the blocks for the blob. Here's the code you can use to retrieve the list of blocks; the print statement shows you the members you can access for each block, and you can see the blockListItem.Name and the property telling you whether it's a committed block.
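
Something like this, assuming blob is the CloudBlockBlob you're checking:

    // List every block, committed or not, and print what we know about each one.
    foreach (ListBlockItem blockListItem in blob.DownloadBlockList(BlockListingFilter.All))
    {
        Console.WriteLine("Name = {0}, Length = {1}, Committed = {2}",
            blockListItem.Name, blockListItem.Length, blockListItem.Committed);
    }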

To test your code, you can run the regular upload in debug and stop it when it gets past a handful of blocks, then run the routine that checks for the block status, uploads the rest of the blocks, and commits all of the blocks.

One thing to note: you could add the code that gets the blocks and checks whether they are committed directly to your Upload routine. I chose not to do this, but to use an almost-identical copy with those code bits added in, because retrieving the list of blocks takes a small performance hit, and I only want to incur it when I know there is a possibility that the upload has been stopped and needs to be restarted.

Summary

In this article, I showed you how to go where few men/women have gone before by using the properties of the CloudBlobClient and CloudBlockBlob to let Azure do the hard work of uploading a file in blocks for you. I also showed you how to do that programmatically yourself, in case you want to stop in the middle and continue later. In the next post in this series, I will show you how to use the REST API directly to access blob storage, including running the REST calls in PowerShell. (Oooooh, aaaaah.)