Understanding encoding to avoid accidental data loss with Export-CSV and other output cmdlets
This post explores how a nuance with PowerShell encoding can lead to data loss in output files. I’ll explain the problem, provide some background on encoding to show why this happens and then demonstrate how to avoid it.
Context and Problem Demonstration
For a bit of context, I came across this while working on a data restoration script for SharePoint Online where the intent was to restore about 30,000 files. For those of you that have experience with bulk operations in SharePoint, using PowerShell or other third-party applications, you will know that if you submit a lot of requests at a certain point Microsoft will throttle you. This throttling introduces a delay in the execution of your next request and can get pretty long which means your task takes an age to complete. To avoid this, I wanted to split my bulk restore operations into manageable batches that I’d then kick off at a later date, hence I exported the data I needed to CSV so I could import it back again when processing a batch. However, this is where I found my problem with certain special characters…
To demonstrate the problem, we'll use some test data. I’ll create an array, populate it with a couple of entries, each containing a name and an hourly rate, then export it to CSV, nothing too complicated.
Now we’ll take a look at the produced output which is shown below; not quite what we were hoping for! As you can see both the pound symbol (£) and our “a” with a circumflex (â) have been replaced by a “?” in the CSV file. Consequently, our export has lost data with no way to recover what the original characters were. N.B. This also occurs with other special characters.
So why does this happen?
Well Export-CSV encodes your data as part of the output process and by default it uses a character encoding called UTF8-NoBOM. Using this encoding there aren’t characters that represent these special characters and therefore they are substituted for a question mark.
How do we get around this?
Thankfully, we can specify the encoding we want to use as a parameter of our cmdlet. If we instead use something called UTF8 with BOM (which is confusingly just “UTF8” in PowerShell 5.0+) this will export our characters correctly.
N.B. If for some reason you’re running below PowerShell 5.0 the encoding needs to be declared as “utf8BOM” which in my opinion is much clearer!
I’m curious, tell me more about encoding and BOMs!
I will say that while encoding with a BOM solves this problem beware that it can cause issues. Microsoft state: “For best overall compatibility, avoid using BOMs in UTF-8 files. Unix platforms and Unix-heritage utilities also used on Windows Platforms don't support BOM” (source).
What’s a BOM I hear you ask? Well, I can’t say I had a clue before researching it as part of this problem, but it stands for byte-order-mark and is used as a file signature to identify the encoding of a file. This means a program, such as Notepad, can open it with the correct interpretation of the file rather than having to take a guess, which may be incorrect.
How can we check what encoding has been used?
The easiest way, at least for a CSV, is to open the file in Notepad, select save-as and see what the encoding is listed as. Here we can see it is listed as “UTF-8” which while this isn’t explicit, if you click the dropdown, you’ll see there is also a “UTF-8 with BOM” option meaning we can deduce this file has no BOM.
What does a BOM actually look like?
If we look up the hexadecimal representation of the UTF-8 BOM we can see that is “EF- BB-BF”. As I’m an inquisitive person I went about checking that it was present for my own CSV file. The code below imports our CSV encoded in UTF8 with BOM as bytes, then uses a bit converter to convert into Hex.
And here's the output, Voila! The first 3 Hex characters do indeed match the UTF-8 BOM.
Setting parameters permanently
If you want to set and forget, you can set the default encoding for either a single cmdlet or for all cmdlets as shown below.
Single cmdlet, in this case Export-CSV:
$PSDefaultParameterValues['Export-CSV:Encoding'] = 'utf8'
All cmdlets with the Encoding parameter:
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
In all honesty, I'm not sure how I hadn't stumbled across this problem sooner given how common exporting and importing data is. As warned earlier, do go careful when using BOMs and ensure you plan your data journey to ensure that your output files will be usable from start to finish. Otherwise, I hope this post has been insightful and helps you to avoid the same trouble!