UTF-16/Unicode to byte array in JavaScript

Rushing to the solution (in case you need it now)

The following JavaScript builds a UTF-8 encoded byte array from a string. It also handles characters outside the Basic Multilingual Plane (such as newer Unicode emoji), which some of the other implementations that you may find online do not.

String.prototype.toByteArray=String.prototype.toByteArray||(function(e){for(var b=[],c=0,f=this.length;c<f;c++){var a=this.charCodeAt(c);if(55296<=a&&57343>=a&&c+1<f&&!(a&1024)){var d=this.charCodeAt(c+1);55296<=d&&57343>=d&&d&1024&&(a=65536+(a-55296<<10)+(d-56320),c++)}128>a?b.push(a):2048>a?b.push(192|a>>6,128|a&63):65536>a?(55296<=a&&57343>=a&&(a=e?65534:65533),b.push(224|a>>12,128|a>>6&63,128|a&63)):1114111<a?b.push(239,191,191^(e?1:2)):b.push(240|a>>18,128|a>>12&63,128|a>>6&63,128|a&63)}return b})

After you have run the previous line of JavaScript, the function is assigned to the String prototype. This means you can call the function on any string, like this:

var byteArray = "abc 👩🏾‍🏫".toByteArray(false);

The (optional) boolean argument is only relevant if your string may contain invalid encoding. In that case the invalid data is replaced by default with the U+FFFD replacement character (�). If you pass true as the argument, the U+FFFE noncharacter is used instead (explicitly marking the data as invalid).
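
To show what the two substitutes look like at the byte level, here is a small sketch. It does not use toByteArray itself. It assumes a global TextEncoder, which is not part of the original solution and is available in modern browsers and Node.js but possibly not in the older browsers a SharePoint 2013 page has to support. TextEncoder always substitutes U+FFFD for invalid data, and the source above shows that toByteArray pushes the same three bytes by default.

```javascript
// Assumption: a global TextEncoder exists (modern browsers, Node.js).
var encoder = new TextEncoder();

// "\uD83D" is a lone high surrogate, so it is invalid on its own.
// TextEncoder replaces it with U+FFFD, encoded in UTF-8 as EF BF BD.
var replaced = Array.from(encoder.encode("a\uD83D"));
console.log(replaced); // [97, 239, 191, 189]

// U+FFFE, the value toByteArray(true) uses instead, differs only in the last byte.
var nonCharacter = Array.from(encoder.encode("\uFFFE"));
console.log(nonCharacter); // [239, 191, 190]
```

So a consumer of the byte array can tell the two modes apart by the last byte of the substitute: 189 for U+FFFD and 190 for U+FFFE.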

What did I need this for

In SharePoint 2013 you can create files from JavaScript, but you need to supply the file content as a byte array. So if you would like to write a log file to a library, for instance, you first need to convert your string content to a byte array. The examples that I found online did not fully support UTF-16 and Unicode. So I tried to understand what was going on and built a solution that worked for me.

The following example function is an illustration of how to write a text file to a SharePoint document library (you can find plenty of other examples for this online):

// Before calling this function, ensure you have loaded SP.js and SP.Runtime.js (you can use Script on Demand for this)
function createSPFile(listUrl, fileName, byteArray, overwrite, successHandler, failHandler)
{
	var SP = window.SP; //I like to refer to the global scope explicitly.
	var ctx = SP.ClientContext.get_current();
	
	var web = ctx.get_web();
	ctx.load(web);
	ctx.executeQueryAsync(function afterLoadingRequirements() {
		var webUrl = web.get_serverRelativeUrl();
		var files = web.getList([webUrl, listUrl].join('/')).get_rootFolder().get_files();
		var info = new SP.FileCreationInformation();
		info.set_url([webUrl, listUrl, fileName].join('/'));
		info.set_overwrite(!!overwrite);
		
		var content = new SP.Base64EncodedByteArray();
		for(var i = 0, n = byteArray.length; i<n; i++)
		{
			content.append(byteArray[i]);
		}
		info.set_content(content);
		
		var file = files.add(info);
		ctx.load(file);
		ctx.executeQueryAsync(successHandler, failHandler);
	}, failHandler);
}

I’ve written the above function as an example. It works but I’ve omitted things to focus on the subject.

If you have included both functions (toByteArray and createSPFile), the following example shows how to create a hello_world.txt file in the Documents library:

createSPFile("Documents", "hello_world.txt", "\uFEFFHello World! \n 👋 \r\n 🇳🇱".toByteArray(), true, function(){alert('file has been created!');}, function(){alert('failed: ' + arguments[1].get_message())});

Some things to keep in mind when writing to a file:
– Start the file with a byte order mark. In the example above I use \uFEFF, which toByteArray encodes as the bytes EF BB BF, the UTF-8 byte order mark.
– Choose the type of newline you wish to use (Windows: \r\n or Unix: \n). It’s not uncommon to use Unix newlines in JavaScript strings, but you might prefer Windows newlines in your text file.
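
The two points above can be combined in a small helper that runs before the string is converted to bytes. This is only a sketch: prepareTextForFile and its useWindowsNewlines parameter are hypothetical names, not part of the original solution.

```javascript
// Hypothetical helper: normalize newlines and make sure the text starts with \uFEFF.
function prepareTextForFile(text, useWindowsNewlines) {
  var newline = useWindowsNewlines ? "\r\n" : "\n";

  // Match \r\n first so a Windows newline is not converted twice.
  var normalized = text.replace(/\r\n|\r|\n/g, newline);

  // Avoid a double byte order mark when the text already starts with one.
  return normalized.charAt(0) === "\uFEFF" ? normalized : "\uFEFF" + normalized;
}

var prepared = prepareTextForFile("Hello World!\nSecond line\r\nThird line", true);
console.log(JSON.stringify(prepared));
```

The result can then be passed on as prepared.toByteArray(), assuming the toByteArray function from this post has been loaded.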

The comprehensive implementation of the toByteArray function

The toByteArray implementation that I shared at the start of this post is a minified version (produced with Google’s Closure Compiler). I will also share the long version here, with comments, so you can check what is going on in my implementation.

String.prototype.toByteArray = String.prototype.toByteArray||function toByteArray(strict)
{
	var value = this;
	var result = [];

	//Some constant values, useful in the conversion calculations
	var bit5 = 1 << 4;  // 0001 0000
	var bit6 = 1 << 5;  // 0010 0000
	var bit7 = 1 << 6;  // 0100 0000
	var bit8 = 1 << 7;  // 1000 0000
	var bit9 = 1 << 8;  // etc
	var bit11 = 1 << 10;
	var bit12 = 1 << 11;
	var bit17 = 1 << 16;
	var last6bits = bit7 - 1; // 0011 1111

	var utf16SurrogateStart = (27 << 11);
	var utf16SurrogateEnd = (28 << 11) - 1;

	for (var i = 0, n = value.length; i < n; i++) {
		var charCode = value.charCodeAt(i);

		//UTF-16 makes use of surrogate pairs for code points above U+FFFF
		// These surrogate pairs have charcodes in a specific range.
		//  If we find two sibling charcodes that match this range, we merge these surrogate pair codes into one unicode charcode value.
		if (charCode >= utf16SurrogateStart && charCode <= utf16SurrogateEnd && i + 1 < n && !(charCode & bit11)) {
			// We retrieve the next charcode, so we can check if it is also in the range for a UTF-16 surrogate pair:
			var potentialPairValue = value.charCodeAt(i + 1);

			// Check if the next charcode also fits the expectation for a UTF-16 surrogate pair:
			if (potentialPairValue >= utf16SurrogateStart && potentialPairValue <= utf16SurrogateEnd && (potentialPairValue & bit11)) {
				// Combine the two values into one unicode value:
				charCode = bit17 + ((charCode - utf16SurrogateStart) << 10) + (potentialPairValue - (utf16SurrogateStart|bit11));

				//The charcode of the next iteration is included in the current unicode value
				// so the index needs to move up 1 extra position:
				i++;
			}
		}

		if (charCode < bit8) {
			//A byte will contain 8 bits. If the first bit of this byte is 1, it means the value of the character is distributed over multiple bytes.
			//For a value that occupies 7 bits or less, the byte can start with a 0, so it only needs to occupy one byte in the array:
			result.push(charCode);
		}
		else if (charCode < bit12) {
			// The value occupies 8 to 11 bits, so we need to distribute this value over multiple bytes that start with a 1.
			//  The first byte will also mark over how many bytes the value is distributed, the other bytes only need to confirm that they are part of the distribution.
			// The following shows how the bits of the charCode value are distributed over the bytes.
			//  The X stands for a bit from the charCode value, the 1 or 0 stands for a fixed value that indicates the type of distribution:
			//  110x xxxx   //110 means: start distribution over 2 bytes 
			//  10xx xxxx   //10 means: part of a distribution, but not the start.
			result.push(bit8 | bit7 | (charCode >> 6), //shift 6 to right to exclude the last 6 bits that are included in the second byte.
						bit8 | (charCode & last6bits)); //Only include the last 6 bits of charCode and set bit8 to 1.
		} else if (charCode < bit17) {
			// The value that occupies 12 to 16 bits is distributed over 3 bytes in the following way:
			// 1110 xxxx    //1110 means: start distribution over 3 bytes
			// 10xx xxxx
			// 10xx xxxx

			if (charCode >= utf16SurrogateStart && charCode <= utf16SurrogateEnd) {
				// This is a reserved range for UTF-16 surrogate pairs for code points above U+FFFF.
				//  Since the charCode is still in this range, it was not merged with a second pair value.
				// This character apparently was not part of a valid surrogate pair.

				// We need to indicate this problem in the output array and have a choice here.
				// If the strict argument has been set to true:
				//  use the U+FFFE (65534) "Noncharacter"-character (http://www.fileformat.info/info/unicode/char/fffe/index.htm)
				// By default (or if the strict argument has been set to false):
				//  use the U+FFFD (65533) replacement character (http://www.fileformat.info/info/unicode/char/fffd/index.htm)
				charCode = strict ? 65534 : 65533;
			}

			result.push(bit8 | bit7 | bit6 | (charCode >> 12), //Exclude the last 12 bits that are distributed over the next bytes
						bit8 | ((charCode >> 6) & last6bits),  //Exclude the last 6 bits, then only include the last remaining 6 bits (the bits before that were in the first byte)
						bit8 | (charCode & last6bits));        //Only include the last 6 bits, which were excluded from both previous bytes
		} else if (charCode > 1114111) {
			// the Unicode range stops at U+10FFFF, above this can't be encoded...
			// If in strict mode, mark as invalid character, if not in strict mode, mark as unknown/unrepresentable character:
			result.push(
				(bit9 - 1) ^ bit5,
				(bit9 - 1) ^ bit7,
				(bit9 - 1) ^ bit7 ^ (strict ? 1 : 2));
		} else {
			// If a character occupies more than 16 bits, the value is distributed over four bytes the following way:
			// 1111 0xxx    //1111 0 means: start distribution over 4 bytes
			// 10xx xxxx
			// 10xx xxxx
			// 10xx xxxx

			// I did not find JavaScript implementations at the time that support this range of characters,
			//  but popular emoji characters fall in this range, so supporting it is worthwhile.
			result.push(bit8 | bit7 | bit6 | bit5 | (charCode >> 18),
						bit8 | ((charCode >> 12) & last6bits),
						bit8 | ((charCode >> 6) & last6bits),
						bit8 | (charCode & last6bits));
		}
	}

	return result;
}
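
To make the arithmetic concrete, here is a small worked example that is separate from the implementation above. The helper names mergeSurrogates and encodeCodePoint are made up for this illustration, and encodeCodePoint assumes it only receives valid code points (no unpaired surrogates, nothing above U+10FFFF).

```javascript
// Merge a UTF-16 surrogate pair into one code point (same arithmetic as in toByteArray).
function mergeSurrogates(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// Encode one valid code point as UTF-8 bytes.
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    return [cp];                                  // 0xxx xxxx
  }
  if (cp < 0x800) {
    return [0xC0 | (cp >> 6),                     // 110x xxxx
            0x80 | (cp & 0x3F)];                  // 10xx xxxx
  }
  if (cp < 0x10000) {
    return [0xE0 | (cp >> 12),                    // 1110 xxxx
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)];
  }
  return [0xF0 | (cp >> 18),                      // 1111 0xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

var wave = "\uD83D\uDC4B"; // the waving hand emoji, stored as two UTF-16 code units
var codePoint = mergeSurrogates(wave.charCodeAt(0), wave.charCodeAt(1));

console.log(codePoint.toString(16));     // "1f44b"
console.log(encodeCodePoint(codePoint)); // [240, 159, 145, 139]
console.log(encodeCodePoint(0xE9));      // é: [195, 169]
console.log(encodeCodePoint(0x20AC));    // €: [226, 130, 172]
```

For the emoji, 0xD83D - 0xD800 = 0x3D, shifted left by 10 bits gives 0xF400, and 0xDC4B - 0xDC00 = 0x4B. Adding 0x10000 + 0xF400 + 0x4B gives 0x1F44B, which needs 17 bits and therefore ends up in the four-byte branch.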

What’s up with the optional boolean argument

The string input data can have invalid encoding in the following ways:
– The charcode is out of range even for Unicode.
– The charcode represents a part of a UTF-16 surrogate pair, but the other part of the pair is missing or defined incorrectly.
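
If you would rather detect the second case before converting, instead of relying on a substitute character in the output, a check along these lines could be used. It is a sketch only: findLoneSurrogates is a hypothetical helper and not part of toByteArray.

```javascript
// Hypothetical helper: return the indexes of all unpaired surrogate code units.
function findLoneSurrogates(text) {
  var positions = [];
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: only valid when directly followed by a low surrogate.
      var next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip the low surrogate
      } else {
        positions.push(i);
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      positions.push(i);
    }
  }
  return positions;
}

console.log(findLoneSurrogates("a\uD83Db"));     // [1]
console.log(findLoneSurrogates("\uD83D\uDC4B")); // [] (a valid pair)
```

An empty array means that toByteArray should never need a substitute character for that string.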

I’ve looked at a few other implementations to produce a byte array from a string (not only JavaScript implementations), and noticed that different libraries lead to different results in this scenario. It seems most common to just replace the invalid encoding with the U+FFFD replacement character. This basically means that there was data, but it has been lost while processing.

An alternative that seems just as valid is to use the U+FFFE “not a character” charcode in the byte array. If you pass true as an argument to the toByteArray function, this charcode will be used in the byte array for invalid encoding instead of U+FFFD.

I’m not sure about all the nuances between U+FFFE and U+FFFD, but it seemed relevant to leave U+FFFE in as an option. To me, U+FFFD seems more forgiving, common, and user-friendly (it is usually presented as a question mark: �), so I’ve chosen that value as the default.

TheSharePoint.nl

Author: Pieter-Bart van Splunter
