Counting the byte size of a file encoded in ISO 8859-7 in JavaScript

Question

Background

I am writing an esoteric language called Jolf. It is used on the lovely site codegolf SE. If you don't already know, a lot of challenges are scored in bytes. People have made lots of languages that utilize either their own encoding or a pre-existing encoding.

On the interpreter for my language, I have a byte counter. As you might expect, it counts the number of bytes in the code. Until now, I've been using a UTF-8 en/decoder (utf8.js). I am now using the ISO 8859-7 encoding, which has Greek characters. The text upload doesn't actually work either. I need to count the actual bytes contained within an uploaded file. Also, is there a way to read the contents of said encoded file?

Question

Given a file encoded in ISO 8859-7 obtained from an <input> element on the page, is there any way to obtain the number of bytes contained in that file? And, given "plaintext" (i.e. text put directly into a <textarea>), how might I count the bytes in that as if it were encoded in ISO 8859-7?

What I've tried

The input element is called isogreek. The file resides in the <input> element. The content is ΦX族: a Greek character and a Latin character (each of which should be one byte), followed by a Chinese character, which should be more than one byte (?).

isogreek.files[0].size;      // is 3; should be more.

var reader = new FileReader();
reader.readAsBinaryString(isogreek.files[0]);      // corrupts the string to `ÖX?`
reader.readAsText(isogreek.files[0]);              // �X?
reader.readAsText(isogreek.files[0],"ISO 8859-7"); // �X?

If this is a file you receive serverside, why not just look at the size of the file? — pvg
How is the file actually encoded? How do you get a Chinese char into an 8-bit, single-byte encoding? Something here doesn't make sense. — pvg
@pvg I saved that text in an ISO-8859-7 encoded file. It probably was dropped, or something. The point being, that character can appear in plaintext that needs to be counted as if it were encoded in 8859-7. — Conor O'Brien
Right, but 族 has no meaningful 8859-7 encoding. And since it's a 1-char, 1-byte encoding, as far as I can tell, string or file length should equal char length in that encoding. — pvg

ETHproductions · Accepted Answer · 2016-01-14T22:17:06

(Extended from this comment.)

As @pvg mentioned in the comments, the string resulting from readAsBinaryString would be correct, but is corrupted for two reasons:

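A. The result is encoded in ISO-8859-1: readAsBinaryString turns each byte into the character with that same code, so the Greek byte 0xD6 (Φ) shows up as Ö. You can use a function to fix this: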

function convertFrom1to7(text) {
  // charset is the set of chars in the ISO-8859-7 encoding from 0xA0 and up, encoded with this format:
  // - If the character is in the same position as in ISO-8859-1/Unicode, use a "!".
  // - If the character is a Greek char with 720 subtracted from its char code, use a ".".
  // - Otherwise, use \uXXXX format.
  var charset = "!\u2018\u2019!\u20AC\u20AF!!!!.!!!!\u2015!!!!...!...!.!....................!............................................!";
  var newtext = "", newchar = "";
  for (var i = 0; i < text.length; i++) {
    var char = text[i];
    newchar = char;
    if (char.charCodeAt(0) >= 160) {
      newchar = charset[char.charCodeAt(0) - 160];
      if (newchar === "!") newchar = char;
      if (newchar === ".") newchar = String.fromCharCode(char.charCodeAt(0) + 720);
    }
    newtext += newchar;
  }
  return newtext;
}
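As a rough sketch of how the byte count and the decoded text could be obtained together, assuming the file is read as raw bytes (for example with FileReader's readAsArrayBuffer in the browser) instead of as a binary string. The name decodeGreekBuffer and the choice to map unassigned bytes to U+FFFD are assumptions of this sketch, not part of the function above:

```javascript
// Sketch: decode ISO-8859-7 bytes held in an ArrayBuffer and report the byte count.
// decodeGreekBuffer is a hypothetical helper name.
function decodeGreekBuffer(buffer) {
  var bytes = new Uint8Array(buffer);
  // Bytes >= 0xA0 whose code point is neither the byte itself nor byte + 720.
  var special = { 0xA1: 0x2018, 0xA2: 0x2019, 0xA4: 0x20AC, 0xA5: 0x20AF, 0xAF: 0x2015 };
  // Bytes >= 0xA0 that keep their ISO-8859-1 meaning.
  var same = [0xA0, 0xA3, 0xA6, 0xA7, 0xA8, 0xA9, 0xAB, 0xAC, 0xAD,
              0xB0, 0xB1, 0xB2, 0xB3, 0xB7, 0xBB, 0xBD];
  var text = "";
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    if (b < 0xA0 || same.indexOf(b) !== -1) {
      text += String.fromCharCode(b);
    } else if (special[b] !== undefined) {
      text += String.fromCharCode(special[b]);
    } else if (b === 0xAE || b === 0xD2 || b === 0xFF) {
      text += "\uFFFD"; // assumption: unassigned bytes become the replacement character
    } else {
      text += String.fromCharCode(b + 720); // Greek block: byte + 720
    }
  }
  // One byte per character, so the byte count is just the buffer length.
  return { byteCount: bytes.length, text: text };
}

// The three bytes that showed up as "ÖX?" in the question.
var result = decodeGreekBuffer(new Uint8Array([0xD6, 0x58, 0x3F]).buffer);
```

Because ISO-8859-7 is a single-byte encoding, byteCount here always equals the length of the decoded text.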

B. The Chinese character isn't a part of the ISO-8859-7 charset (because the charset supports up to 256 unique chars, as the table shows). If you want to include arbitrary Unicode characters in a program, you will probably need to do one of these two things:

Count the bytes of that program in e.g. UTF-8 or UTF-16. This can be done pretty easily with the library you linked. However, if you want this to be done automatically, you'll need a function that checks if the content of the textarea is a valid ISO-8859-7 file, like this:

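function isValidISO_8859_7(text) {
  // Matches a single character from the ISO-8859-7 repertoire. U+03A2 is skipped
  // because the encoding leaves that position (byte 0xD2) unassigned.
  var charset = /[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03A1\u03A3-\u03CE]/;
  var valid = true;
  for (var i = 0; i < text.length; i++) {
    // Stays true only while every character so far is in the charset.
    valid = valid && charset.test(text[i]);
  }
  return valid;
}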
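A minimal sketch of the automatic counting described here, assuming the counter should fall back to UTF-8 whenever the text does not fit in ISO-8859-7. The name countBytes is hypothetical, and the built-in TextEncoder is swapped in for the utf8.js library purely to keep the example self-contained. The character class covers the ISO-8859-7 repertoire, with the unassigned U+03A2 left out as an assumption:

```javascript
// Matches a whole string made only of characters that exist in ISO-8859-7.
var ISO_8859_7_ONLY = /^[\u0000-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03A1\u03A3-\u03CE]*$/;

// Hypothetical helper: count bytes as ISO-8859-7 when possible, else as UTF-8.
function countBytes(text) {
  if (ISO_8859_7_ONLY.test(text)) {
    // Single-byte encoding: one byte per character.
    return { encoding: "ISO-8859-7", bytes: text.length };
  }
  // Fallback: let the platform encode to UTF-8 and measure the result.
  return { encoding: "UTF-8", bytes: new TextEncoder().encode(text).length };
}

var greekOnly = countBytes("ΦX");   // fits in ISO-8859-7
var mixed = countBytes("ΦX族");     // 族 forces the UTF-8 fallback
```

Returning the encoding alongside the count lets the interpreter show which scoring rule was applied.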

Create your own custom variant of ISO-8859-7 that uses a specific byte (or more than one) to signify that the next 2 or 3 bytes belong to a single Unicode char. This can be pretty much as simple or complex as you like, from one byte marking a 2-byte char and another marking a 3-byte char, to every byte between 0x80 and 0x9F introducing a different sequence. Here's a basic example that uses 0x80 as the 2-byte marker and 0x81 as the 3-byte marker (assumes the text is encoded in ISO-8859-1):

function reUnicode(text) {
  var newtext = "";
  for (var i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0x80) {
      newtext += String.fromCharCode((text.charCodeAt(++i) << 8) + text.charCodeAt(++i));
    } else if (text.charCodeAt(i) === 0x81) {
      var charcode = (text.charCodeAt(++i) << 16) + (text.charCodeAt(++i) << 8) + text.charCodeAt(++i) - 65536;
      newtext += String.fromCharCode(0xD800 + (charcode >> 10), 0xDC00 + (charcode & 1023)); // Convert into a UTF-16 surrogate pair
    } else {
      newtext += convertFrom1to7(text[i]);
    }
  }
  return newtext;
}
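For the byte counter itself, the same scheme can be scored without building the encoded string. The sketch below assumes the markers from the example above (0x80 followed by 2 bytes, 0x81 followed by 3 bytes) and assumes that U+0080 and U+0081 must themselves be escaped, since their byte values are taken by the markers. The name customByteCount is hypothetical:

```javascript
// Matches one character that is stored as a single byte in the custom variant:
// the ISO-8859-7 repertoire minus U+0080 and U+0081 (reserved as markers here).
var SINGLE_BYTE = /^[\u0000-\u007F\u0082-\u00A0\u2018\u2019\u00A3\u20AC\u20AF\u00A6-\u00A9\u037A\u00AB-\u00AD\u2015\u00B0-\u00B3\u0384-\u0386\u00B7\u0388-\u038A\u00BB\u038C\u00BD\u038E-\u03A1\u03A3-\u03CE]$/;

// Hypothetical helper: byte length of text in the marker-based variant.
function customByteCount(text) {
  var total = 0;
  for (var ch of text) {                 // for...of walks code points, not UTF-16 units
    if (SINGLE_BYTE.test(ch)) {
      total += 1;                        // plain ISO-8859-7 byte
    } else if (ch.codePointAt(0) <= 0xFFFF) {
      total += 3;                        // 0x80 marker + 2 bytes
    } else {
      total += 4;                        // 0x81 marker + 3 bytes
    }
  }
  return total;
}

var score = customByteCount("ΦX族");     // 1 + 1 + 3
```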

I can go into either method in more detail if you desire.
