Go To... offset ignores BOM
First, thank you guys for a great, great text editor. Brilliant work.
Now, working with UTF-8 files, I noticed that “Go To…/Offset” does not take into account the leading 3-byte BOM (EF BB BF).
The simplest example: create a UTF-8 BOM file and type “ab”. Then bring up the “Go To…” dialog and enter 2 as the offset. This puts the caret behind the “b”. It shouldn’t: offset-wise, it should remain before the “a”.
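To see why, here is a quick look at the raw bytes of such a file (an illustration in Python, not anything Notepad++ actually runs):

```python
# A UTF-8-with-BOM file containing "ab": the codec "utf-8-sig"
# prepends the EF BB BF signature.
data = "ab".encode("utf-8-sig")
print(data.hex(" "))      # ef bb bf 61 62
# Byte offset 2 is still inside the 3-byte BOM, i.e. before the "a",
# yet the Go To... dialog puts the caret after the "b".
print(data[:2].hex(" "))  # ef bb
```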
I get that it might be a tricky problem to implement correctly, but ignoring the BOM ignores bytes that affect the true offset in the file.
(Note that the “length” indicator in the status bar does not report the true size of the file either)
Thank you all.
You should not be able to edit BOM. BOM is something the editor adds on save.
You also don’t see BOM with ‘View -> Show Symbol -> …’ which is OK.
‘Offset’ is an offset in symbols, not bytes. Whether an encoding takes 1 or 4 bytes to encode a specific symbol does not matter; the offset will still be 1.
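The distinction can be sketched in a few lines of Python (illustrative only, not editor code):

```python
# One symbol vs its byte footprint in various encodings.
ch = "ß"                                 # a single symbol...
assert len(ch) == 1                      # ...so a symbol offset advances by 1
assert len(ch.encode("utf-8")) == 2      # ...even though UTF-8 stores it in 2 bytes
assert len(ch.encode("utf-32-le")) == 4  # ...and UTF-32 in 4
print("symbol offset +1, regardless of byte width")
```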
I’m not talking about editing the BOM, just accounting for it in the ‘Go to…/offset’.
Thank you for pointing out that ‘length’ does not report file size at all. I was getting all kinds of errors.
I have to partially agree and disagree.
Afaik, offset takes bytes into account but sometimes, obviously, can’t display it “correctly”, as only one “char-width” position is reserved even though the char itself uses more bytes for its representation.
And length is the length of the buffer loaded into the Scintilla view, in bytes.
So it might be the file length as well, or not, in the case of BOM-encoded files.
If “length” represented bytes, then:
- For UTF16 encodings every 2 successive offsets would jump to the same location.
- Changing encoding from UTF8 to UTF16 would change the length.
This is not the case.
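A quick check of how byte length would vary with encoding, which is the test behind that argument (a sketch, nothing N++ actually runs):

```python
# Same two symbols, different byte counts per encoding.
text = "ab"
print(len(text.encode("utf-8")))      # 2 bytes
print(len(text.encode("utf-16-le")))  # 4 bytes: every symbol takes 2 bytes here,
                                      # so a byte-based length would double on
                                      # switching the encoding -- which N++'s
                                      # "length" does not do
```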
‘View -> Summary’ shows file length in bytes (and only for saved files).
It seems that ‘Offset’ counts Unicode symbols, which in my opinion is the correct thing to do.
An editor deals with symbols. Its internal encoding for each symbol is irrelevant. On save it should re-encode the document content into properly encoded file and on load do the reverse. These are the only time where BOM is relevant.
I do see your points, and I wasn’t aware that this is different for UTF-8 encoded files.
My first example shows that the file length is 15 but there are only 13 visible symbols.
And knowing that the Scintilla docs state
SCI_GETTEXTLENGTH → int, SCI_GETLENGTH → int — both these messages return the length of the document in bytes.
I was under the impression that this is the case for all documents.
Here is another example.
UTF-8 encoded text is aßz
HexEditor shows 4 bytes
Length shows 4
Summary shows 4
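That 4-byte count can be reproduced outside the editor (a quick Python check of the same string):

```python
# "aßz" is 3 symbols but 4 bytes in UTF-8, because "ß" needs 2 bytes.
text = "aßz"
encoded = text.encode("utf-8")
print(len(text))         # 3 symbols
print(len(encoded))      # 4 bytes
print(encoded.hex(" "))  # 61 c3 9f 7a
```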
So it looks like npp is doing something under the hood, like encoding/decoding certain encodings
but not all. Or we have identified a bug.
Will try to find out what exactly is going on.
Thank you, Claudia. I agree a single solution is unclear. It might come down to providing two ways of looking at the file contents as I mentioned in a BOM option (or “true byte”) in the ‘Go to…/offset’. The status bar can also have data for both views (symbols + bytes).
Did you try pasting aßz, going to offset 2, and deleting? I don’t think that this is a useful feature.
For the record I was not even aware of the ‘offset’ option in the ‘goto’ dialog and I don’t find it very useful.
I guess the current state is a mess of half-baked definitions (from a time dominated by ANSI) and a lacking implementation that fails in the Unicode era.
I still think that the guidelines I described above are the correct way to implement it. User sees and edits symbols, not bytes.
‘offset’ option in the ‘goto’ dialog and I don’t find it very useful.
This feature of the Goto dialog can be useful to Pythonscript programmers (and probably also Luascript and Plugin (Scintilla) programmers) that often need to deal with “position” in a document. Not so much as a tool to change the current position, but as a way to see what the current caret position is during test/debug of code that works with position…
Obviously any user uses his own subset of features.
Went through some of code I wrote once upon a time to refresh my memory.
As far as I can tell, Scintilla’s logical view of the document is an array of bytes that holds the UTF-8 encoding of the text. For each line number, Scintilla knows its current start byte offset into the “array”.
This approach is simple and flexible, but it demands lots of attention from any user of it who cares about Unicode.
Scintilla will not protect anyone from placing the caret between bytes that compose a single UTF8 symbol.
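What "placing the caret between bytes" means can be illustrated by slicing a UTF-8 buffer at such a position (a sketch, not Scintilla code):

```python
# Cutting a UTF-8 byte buffer inside a multi-byte character leaves
# invalid UTF-8 -- which is why a byte-based caret position needs care.
buf = "aßz".encode("utf-8")  # 61 c3 9f 7a
try:
    buf[:2].decode("utf-8")  # splits "ß" (c3 9f) in half
except UnicodeDecodeError:
    print("byte offset 2 falls inside a symbol: invalid UTF-8 fragment")
```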
So I guess that length and offset displayed by NPP are actually byte offsets for UTF-8 encoding, regardless of the encoding in which the file is written.
I still think it is a confusing choice but these definitions should be made by people who actually use this feature.
What I have found out so far is the following.
First, npp tries to detect whether the file has a BOM; if it does,
it strips the BOM signature and continues reading the file converted to UTF-8.
If it isn’t a BOM file, it calls the chardet library to see which codepage to use.
If chardet reports UTF-8, npp goes on reading the file as is;
if not, it converts it to UTF-8.
But this, of course, happens only “virtually”, for the Scintilla control.
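The load path as described might look roughly like this (a hedged guess at the flow; the function name is made up, and a placeholder stands in for the actual chardet call — none of this is taken from the N++ sources):

```python
import codecs

def load_for_scintilla(raw: bytes) -> str:
    """Rough sketch of the described load path, not N++'s actual code."""
    if raw.startswith(codecs.BOM_UTF8):
        # BOM file: strip the signature and keep reading as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: N++ reportedly asks chardet for a codepage. As a stand-in
    # for chardet.detect(raw)["encoding"], assume Latin-1 was detected.
    detected = "latin-1"
    # Whatever was detected, the buffer handed to Scintilla ends up UTF-8.
    return raw.decode(detected)

print(load_for_scintilla(codecs.BOM_UTF8 + b"ab"))  # ab
```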