Discussion:
Handling mislabeled emails encoded with Windows-1252
Sebastian Poeplau
2018-07-14 12:40:28 UTC
Hi,

This email is to suggest a minor change in how notmuch handles text
encoding when displaying emails. The motivation is the following: I keep
receiving emails that are encoded with Windows-1252 but claim to be
ISO 8859-1. The two character sets only differ in the range between 0x80
and 0x9F where Windows-1252 contains special characters (e.g. “quotation
marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
thus causes some special characters in such emails to be displayed with
a replacement symbol for non-printable characters.

Of course, it would be best to fix the problem on the sender's side,
making their mail client declare the encoding correctly. However,
sometimes this is just not possible and we need to make do with what we
receive. The change I would thus like to suggest is to always treat
ISO 8859-1 as Windows-1252; since the former only contains non-printable
characters in the range where the two differ, we would not lose any
printable information. According to Wikipedia, this substitution is
common in email clients and browsers because of the frequent
mislabeling [1].
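
As a small illustration of the difference described above (a sketch in Python, not part of the patch; the sample bytes are made up), the same two bytes decode to control characters under ISO 8859-1 and to curly quotes under Windows-1252:

```python
# Bytes as a Windows mail client might send them: curly quotes are 0x93/0x94.
raw = b"\x93quotation marks\x94"

as_latin1 = raw.decode("iso-8859-1")    # what the header claims
as_cp1252 = raw.decode("windows-1252")  # what the sender actually used

# Under ISO 8859-1 the two bytes become C1 control characters (non-printable).
print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])

# Under Windows-1252 they become the intended typographic quotes.
print(as_cp1252)

# Outside 0x80-0x9F the two charsets agree, e.g. 0xE9 is "é" in both.
assert b"caf\xe9".decode("iso-8859-1") == b"caf\xe9".decode("windows-1252")
```

Because the printable ranges coincide, decoding a correctly labeled ISO 8859-1 message as Windows-1252 changes nothing visible.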

Attached is a simple patch that illustrates my suggestion. While
it works well for my limited use cases, it's obviously not entirely
reliable. Does anyone have a good idea how to better handle the issue? I
searched GMime for related functionality but didn't quite find what I
was looking for. Do you feel that the issue should be raised with the
GMime people instead?

Best regards,
Sebastian

[1] https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Windows-1252
David Bremner
2018-07-24 01:49:23 UTC
Post by Sebastian Poeplau
Hi,
This email is to suggest a minor change in how notmuch handles text
encoding when displaying emails. The motivation is the following: I keep
receiving emails that are encoded with Windows-1252 but claim to be
ISO 8859-1. The two character sets only differ in the range between 0x80
and 0x9F where Windows-1252 contains special characters (e.g. “quotation
marks”) while ISO 8859-1 only has non-printable ones. The mislabeling
thus causes some special characters in such emails to be displayed with
a replacement symbol for non-printable characters.
Hi Sebastian;

Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.

David
Sebastian Poeplau
2018-07-24 08:00:55 UTC
Hi David,
Post by David Bremner
Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.
Yes, I guess that should be a good enough heuristic for detecting
affected mail. I'll try to come up with a simple script and post it
here.

Cheers,
Sebastian
Sebastian Poeplau
2018-07-24 13:55:54 UTC
Hi again,
Post by Sebastian Poeplau
Post by David Bremner
Everyone's mail situation is unique, but I haven't noticed this
problem. Do you have a mechanical (e.g. scripted) way of detecting such
mails? I suppose it could just look for characters in the range 0x80 to
0x9F in allegedly ISO_8859-1 messages. A census of the situation in my
own mail would help me think about this problem, I think.
Yes, I guess that should be a good enough heuristic for detecting
affected mail. I'll try to come up with a simple script and post it
here.
Attached is a Python script that checks individual message files and
prints their name if it finds them to contain mislabeled Windows-1252
text. The heuristic seems to work well on my mail - let me know if you
encounter any issues!
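
The attached script is not reproduced in this archive. A minimal sketch of the heuristic David describes might look like the following; it is not Sebastian's script, and the function names `looks_mislabeled` and `report` as well as the list of accepted charset spellings are assumptions made for illustration:

```python
import email
from email import policy

# Bytes that are C1 control characters in ISO 8859-1 but mostly printable
# characters in Windows-1252.
SUSPICIOUS = frozenset(range(0x80, 0xA0))

# Spellings of the ISO 8859-1 label accepted by this sketch (an assumption;
# real mail may use others).
LATIN1_LABELS = {"iso-8859-1", "iso_8859-1", "latin1"}


def looks_mislabeled(raw_message: bytes) -> bool:
    """Return True if a text part labeled ISO 8859-1 contains 0x80-0x9F bytes."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        charset = (part.get_content_charset() or "").lower()
        if charset not in LATIN1_LABELS:
            continue
        # decode=True undoes base64 / quoted-printable and returns raw bytes.
        payload = part.get_payload(decode=True)
        if payload and any(byte in SUSPICIOUS for byte in payload):
            return True
    return False


def report(paths):
    """Return the subset of message files that look mislabeled."""
    hits = []
    for path in paths:
        with open(path, "rb") as handle:
            if looks_mislabeled(handle.read()):
                hits.append(path)
    return hits
```

Calling `report(sys.argv[1:])` and printing the result would give a per-file census of the kind David asks for.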

Cheers,
Sebastian
Jeffrey Stedfast
2018-07-24 14:09:19 UTC
Hi all (sent this to David already using Reply instead of Reply-All, d'oh!),

GMime actually comes with a stream filter (GMimeFilterWindows) which can auto-detect this situation.

In this particular case, you'd instantiate the GMimeFilterWindows like this:

filter = g_mime_filter_windows_new ("iso-8859-1");

"iso-8859-1" being the charset that the content claims to be in.

Then you'd pipe the raw (decoded but not converted to utf-8) content through the filter and afterward call g_mime_filter_windows_real_charset (filter) which would return, in this user's case, "windows-1252".
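
Conceptually, this detect-then-decode approach can be modelled in a few lines of Python. This is only a toy model under stated assumptions, not GMime's implementation: the class name, the `feed` and `real_charset` methods and the one-entry charset mapping are hypothetical.

```python
class WindowsCharsetDetector:
    """Toy model of a detect-only stream filter (not GMime's code)."""

    # Assumed mapping for this sketch; only the pair discussed in the thread.
    ISO_TO_WINDOWS = {"iso-8859-1": "windows-1252"}

    def __init__(self, claimed_charset: str):
        self.claimed_charset = claimed_charset.lower()
        self.saw_c1_bytes = False

    def feed(self, chunk: bytes) -> bytes:
        """Inspect a chunk for 0x80-0x9F bytes and pass it through unchanged."""
        if not self.saw_c1_bytes:
            self.saw_c1_bytes = any(0x80 <= byte <= 0x9F for byte in chunk)
        return chunk

    def real_charset(self) -> str:
        """Return the Windows charset if C1 bytes were seen, else the claim."""
        if self.saw_c1_bytes:
            return self.ISO_TO_WINDOWS.get(self.claimed_charset,
                                           self.claimed_charset)
        return self.claimed_charset


def decode_body(raw: bytes, claimed_charset: str, chunk_size: int = 4096) -> str:
    """First pass detects the charset, second pass decodes with it."""
    detector = WindowsCharsetDetector(claimed_charset)
    for start in range(0, len(raw), chunk_size):
        detector.feed(raw[start:start + chunk_size])
    # Python's windows-1252 codec leaves a few bytes undefined, so replace
    # undecodable bytes instead of raising.
    return raw.decode(detector.real_charset(), errors="replace")
```

The detector never modifies the data; it only records what it saw, which is why the content has to be run through it before the actual charset conversion.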

Hope that helps,

Jeff
Sebastian Poeplau
2018-07-24 14:19:20 UTC
Hi Jeff,
Post by Jeffrey Stedfast
GMime actually comes with a stream filter (GMimeFilterWindows) which can auto-detect this situation.
filter = g_mime_filter_windows_new ("iso-8859-1");
"iso-8859-1" being the charset that the content claims to be in.
Then you'd pipe the raw (decoded but not converted to utf-8) content through the filter and afterward call g_mime_filter_windows_real_charset (filter) which would return, in this user's case, "windows-1252".
Nice, this is exactly what I was looking for! Somehow I missed it when
checking GMime. I'll adapt my local fix and post the results here.

Thanks,
Sebastian
Sebastian Poeplau
2018-07-28 11:22:46 UTC
Hi all,

Here's the updated patch. It filters the message through the
GMimeFilterWindows that Jeff mentioned and then uses the charset it
detects for GMimeFilterCharset in the actual rendering of the message.

Jeff, is this how to use the filter correctly?

Cheers,
Sebastian
Jeffrey Stedfast
2018-07-28 12:25:42 UTC
Hi Sebastian,

Yes, that looks good. I would have probably unreffed the null_stream and null_stream_filter inside of that if-block rather than at the end of the function, but that's a stylistic issue that the notmuch authors can comment on. The patch as it stands should work correctly from what I can tell.

As an added optimization, you could try limiting that block of code to just when the charset is one of the iso-8859-* charsets.

The following code snippet should help with that:

charset = charset ? g_mime_charset_canon_name (charset) : NULL;
if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
...

The reason you need to use g_mime_charset_canon_name (if you decide to add the optimization) is that mail software does not always use the canonical form of the various charset names that they use. Often you will get stuff like "latin1" or "iso_8859-1".
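
The same idea, canonicalize the label first and then test the prefix, can be sketched in Python. This uses Python's codec registry rather than GMime, so the canonical spelling (`iso8859-N`) is Python's own, and the helper name `is_iso_8859` is hypothetical:

```python
import codecs


def is_iso_8859(charset):
    """Return True if `charset` names any ISO 8859 part, whatever its spelling."""
    if not charset:
        return False
    try:
        # codecs.lookup() resolves aliases such as "latin1" or "iso_8859-1";
        # Python's canonical name for these codecs has the form "iso8859-N".
        canonical = codecs.lookup(charset).name
    except LookupError:
        return False  # unknown label: leave the message alone
    return canonical.startswith("iso8859-")
```

Comparing the raw label against a fixed string would miss exactly the non-canonical spellings mentioned above, which is why the lookup comes first.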

Hope that helps,

Jeff

Sebastian Poeplau
2018-07-30 07:28:57 UTC
Hi,
Post by Jeffrey Stedfast
Yes, that looks good. I would have probably unreffed the null_stream
and null_stream_filter inside of that if-block rather than at the end
of the function, but that's a stylistic issue that the notmuch authors
can comment on. The patch as it stands should work correctly from what
I can tell.
I was worried about the string returned by
g_mime_filter_windows_real_charset: once I unref everything, isn't there
a risk of the filter being deleted? As far as I can tell from the code,
the returned charset might be a pointer into the filter object...
Post by Jeffrey Stedfast
As an added optimization, you could try limiting that block of code to
just when the charset is one of the iso-8859-* charsets.
charset = charset ? g_mime_charset_canon_name (charset) : NULL;
if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
...
The reason you need to use g_mime_charset_canon_name (if you decide to
add the optimization) is that mail software does not always use the
canonical form of the various charset names that they use. Often you
will get stuff like "latin1" or "iso_8859-1".
Nice, I'll add it.

Thanks a lot,
Sebastian
Sebastian Poeplau
2018-07-30 07:47:55 UTC
Hi,
Post by Sebastian Poeplau
Post by Jeffrey Stedfast
As an added optimization, you could try limiting that block of code to
just when the charset is one of the iso-8859-* charsets.
charset = charset ? g_mime_charset_canon_name (charset) : NULL;
if (wrapper && charset && !g_ascii_strncasecmp (charset, "iso-8859-", 9)) {
...
The reason you need to use g_mime_charset_canon_name (if you decide to
add the optimization) is that mail software does not always use the
canonical form of the various charset names that they use. Often you
will get stuff like "latin1" or "iso_8859-1".
Nice, I'll add it.
Updated patch attached.

Cheers,
Sebastian
David Bremner
2018-07-31 09:07:03 UTC
Post by Sebastian Poeplau
Post by Sebastian Poeplau
Nice, I'll add it.
Updated patch attached.
Cheers,
Sebastian
Thanks to both of you for working on this. The code looks ok to me, I
have only some procedural comments.

In order to merge it I'll need at least one test. I think
test/T300-encoding.sh is probably the right place. There are a few
different styles of test; you can either put things in variables as in
that file, or use the more dominant

test_begin_subtest "description"
cat << EOF > EXPECTED
this is my expected output
EOF
notmuch show STUFF > OUTPUT
test_expect_equal_file EXPECTED OUTPUT

Feel free to bug the list for help on making tests (or #notmuch on
freenode).

Please also use git-send-email to send your patch(es), with commit
messages with an eye to

https://notmuchmail.org/contributing/#index5h2

To minimize the chance of problems, it's probably best to base your
commits on master, although the patch you sent applied fine here.

Thanks,

David
Sebastian Poeplau
2018-07-31 09:49:31 UTC
Hi David,

Thanks for the hints! I'll prepare a test and the patch based on master
shortly.

Cheers,
Sebastian