Discussion:
'notmuch search thread:<>' lists multiple threads
Naveen N. Rao
2018-04-06 09:46:24 UTC
Greetings--
If I search for threads matching a specific thread-id, I am seeing
multiple results:

$ notmuch search --output=threads thread:00000000000c4d20
thread:00000000000c4d1e
thread:00000000000c4d20

If I list the messages from both those threads, they do belong to the
same original mailing list thread. It isn't clear why notmuch is
assigning different thread IDs. Is that to be expected under some
scenarios?

Also, it is a bit weird to see multiple threads being listed when
searching for a specific thread ID. Again, is this something to be
expected?


- Naveen
Naveen N. Rao
2018-04-06 10:23:04 UTC
Post by Naveen N. Rao
Greetings--
If I search for threads matching a specific thread-id, I am seeing
$ notmuch search --output=threads thread:00000000000c4d20
thread:00000000000c4d1e
thread:00000000000c4d20
Expanding on this:

[04/06 15:37:59 ~]$ notmuch search --output=messages thread:00000000000c4d1e
id:CAKTCnzngahex_sL2raoHFuXqTxgVV7a57R9YmcT1TN-***@mail.gmail.com
[04/06 15:49:34 ~]$
[04/06 15:38:01 ~]$ notmuch search --output=messages thread:00000000000c4d20
id:CAKTCnzngahex_sL2raoHFuXqTxgVV7a57R9YmcT1TN-***@mail.gmail.com
id:CAOSf1CFy0im+***@mail.gmail.com
id:20180405071500.22320-4-***@gmail.com
id:20180405071500.22320-3-***@gmail.com
id:20180405071500.22320-2-***@gmail.com
id:20180405071500.22320-1-***@gmail.com
[04/06 15:49:34 ~]$
[04/06 15:49:26 ~]$ notmuch show --format=raw id:CAKTCnzngahex_sL2raoHFuXqTxgVV7a57R9YmcT1TN-***@mail.gmail.com | grep -e "In-Reply-To" -e "References" -A2
In-Reply-To:
<CAOSf1CFy0im+***@mail.gmail.com>
References: <20180405071500.22320-1-***@gmail.com>
<20180405071500.22320-3-***@gmail.com>
<CAOSf1CFy0im+***@mail.gmail.com>
[04/06 15:50:01 ~]$
[04/06 15:50:02 ~]$ notmuch show --format=raw id:CAOSf1CFy0im+***@mail.gmail.com | grep -e "In-Reply-To" -e "References" -A1
In-Reply-To: <20180405071500.22320-3-***@gmail.com>
References: <20180405071500.22320-1-***@gmail.com>
<20180405071500.22320-3-***@gmail.com>


- Naveen
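
A side note on the header checks above: grep -A1/-A2 has to guess how many continuation lines a folded header occupies. A minimal sketch of an alternative, assuming a POSIX awk is available, is a small function that unfolds one named header from a raw message on stdin. The name header_value is made up for this illustration and is not part of notmuch.

```shell
#!/bin/bash
# header_value NAME: hypothetical helper, not a notmuch command.
# Reads a raw mail message on stdin and prints the unfolded value of each
# header called NAME (case-insensitive), one value per line.
header_value() {
    awk -v name="$1" '
    function flush() {
        if (keep) {
            gsub(/[ \t]+/, " ", val)   # collapse folding whitespace
            sub(/^ /, "", val)         # drop the leading space
            print val
        }
        keep = 0
    }
    BEGIN { prefix = tolower(name) ":" }
    /^$/ { exit }                      # blank line ends the header block
    /^[ \t]/ {                         # continuation of the previous header
        if (keep) val = val " " $0
        next
    }
    {
        flush()
        if (tolower(substr($0, 1, length(prefix))) == prefix) {
            keep = 1
            val = substr($0, length(prefix) + 1)
        }
    }
    END { flush() }
    '
}
```

It could be used as `notmuch show --format=raw id:... | header_value References`, or run over each path printed by `notmuch search --output=files id:...` to compare the In-Reply-To values of duplicate files.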
David Bremner
2018-04-08 03:04:35 UTC
Post by Naveen N. Rao
Greetings--
If I search for threads matching a specific thread-id, I am seeing
$ notmuch search --output=threads thread:00000000000c4d20
thread:00000000000c4d1e
thread:00000000000c4d20
This looks like a bug to me. I was able to replicate it in my own mail
store with the script at the end of the message. I haven't completely
analyzed the situation yet, but one thing I noticed is that in all
"bad threads", there are files with duplicate message-ids. Typical
output looks like

╭─ zancas:software/upstream/notmuch/test
╰─ (git)-[master]-% notmuch search thread:000000000001760a
thread:00000000000175e5 November 03 [1/2(3)] ***@gmx.us; Bug#846042: VTK 8 (unread)
thread:000000000001760a 2016-11-27 [1/2(3)] ***@gmx.us; Bug#846042: virtual/meta package for python-vtk (unread)

At least some of this mail data is public, but I'm not sure if the bad
threading is reproducible or not; I want to run a complete census
overnight before I reindex.

Even if the bug is non-deterministic, it probably lives in lib/add-message.cc

----------------------------------------------------------------------

count=0
success=0
for id in $(notmuch search --output=threads '*'); do
    count=$((count + 1))
    # the arithmetic expansion strips any padding that wc -l may add
    matches=$(( $(notmuch search --output=threads "$id" | wc -l) ))
    if [ "$matches" = 1 ]; then
        success=$((success + 1))
    else
        echo "bad thread: $id"
    fi
    if [ $((count % 1000)) -eq 0 ]; then
        echo "$count"
    fi
done

echo "count=$count success=$success"
David Bremner
2018-04-09 11:54:01 UTC
Post by David Bremner
At least some of this mail data is public, but I'm not sure if the bad
threading is reproducible or not; I want to run a complete census
overnight before I reindex.
Even if the bug is non-deterministic, it probably lives in lib/add-message.cc
I have a reproducible test for this bug now

http://pivot.cs.unb.ca/git?p=notmuch.git;a=shortlog;h=refs/heads/fix/thread-search

I still need to analyze the mails a bit more, but it looks like at least
one of the strange results is caused by multiple mail files sharing the
same message-id, but with different References headers (and no
In-Reply-To headers).

d
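
To check a thread for the situation described above (several files carrying one message-id), one rough approach is to print the id once per file and look for repeats. The sketch below assumes bash and the notmuch CLI; the function names are hypothetical, and only `notmuch search --output=messages` and `--output=files`, both used elsewhere in this thread, are relied on.

```shell
#!/bin/bash
# duplicate_ids: hypothetical helper. Reads one message-id per line on
# stdin (one line per mail file) and prints each id that occurs more than once.
duplicate_ids() {
    sort | uniq -d
}

# ids_per_file THREAD: hypothetical driver that needs notmuch installed.
# Prints "id:..." once for every file that carries that message-id.
ids_per_file() {
    local thread=$1 id
    for id in $(notmuch search --output=messages "$thread"); do
        notmuch search --output=files "$id" |
            while IFS= read -r _; do
                printf '%s\n' "$id"
            done
    done
}
```

Something like `ids_per_file thread:000000000001760a | duplicate_ids` would then list the message-ids in that thread that are backed by more than one file.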
David Bremner
2018-04-10 01:45:39 UTC
This is useful for understanding the case where different
message-files with the same message-id have distinct reference
headers.
---
devel/draw-thread | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
create mode 100755 devel/draw-thread

diff --git a/devel/draw-thread b/devel/draw-thread
new file mode 100755
index 00000000..628dcff4
--- /dev/null
+++ b/devel/draw-thread
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+# This script can be used like
+# NOTMUCH_CONFIG=test/tmp.T580-thread-search/notmuch-config \
+# devel/draw-thread thread:0000000000000002 | dot -Tpdf > thread2.pdf
+
+# In addition to notmuch, you will need the following tools installed
+# - graphviz
+# - formail (part of procmail)
+
+threadid=$1
+
+declare -a edges
+
+declare -a dest
+echo "digraph \"$threadid\" {"
+for messageid in $(notmuch search --output=messages $threadid); do
+ echo "subgraph \"cluster_$messageid\" {"
+ printf "\"%s\" [shape=folder];\n" ${messageid#id:}
+ for file in $(notmuch search --output=files $messageid); do
+ node=$(basename $file)
+ printf "\"%s\" [shape=note];\n" $node
+
+ mapfile -t dest < <(formail -x references < $file | tr '<>,' '"" ')
+ edge="\"$node\" -> { ${dest[*]} }"
+ edges+=("$edge")
+ done
+ echo "}"
+done
+
+for edge in "${edges[@]}"; do
+ echo "$edge"
+done
+
+echo "}"
--
2.16.3
Daniel Kahn Gillmor
2018-10-08 03:30:31 UTC
Post by David Bremner
This is useful for understanding the case where different
message-files with the same message-id have distinct reference
headers.
---
devel/draw-thread | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
create mode 100755 devel/draw-thread
fwiw, i think this tool is useful enough for debugging and visibility
that we should go ahead and include it in the notmuch source, under
devel/ -- we don't need to ship it publicly, but it'd be great to have
it easily available to all notmuch devs who might be dealing with this
kind of thing.

--dkg

Naveen N. Rao
2018-04-18 10:18:00 UTC
Post by David Bremner
Post by David Bremner
At least some of this mail data is public, but I'm not sure if the bad
threading is reproducible or not; I want to run a complete census
overnight before I reindex.
Even if the bug is non-deterministic, it probably lives in lib/add-message.cc
I have a reproducible test for this bug now
http://pivot.cs.unb.ca/git?p=notmuch.git;a=shortlog;h=refs/heads/fix/thread-search
Thanks for looking into this.
Post by David Bremner
I still need to analyze the mails a bit more, but it looks like at least
one of the strange results is caused by multiple mail files sharing the
same message-id, but with different References headers (and no
In-Reply-To headers).
In my case, the messages do have In-Reply-To headers. I end up with
two files per message: one from my inbox and one from the gmane archive
that I pull in. All the messages from the gmane archive seem to have a
re-written 'In-Reply-To' header, but 'Message-Id' and 'References' are
the same.

In the problematic email thread, all other files/messages get allotted a
single thread except for one of the messages. The offending message has
3 references compared to 1 or 2 references for the rest, but I don't
know if that's relevant here.

- Naveen
David Bremner
2018-04-22 00:45:49 UTC
Post by Naveen N. Rao
In my case, the messages do have In-Reply-To headers. I end up with
two files per message: one from my inbox and one from the gmane archive
that I pull in. All the messages from the gmane archive seem to have a
re-written 'In-Reply-To' header, but 'Message-Id' and 'References' are
the same.
That sounds like essentially the same issue, due to the fact that
notmuch prefers In-Reply-To when choosing a parent for a message.

Currently the database is correct (or at least correct by one not-crazy
definition): all of the reference and in-reply-to terms are attached to
the message document in the database. On the other hand, the in-memory
data structures currently assume that In-Reply-To is a unique value
(with ties broken at indexing time).

It might be that the solution is to read a list of in-reply-to values
and use all of them in threading. At a quick glance, that looks doable;
I'm just not sure about unintended consequences.

d
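
As a rough illustration of what "use all of them" could mean, the sketch below treats every parent link found in any file of a message as an edge and groups messages into connected components with a union-find in awk. This is not notmuch's implementation, which lives in the C/C++ library; the function name group_threads and the "child parent" input format are assumptions made for the example.

```shell
#!/bin/bash
# group_threads: hypothetical illustration, not notmuch code.
# Input:  lines of "child parent" message-ids, one line per parent link
#         found in any file of the child (a lone id is also accepted).
# Output: "root member" lines; members sharing a root are in one group.
group_threads() {
    awk '
    function find(x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        }
        return x
    }
    function add(x) {
        if (!(x in parent)) {
            parent[x] = x
            order[++n] = x
        }
    }
    NF >= 2 {
        add($1); add($2)
        a = find($1); b = find($2)
        if (a != b) parent[b] = a           # merge the two groups
    }
    NF == 1 { add($1) }
    END {
        for (i = 1; i <= n; i++) print find(order[i]), order[i]
    }'
}
```

With input such as "C B" and "C X" (two files of message C naming different parents), B, C and X all end up under one root, whereas keeping only a single parent per message would leave one of them out. The pairs themselves could be produced with formail -x, as the devel/draw-thread patch in this thread does for References.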
Naveen N. Rao
2018-06-28 10:36:05 UTC
Hi David,
Post by David Bremner
Post by Naveen N. Rao
In my case, the messages do have In-Reply-To headers. I end up with
two files per message: one from my inbox and one from the gmane archive
that I pull in. All the messages from the gmane archive seem to have a
re-written 'In-Reply-To' header, but 'Message-Id' and 'References' are
the same.
That sounds like essentially the same issue, due to the fact that
notmuch prefers In-Reply-To when choosing a parent for a message.
Currently the database is correct (or at least correct by one not-crazy
definition): all of the reference and in-reply-to terms are attached to
the message document in the database. On the other hand, the in-memory
data structures currently assume that In-Reply-To is a unique value
(with ties broken at indexing time).
It might be that the solution is to read a list of in-reply-to values
and use all of them in threading. At a quick glance, that looks doable;
I'm just not sure about unintended consequences.
Were you able to look into this again?
Using a list of in-reply-to values sounds like a good option, though I
clearly have no idea about other consequences from that. If you have a
patch, I can help test that.

Thanks,
Naveen
David Bremner
2018-06-30 13:42:23 UTC
Post by Naveen N. Rao
Were you able to look into this again?
Using a list of in-reply-to values sounds like a good option, though I
clearly have no idea about other consequences from that. If you have a
patch, I can help test that.
Sorry I haven't made any progress on this. Thanks for the reminder.

d
David Bremner
2018-08-30 12:52:20 UTC
Post by David Bremner
Post by Naveen N. Rao
Were you able to look into this again?
Using a list of in-reply-to values sounds like a good option, though I
clearly have no idea about other consequences from that. If you have a
patch, I can help test that.
Sorry I haven't made any progress on this. Thanks for the reminder.
d
It's not much progress but I did manage to make a test case.

id:20180730224555.26047-16-***@tethera.net

As it says in the commit message, it's not 100% clear this is your
problem, but it is a bug, and hopefully fixing it will help your issue.

d
Naveen N. Rao
2018-09-06 10:50:29 UTC
Post by David Bremner
Post by David Bremner
Post by Naveen N. Rao
Were you able to look into this again?
Using a list of in-reply-to values sounds like a good option, though I
clearly have no idea about other consequences from that. If you have a
patch, I can help test that.
Sorry I haven't made any progress on this. Thanks for the reminder.
d
It's not much progress but I did manage to make a test case.
As it says in the commit message, it's not 100% clear this is your
problem, but it is a bug, and hopefully fixing it will help your issue.
Thanks for continuing to look into this!

The test is close to what I have -- the only difference in my case is
that the Message-ID and References: fields match in the duplicate mail
files, but just the In-reply-to headers differ (the gmane one has a
meaningless/incorrect, re-written header).

Interestingly though, I re-checked the thread I had the original problem
with and notmuch seems to be able to cope with it better now. So, some
other changes seem to have helped with my original problem. I will keep
an eye out to see if any other threads cause problems (I do occasionally
see astroid crash, but I haven't seen if it is due to this issue with
notmuch or a different problem).

[09/06 16:19:19 ~]$ notmuch --version
notmuch 0.27
[09/06 16:19:20 ~]$ notmuch search --output=threads thread:00000000000c4d20
thread:00000000000c4d20
[09/06 16:19:22 ~]$ notmuch search --output=threads thread:00000000000c4d1e
thread:00000000000c4d1e
[09/06 16:19:26 ~]$ notmuch search --output=messages thread:00000000000c4d20
id:***@arbab-laptop.localdomain
id:CAKTCnzngahex_sL2raoHFuXqTxgVV7a57R9YmcT1TN-***@mail.gmail.com
id:CAOSf1CFy0im+***@mail.gmail.com
id:20180405071500.22320-4-***@gmail.com
id:20180405071500.22320-3-***@gmail.com
id:20180405071500.22320-2-***@gmail.com
id:20180405071500.22320-1-***@gmail.com
[09/06 16:19:35 ~]$ notmuch search --output=messages thread:00000000000c4d1e
id:CAKTCnzngahex_sL2raoHFuXqTxgVV7a57R9YmcT1TN-***@mail.gmail.com


- Naveen