Yikes...but your conception of the problem is similar to mine, and I think (hope) that as you outline it, you start to get some sense of the scope of the effort required.
The existing tables are wired up sort of as you described, using message IDs as the GUIDs that bind tables. Because XML files package entire threads, there's a table that maps individual message IDs to parent thread IDs. The major shortcoming is that I don't store message text in tables.
I agree that parsing message text into word tables beats the snot outta trying to search a large text field across 400,000+ records. This is how all non-trivial search engines do it. As I understand it, it's a multi-stage process: each of the terms is looked up on its own, returning an individual result set, and then the result sets are combined (intersected if every term has to match, unioned if any one of them will do). Most search engines seem to take it further (but this is where I'm trying to blackbox proprietary algorithms) by computing "distance" between words in the source text as a way to rank relevancy, which implies that each word record lists not only the source document, but a character offset within it. Ever notice that you get different results at Google simply by changing the order of search terms? They're obviously doing more than simple set arithmetic on the result sets.
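A rough sketch of that multi-stage lookup, in Python, assuming an in-memory dictionary in place of real word tables. The names `tokenize`, `build_index` and `search` are made up for illustration, and the proximity score is just one guess at a "distance" measure, not anything a real search engine is known to use.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Yield (word, character_offset) pairs for lowercased text."""
    for match in re.finditer(r"[a-z0-9']+", text.lower()):
        yield match.group(), match.start()


def build_index(messages):
    """Map each word to {message_id: [character offsets]}.

    `messages` is a dict of message_id -> message text (hypothetical layout).
    """
    index = defaultdict(dict)
    for msg_id, text in messages.items():
        for word, offset in tokenize(text):
            index[word].setdefault(msg_id, []).append(offset)
    return index


def search(index, query):
    """Return ids of messages containing every query term, closest terms first."""
    terms = [word for word, _ in tokenize(query)]
    if not terms:
        return []
    # Stage 1: one result set per term.
    result_sets = [set(index.get(term, {})) for term in terms]
    # Stage 2: keep only messages that appear in every result set.
    hits = set.intersection(*result_sets)

    def spread(msg_id):
        # Assumed relevance measure: for the best occurrence of the first
        # term, total character distance to the nearest occurrence of each
        # other term. Smaller means the terms sit closer together.
        return min(
            sum(min(abs(a - b) for b in index[t][msg_id]) for t in terms[1:])
            for a in index[terms[0]][msg_id]
        )

    return sorted(hits, key=lambda m: (spread(m), m))
```

With two messages that both contain "cat" and "mat", the one where the words sit closer together sorts first; a message missing either term drops out at the intersection step.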
That latter observation seems to suggest a way to do phrase searching without having to resort to full-text searching. It should be possible to use the (hypothesized) character offsets to put each message's hits in order and check whether they sit next to one another in the sequence the phrase requires. Doing it this way may explain why Google phrase searching isn't thrown off by extraneous punctuation and the like inside the quote marks.
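A minimal sketch of phrase matching from stored positions, in Python. One swap to flag: it stores each word's ordinal position (first word, second word, ...) rather than a character offset, on the assumption that ordinals make adjacency trivial to test and make punctuation irrelevant. All names here are hypothetical.

```python
import re
from collections import defaultdict


def words(text):
    """Split text into lowercased words, discarding punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_positional_index(messages):
    """Map each word to {message_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for msg_id, text in messages.items():
        for position, word in enumerate(words(text)):
            index[word][msg_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of messages where the phrase's words appear consecutively."""
    terms = words(phrase)
    if not terms or any(term not in index for term in terms):
        return []
    # Narrow to messages that contain every word at all.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for msg_id in sorted(candidates):
        # The phrase matches if, for some occurrence of the first word at
        # position s, the i-th word of the phrase occurs at position s + i.
        if any(
            all(start + i in index[term][msg_id] for i, term in enumerate(terms))
            for start in index[terms[0]][msg_id]
        ):
            matches.append(msg_id)
    return matches
```

Because the query is tokenized the same way as the messages, a quoted phrase with stray commas or colons in it still lines up with the stored positions.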
I'm a little leery of using an automatic statistical approach to creating a list of words to exclude as common (or "junk" words), at least when it's based on the same corpus as the one to which it's applied. What if relevant words like "Die", "Liberal", "Scum", "Commie", "Pinko" are used obsessively and show up as commonly as "the"? Better, I think, to find a public domain list of the most common junk words, because such a list would be based on a much larger sample.
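To make the worry concrete, here is a small Python sketch that derives "junk" words from the corpus itself and compares the result with a fixed list. The fixed list below is a tiny illustrative stand-in, not an actual published stop-word list, and the 80% threshold and function names are assumptions.

```python
import re
from collections import Counter

# Illustrative stand-in for an externally sourced junk-word list.
FIXED_STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def words(text):
    """Split text into lowercased words, discarding punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def corpus_stop_words(messages, threshold=0.8):
    """Return words that occur in at least `threshold` of the messages."""
    doc_freq = Counter()
    for text in messages:
        # Count each word once per message (document frequency).
        doc_freq.update(set(words(text)))
    cutoff = threshold * len(messages)
    return {word for word, count in doc_freq.items() if count >= cutoff}


def indexable_words(text, stop_words):
    """Return the words of `text` that would actually be indexed."""
    return [word for word in words(text) if word not in stop_words]
```

On a corpus where one loaded word shows up in nearly every message, `corpus_stop_words` flags it right alongside "the", so it would vanish from the index; filtering with the fixed list keeps it searchable.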
So many juicy technical problems, so little time....