Why not leave the core storage the same then just write your own spidering code to walk all the trees and build a DB that contains the search information and populates a database. search against the DB yields just a list of pointers to threads (limit 10). Then when the view requests the thread via the search interface, just serve it up as any-old-request...
I have never implemented a MySQL search of this type before for a message board, and this domain is so very not my cup of tea.... but off the top of my head and for fun...
table threads
uid|pointer_to_thread_file|uid_of_info|
table thread_info
uid|pointer to participants table uid|dict|earliest|latest
table participants
uid| participants
This would be my first hack at a first approximate table layout. I don't thing this would meet normal form tests- but whatever, the tables would be easy to refactor- unless I have something bass ackwards, which is quite possible because this is not my thing and this is just off the top of my head.
Probably to beautify the tables, make thread_info just possess pointers to other tables for dict, earliest, latest etc.
Let the thread_info.dict just be a basic 'unique' of the words in the thread- remove common words ('and', 'the' etc). Let your crawler compute this. thread_info.earliest and thread_info.latest could be datetime. participants.participants would just be a uniqued list of participants.
For speed, you could probably index the thread_info table off the participants uid- this would be the hardest table to search.
MySQL is good at string matching in my experience, but some optimization here might be good- that is by nature time consuming. It seems like you should be able to figure out some optimization on Frequent Topics like WMD, Iraq, Iran Suni Shia etc- but I have not thought that through. Sorting the participants field and making sure the field is unique -or-recycled might help you save entries there (like I bet a LOT). Possibly pulling out the dict field into a table and sorting, unique-ing might surprise you with the number of entries there you can eliminate.
Anyway- I'm sure somewhere there is a root index file or file tree or such can be built pretty fast for your spider to start walking. As long as your 'spider' is efficient and won't eat itself out of memory and halt, you should be able to let it run every night for several weeks and eventually build the database up. Once the database is up, then you should just need a bit of code (injection safe, of course) to take things like
Author: *****
Time after: 11/11/2003
Time before: 11/11/2004
Keywords: Die, Liberal, Scum, Commie, Pinko
And make them into SQL queries.
I don't know how you'll defend against people slamming your system with absurdly large queries. I never have done a system that did real-time query with string matching against such a huge DB.
Trying not to do string matching inside a huge table (or where it will be tempted to parse into thousands of entries ) would be a good idea, but I have no idea about how to do that without creating an absurdly large DB, which would in turn shoot you in th foot in so many ways its not even funny.
When you unique the words, try also to sum the instances count. Information theory would suggest that all the words that have extremely high frequencies globally will be absolutely irrelevant to semantics and thus also to search (words like 'how' or 'think' or etc), and this will probably be a smarter way of finding out irrelevant words than guessing.
So phrase-searching obviously wouldn't be there on this idea, but you could get key word search. For phrase searching you would have to have the whole text of the thread... and that would take some resources... At that point better to index down to the message level- IE have a message level table with full text field, author field and pointer to the thread uid. I would still have a "dict" field in play so you can narrow on keyword before you search message text for phrase. So you should be able to get down to a hand full of threads based on author, time, keyword, and then reduce the painful search of the full text to just the messages in those threads.
Anyway, just some thinking aloud- obviously take this all with a pound of salt, because this aint really my thing.
Yeti