GeekSpeak
re: It seems like the following to me
Posted by Robert on 4/20/2008 11:35:59 AM
In Reply to: It seems like the following to me posted by Yeti on 4/20/2008 10:55:35 AM
That's a good analysis with good ideas, but even the "low-feature" approach requires a non-trivial amount of work.

Messages are stored in the form of threads in XML files; each thread file contains all its messages and organizes them into the thread structure. Traditional tabular databases are used for indexing, mainly to keep track of a message's location in the event that it, or the thread it belongs to, is moved. The indexes don't redundantly store the message body; if they did, grafting on a quick-n-dirty search would be much easier.
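
To make that layout concrete, here is a rough sketch in Python. The element names, attribute names, file path, and table schema are all made up for illustration; they are assumptions, not the board's actual format. It only shows the idea of a location index that points at thread files without holding any message text.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical thread file: replies are nested inside the message they answer.
THREAD_XML = """
<thread id="t100">
  <message id="m1" author="Yeti">
    <body>It seems like the following to me</body>
    <message id="m2" author="Robert">
      <body>That's a good analysis</body>
    </message>
  </message>
</thread>
""".strip()

def index_thread(conn, thread_file, xml_text):
    """Record which thread file holds each message; the body is not copied."""
    root = ET.fromstring(xml_text)
    rows = [
        (msg.get("id"), root.get("id"), thread_file)
        for msg in root.iter("message")  # walks nested replies too
    ]
    conn.executemany("INSERT OR REPLACE INTO msg_index VALUES (?, ?, ?)", rows)

def locate(conn, msg_id):
    """Return the thread file that contains msg_id, or None if unknown."""
    row = conn.execute(
        "SELECT thread_file FROM msg_index WHERE msg_id = ?", (msg_id,)
    ).fetchone()
    return row[0] if row else None

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE msg_index ("
    "msg_id TEXT PRIMARY KEY, thread_id TEXT, thread_file TEXT)"
)
index_thread(conn, "threads/t100.xml", THREAD_XML)
print(locate(conn, "m2"))
```

Because the index has no body column, a text search cannot be answered from it; it can only say where a message lives.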

XML storage is best at supporting the routine operations of the board, but its one big disadvantage is that it doesn't lend itself to searching. For that, I'd have to, at a minimum, copy the text of every message into a DB table, and use SQL to search the text field. A better approach is to parse the words in each message out into a lexical tree and AND the hit sets to find messages with all search words. And then there's phrase recognition...well, all that's about doing search right.
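
A minimal sketch of the "AND the hit sets" idea, in Python. Note the swap: it uses a plain dictionary from word to message ids in place of a lexical tree, which gives the same lookups for whole words. The tokenizer, the function names, and the sample messages are all assumptions for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(messages):
    """Map each word to the set of message ids whose body contains it."""
    index = defaultdict(set)
    for msg_id, body in messages.items():
        for word in tokenize(body):
            index[word].add(msg_id)
    return index

def search_all(index, query):
    """Return ids of messages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    hit_sets = [index.get(word, set()) for word in words]
    hit_sets.sort(key=len)          # intersect the smallest set first
    result = set(hit_sets[0])
    for hits in hit_sets[1:]:
        result &= hits              # AND the hit sets together
    return result

# Hypothetical sample data.
messages = {
    1: "XML storage is best for routine operations",
    2: "Searching XML files is slow",
    3: "A database index makes searching fast",
}
index = build_index(messages)
print(search_all(index, "xml searching"))
```

Phrase recognition would need word positions stored alongside the ids, which this sketch leaves out.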

Back to "low-feature"...redacting names from the message base would be a massive undertaking in its own right, given the architecture. We're talking about editing somewhere around 80,000 XML files and 400,000+ database records, and maintaining the integrity of the substitutions...that is, for example, "D***** L****" and "DL" and "Yeti" have to resolve to a single user. I've analyzed handles before, and found the expected human variability in capitalization, punctuation, and spelling. Over the years one user has been "EP", "E.P.", "EP Grondine" and "E.P. Grondine", along with quite a few misspellings and mispunctuations (e.g. "E.P grondine"). This is a fuzzy search set!
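
For a sense of what resolving handles to a single user might involve, here is a hedged sketch in Python. The alias table, the function names, and the 0.85 cutoff are assumptions for illustration. Normalizing case and punctuation handles the easy variants, a hand-maintained alias table covers initials, and the standard library's difflib catches near-miss spellings.

```python
import difflib
import re

# Hypothetical hand-maintained table: normalized short form -> canonical key.
ALIASES = {"ep": "epgrondine"}

def normalize_handle(handle):
    """Lowercase the handle and drop punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]", "", handle.lower())

def canonical_user(handle, known_users, cutoff=0.85):
    """Resolve a raw handle to a canonical key, or return its normalized form."""
    key = normalize_handle(handle)
    key = ALIASES.get(key, key)
    if key in known_users:
        return key
    # Fuzzy fallback for misspellings; cutoff is a guess and needs tuning.
    close = difflib.get_close_matches(key, known_users, n=1, cutoff=cutoff)
    return close[0] if close else key

known_users = ["epgrondine"]
for raw in ["EP", "E.P.", "EP Grondine", "E.P. Grondine", "E.P grondine"]:
    print(raw, "->", canonical_user(raw, known_users))
```

A fuzzy cutoff can merge two different people with similar handles, so any real pass over the message base would want the matches reviewed by hand before names are substituted.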

It may be tempting to say that it's not necessary to be so careful with "low-feature", but screwing it up could make it very hard to do the job right later. I don't want to lose the association between posters and their messages. That would be rather unfortunate, eh what?
