Technical information

(Source: http://corpus.byu.edu/time/.)

The basic architecture and interface are similar to those of the other large corpora that I have placed online. The functionality of the corpus is due in large part to the architecture on which it is based. There are central n-gram databases that contain information on each of the 100 million words in the corpus. These are linked to "register info" tables that allow for the analysis of register variation, as well as to separate tables for synonyms, lemmas, forms over time, etc.
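
As a rough illustration of how a central word table can be linked to a "register info" table, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table names, column names, and sample rows are all hypothetical; the actual schema of the corpus is not described here.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with two linked tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE register_info (
            text_id  INTEGER PRIMARY KEY,
            register TEXT,
            year     INTEGER
        );
        CREATE TABLE tokens (
            token_id INTEGER PRIMARY KEY,
            text_id  INTEGER REFERENCES register_info(text_id),
            word     TEXT,
            lemma    TEXT,
            pos      TEXT
        );
        """
    )
    # Invented sample data, only to make the query below return something.
    conn.executemany(
        "INSERT INTO register_info VALUES (?, ?, ?)",
        [(1, "news", 1925), (2, "fiction", 1995)],
    )
    conn.executemany(
        "INSERT INTO tokens (text_id, word, lemma, pos) VALUES (?, ?, ?, ?)",
        [
            (1, "walked", "walk", "verb"),
            (1, "walks", "walk", "verb"),
            (1, "dog", "dog", "noun"),
            (2, "walking", "walk", "verb"),
            (2, "dogs", "dog", "noun"),
        ],
    )
    return conn


def frequency_by_register(conn, lemma):
    """Count occurrences of a lemma per register by joining the two tables."""
    return conn.execute(
        """
        SELECT r.register, COUNT(*)
        FROM tokens AS t
        JOIN register_info AS r ON r.text_id = t.text_id
        WHERE t.lemma = ?
        GROUP BY r.register
        ORDER BY r.register
        """,
        (lemma,),
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    print(frequency_by_register(demo, "walk"))
```

The point of the join is that register (or date) information is stored once per text rather than once per word, so the word table stays narrow while still supporting queries by register.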

The speed of the corpus is due both to the database architecture and to the hardware on which it is housed. The Windows Server 2003 machine has dual quad-core 2.8 GHz Xeon processors with 16 GB RAM, and the disk subsystem has been specially configured for this particular application. The SQL Server 2005 databases are stored on five 15,000 rpm SCSI hard drives, which are linked together in a RAID 5 configuration with two ultra-fast SCSI RAID controllers. The temp tables, the OS, and the web pages themselves are on separate 15,000 rpm SCSI drives. The main tables also have indexes on all of the relevant table columns, as well as clustered indexes on the most crucial columns. All of this provides for very fast searches, usually less than one second for even the most complicated queries involving word forms, parts of speech, frequency, and register information.
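
To show why column indexes matter for lookups of this kind, the following sketch compares the query plan for the same search before and after an index is created. It again uses sqlite3 as a stand-in, with a hypothetical tokens table and index name. SQL Server's clustered indexes (created with CREATE CLUSTERED INDEX) are not reproduced here, since SQLite has no equivalent statement.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each plan row holds the human-readable detail.
    return " | ".join(row[3] for row in rows)


def demo_index_effect():
    """Return the plan for a lemma lookup before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; the real corpus schema is not given in this document.
    conn.execute(
        "CREATE TABLE tokens ("
        "token_id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT)"
    )
    sql = "SELECT word, pos FROM tokens WHERE lemma = ?"

    before = query_plan(conn, sql, ("walk",))  # no index: full table scan
    conn.execute("CREATE INDEX idx_tokens_lemma ON tokens (lemma)")
    after = query_plan(conn, sql, ("walk",))   # index available: direct search

    conn.close()
    return before, after


if __name__ == "__main__":
    plan_before, plan_after = demo_index_effect()
    print("before:", plan_before)
    print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should report a scan of the whole table and the second a search that uses idx_tokens_lemma.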

The corpus began as raw text files that were received from a number of sources listed above. I converted these texts to a 100 million word flat file, imported the flat file into SQL Server, created the subsequent n-gram databases, formulated the SQL commands to link all of the tables and extract the desired data, and carried out all of the HTML / ASP / ADO / JavaScript coding for the Web interface.
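
The step from a flat file of words to n-gram tables can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used for the corpus: the function name is hypothetical, and the input is assumed to be a simple list of tokens in running order.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in a flat token sequence."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the sequence; each window is one n-gram.
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    # Invented sample text standing in for the flat file.
    sample = "the cat sat on the mat the cat".split()
    for ngram, freq in count_ngrams(sample, 2).most_common(3):
        print(" ".join(ngram), freq)
```

Each (n-gram, frequency) pair would then become one row in an n-gram table, which is what makes frequency queries fast: the counting is done once, at build time, rather than on every search.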