Document & Collection Analysis Engine (cont.)
Index document collection
- normalize words to essential roots or tokens
- removal of suffixes and prefixes
- stem transformation
- e.g., computers -> computer -> compute
- compute frequency of occurrence of each word token across document collection (TOTFREQ)
- maintain total number of words in collection
- dynamic noise word removal
- high and low frequency tokens
- based on ratio of DOCFREQ/TOTFREQ
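
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual implementation. The suffix list, the function names (stem, index_collection, noise_tokens), and the low/high thresholds are all hypothetical. DOCFREQ is assumed here to mean the number of documents that contain a token. The outline does not say how the DOCFREQ/TOTFREQ ratio is thresholded, so the sketch simply flags tokens whose ratio falls outside a caller-supplied band.

```python
import re
from collections import Counter

# Hypothetical suffix list for illustration only; a real stemmer
# (for example the Porter algorithm) applies many more rules.
SUFFIXES = ("ers", "ing", "er", "ed", "es", "e", "s")


def stem(word):
    """Strip the longest matching suffix, keeping at least 3 characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_collection(docs):
    """Return (TOTFREQ, DOCFREQ, total word count) for a list of texts."""
    totfreq = Counter()  # occurrences of each token across the collection
    docfreq = Counter()  # assumed meaning: number of documents containing the token
    for text in docs:
        tokens = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
        totfreq.update(tokens)
        docfreq.update(set(tokens))
    total_words = sum(totfreq.values())
    return totfreq, docfreq, total_words


def noise_tokens(totfreq, docfreq, low, high):
    """Flag tokens whose DOCFREQ/TOTFREQ ratio lies outside [low, high].

    The band is an assumption; the source only says the ratio is used.
    """
    return {
        token
        for token, total in totfreq.items()
        if not (low <= docfreq[token] / total <= high)
    }


docs = [
    "the cat and the dog and the bird",
    "the computers compute",
    "computing is fun",
]
totfreq, docfreq, total_words = index_collection(docs)
noise = noise_tokens(totfreq, docfreq, low=0.6, high=1.0)
print(sorted(noise))  # prints ['and', 'the']
```

In this toy collection, "the" and "and" occur several times within the documents that contain them, so their ratio (0.5) falls below the hypothetical lower bound and they are flagged. Because the ratio can never exceed 1, the upper bound only takes effect when set below 1.0, where it would flag tokens that occur exactly once in each document containing them.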