Document & Collection Analysis Engine (cont.)

Index document collection
- normalize words to essential roots or tokens
  - removal of suffixes and prefixes
  - stem transformation
  - e.g., computers -> computer -> compute
- compute frequency of occurrence of each word token across document collection (TOTFREQ)
- maintain total number of words in collection
- dynamic noise word removal
  - high- and low-frequency tokens
  - based on the ratio DOCFREQ/TOTFREQ
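
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual implementation: the suffix list, the function names, and the cutoffs `min_count` and `max_share` are all hypothetical, DOCFREQ is assumed to mean the number of documents containing a token, and prefix removal is omitted. The slide does not say how the DOCFREQ/TOTFREQ ratio is thresholded, so the sketch only reports that ratio and uses plain frequency cutoffs as a stand-in for noise word detection.

```python
import re
from collections import Counter

# Hypothetical toy suffix list, longest first; a real system would use a
# proper stemmer (for example the Porter algorithm) instead.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "e", "s")


def normalize(word):
    """Lowercase a word and strip at most one suffix, keeping >= 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_collection(docs):
    """Return (TOTFREQ, DOCFREQ, total word count) for a list of texts."""
    totfreq = Counter()  # occurrences of each token across the collection
    docfreq = Counter()  # assumed meaning: documents containing the token
    total_words = 0
    for text in docs:
        tokens = [normalize(w) for w in re.findall(r"[A-Za-z]+", text)]
        totfreq.update(tokens)
        docfreq.update(set(tokens))
        total_words += len(tokens)
    return totfreq, docfreq, total_words


def doc_tot_ratio(totfreq, docfreq):
    """DOCFREQ/TOTFREQ per token, the ratio named on the slide."""
    return {token: docfreq[token] / totfreq[token] for token in totfreq}


def noise_tokens(totfreq, total_words, min_count=2, max_share=0.3):
    """Flag tokens that are very rare or very common (assumed cutoffs)."""
    return {
        token
        for token, freq in totfreq.items()
        if freq < min_count or freq / total_words > max_share
    }


if __name__ == "__main__":
    docs = [
        "Computers compute the index",
        "The computer indexes the words",
        "the noise",
    ]
    totfreq, docfreq, total_words = index_collection(docs)
    print(totfreq, docfreq, total_words)
    print(doc_tot_ratio(totfreq, docfreq))
    print(noise_tokens(totfreq, total_words))
```

In this toy stemmer "computers", "computer", and "compute" all reduce to the same token, which is the effect the normalization step is after. Because the noise list is recomputed from the collection's own counts, it changes as documents are added, which is presumably what "dynamic" refers to.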
