Thesis
Exploring potential improvements to term-based clustering of web documents
Washington State University
Master of Science (MS), Washington State University
2007
Handle: https://hdl.handle.net/2376/107082
Abstract
With the size and diversity of the information content on the Web growing at an unfathomable pace, retrieval of desired or pertinent information becomes an increasingly difficult task. Particularly difficult is the problem of associating the intended meanings of queries with their textual representations in domains where different meanings might have nearly identical textual representations. This motivates the search for document features, other than words, that are able to express semantic relationships. We present a method for using character patterns in the space of non-words as a document feature that aids in distinguishing the semantics of Web documents. We test the value of such a concept by devising non-word patterns through observation. We then develop an automated method for learning non-word patterns from a corpus of documents. Finally, through a series of document classification experiments, we show the pertinence of non-word patterns in document classification.
Details
- Title
- Exploring potential improvements to term-based clustering of web documents
- Creators
- Damir Aračić
- Contributors
- Scott Andrew Wallace (Degree Supervisor)
- Awarding Institution
- Washington State University
- Academic Unit
- Electrical Engineering and Computer Science, School of
- Theses and Dissertations
- Master of Science (MS), Washington State University
- Publisher
- Washington State University; Pullman, Wash.
- Identifiers
- 99900525300101842
- Language
- English
- Resource Type
- Thesis