To organize the extracted and documented concepts, first identify and name a limited set of unifying dimensions or perspectives. Each concept
is then associated with one of these taxonomic dimensions. Organizing concepts in this way requires a controlled vocabulary, whose purpose is to organize
information and to provide the terminology used to catalog and retrieve it. While capturing the richness of variant terms, a controlled vocabulary also promotes consistent use of preferred terms
and the assignment of the same terms to similar content. Constructing a controlled vocabulary is a deliberate process with the following steps:
- Identify terms
- Pull together synonyms, disambiguate homographs
- Identify relationships between terms
- Make those relationships explicit in your metadata
A controlled vocabulary can take one of several forms, ranging from a simple term list to a fully structured thesaurus.
Term list
The simplest kind of controlled vocabulary is a flat term list, sometimes called a ‘pick list’. Term lists are often used for administrative and structural metadata elements,
such as a list of possible file formats, rights statuses or retention statuses. Term lists are also used in descriptive metadata elements, such as content type, language or
department/source. Controlled vocabularies of subject terms, however, may be too large and complex for simple term lists. Term lists are often displayed as drop-down boxes for a field, but can
also be displayed as buttons or check boxes.
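As a minimal sketch of how a term list constrains a metadata field, the following Python snippet checks a value against a flat pick list. The list name `RETENTION_STATUS`, its values, and the function `validate_term` are hypothetical and chosen only for illustration.

```python
# Hypothetical pick list for a "retention status" metadata field.
RETENTION_STATUS = ("active", "archived", "scheduled for deletion")


def validate_term(value, term_list):
    """Return the canonical term that matches `value`, or raise ValueError.

    Matching ignores case and surrounding whitespace, so " Archived "
    is accepted and returned as the canonical "archived".
    """
    normalized = value.strip().lower()
    for term in term_list:
        if term.lower() == normalized:
            return term
    raise ValueError(f"{value!r} is not in the term list: {list(term_list)}")
```

A drop-down box enforces the same constraint in the user interface; a check like this one is still useful when metadata arrives from imports or scripts rather than from a form.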
Authority file
An authority file is a controlled vocabulary that includes synonyms or variants for each term. These function as cross-references that guide the user from a ‘non-preferred term’ variant
to the equivalent ‘preferred term’. In addition, authority files may provide a note for each term naming the authoritative source for the preferred term. The designation ‘authority file’
is used mostly for vocabularies of named entities (proper nouns), and authority files are often simply called ‘controlled vocabularies’.
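The cross-reference behaviour described above can be sketched in a few lines of Python. This is an illustrative model, not a standard format: the class `AuthorityRecord`, the functions `build_index` and `resolve`, and the sample records with their placeholder source notes are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    preferred: str                                # the preferred term
    variants: list = field(default_factory=list)  # non-preferred variants
    source: str = ""                              # note on the authoritative source


def build_index(records):
    """Map every preferred term and variant (lowercased) to its record."""
    index = {}
    for record in records:
        for label in [record.preferred, *record.variants]:
            key = label.lower()
            # A label that points to two different records is a homograph
            # that must be disambiguated before it can be indexed.
            if key in index and index[key] is not record:
                raise ValueError(f"ambiguous label: {label!r}")
            index[key] = record
    return index


def resolve(label, index):
    """Return the preferred term for `label`, or None if it is unknown."""
    record = index.get(label.strip().lower())
    return record.preferred if record else None


# Illustrative records only; the source notes are placeholders.
RECORDS = [
    AuthorityRecord("United Nations", ["UN", "U.N."], source="(hypothetical source note)"),
    AuthorityRecord("World Health Organization", ["WHO"], source="(hypothetical source note)"),
]
INDEX = build_index(RECORDS)
```

With this index, `resolve("U.N.", INDEX)` leads from the variant to the preferred term "United Nations", while an unknown label yields `None` so that the caller can decide how to handle it.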
Thesaurus
In its classic sense, a thesaurus is a kind of dictionary that lists synonyms or alternate expressions for each term, and possibly antonyms as well. An information/content retrieval
thesaurus shares this characteristic of listing similar terms at each controlled vocabulary term entry. The difference is that in a dictionary thesaurus any of the associated terms might
be used in place of the entry term depending on the specific context, which the user needs to consider in each case. The content retrieval thesaurus, on the other hand, is designed
to work in all contexts, regardless of a specific term usage or document. Its synonyms or near-synonyms must therefore be suitably equivalent in all circumstances. A content retrieval thesaurus
is also more structured than either a dictionary thesaurus or other types of controlled vocabularies, because it provides information about each term and its relationships to other
terms within the same thesaurus. In addition to specifying which terms can be used as synonyms (labeled ‘used for’ or ‘used from’, abbreviated UF), a thesaurus also indicates which terms are more specific
(narrower terms), which are broader (broader terms), and which are related non-hierarchically (related terms). Where needed, a term also has a scope note that explains its usage.
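The relationships a thesaurus records can be modelled as a small data structure. The sketch below is one possible in-memory representation, not a standard interchange format; the class and method names and the sample terms are hypothetical. The comments use the conventional abbreviations UF, BT, NT, RT and SN for used for, broader term, narrower term, related term and scope note.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusTerm:
    label: str
    used_for: set = field(default_factory=set)  # UF: non-preferred synonyms
    broader: set = field(default_factory=set)   # BT: broader terms
    narrower: set = field(default_factory=set)  # NT: narrower terms
    related: set = field(default_factory=set)   # RT: non-hierarchical relations
    scope_note: str = ""                        # SN: optional usage note


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def add(self, label, scope_note=""):
        """Return the term for `label`, creating it if necessary."""
        term = self.terms.setdefault(label, ThesaurusTerm(label))
        if scope_note:
            term.scope_note = scope_note
        return term

    def add_synonym(self, preferred, variant):
        self.add(preferred).used_for.add(variant)

    def add_hierarchy(self, broader, narrower):
        # Record both directions so that BT and NT stay reciprocal.
        self.add(broader).narrower.add(narrower)
        self.add(narrower).broader.add(broader)

    def add_related(self, a, b):
        # Associative relationships are symmetric.
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def all_narrower(self, label):
        """Return every term below `label` in the hierarchy."""
        found, stack = set(), [label]
        while stack:
            for child in self.terms[stack.pop()].narrower:
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found


def build_example():
    """Build a tiny illustrative thesaurus (sample terms are made up)."""
    t = Thesaurus()
    t.add("Vehicles", scope_note="Use for powered means of transport.")
    t.add_hierarchy("Vehicles", "Cars")
    t.add_hierarchy("Cars", "Electric cars")
    t.add_synonym("Cars", "Automobiles")
    t.add_related("Cars", "Roads")
    return t
```

Storing each hierarchical link in both directions costs a little redundancy but keeps lookups simple: a search on "Vehicles" can be expanded to all of its narrower terms with `all_narrower("Vehicles")`, and any term can report its broader terms directly.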