


TAPoR Toronto
The Text Analysis Portal for Research (TAPoR) at the University of Toronto consists of laboratories for lexical text analysis at Robarts Library, and for usability and interaction studies at the Faculty of Information Studies. TAPoR Toronto is one node of the central TAPoR project, which is based at McMaster University and consists of a network of six of the leading Humanities computing centres in Canada. The other five are at McMaster, the University of Victoria (in collaboration with Malaspina UC), the University of Alberta, Université de Montréal (law), and the University of New Brunswick.

Text analysis, beginning in the late nineteenth century, applied statistics on the frequency and distribution of words and phrases in literary works to such questions as the authorship of a text, its date, and its indebtedness to sources. The publication of the Brown Corpus of American English in the 1960s created corpus linguistics, the study of a language from statistical samples of its living discourse, as a field for the application of text-analysis tools. Eventually, computer programs other than statistical systems were added to the technologies used by literary and linguistic text analysis. Such tools now include interactive concordancers, XML browsers, and usability software. As these tools have gained functionality, text analysis has grown with them. It now encompasses non-statistical forms of textual study, such as editorial encoding, stemmatic analysis, database creation, and free text-exploration. The availability of huge libraries of electronic texts has also encouraged students of language and literature who do not understand statistics to use ready-made, often simple computer programs as an aid to manual critical reading.
In the TAPoR Toronto laboratory, we approach text analysis lexically by applying database and textbase technologies to create both new dictionaries and editions or databases of old dictionaries. Initially, word- and phrase-concordances help us determine the properties of semantically dense glossaries and lexicons. We then build encoding languages for these works that enable us to import them into databases. These two technologies, database and textbase, are converging. SQL databases enable us to predict and control complex researcher queries and to integrate search tools with scholarly works such as critical editions and bibliographies. Textbases prepared by software like XTeXT allow for less editorial control, but they make possible highly sophisticated information retrieval through algorithms and techniques developed in computer science, and they also enable lexical databases to draw from massive collections of e-texts.
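The first steps of such a workflow, a word concordance followed by import into an SQL database, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used in the laboratory: the function names `kwic` and `load_concordance`, the table layout, the three-word context window, and the sample sentence are all hypothetical, and an in-memory SQLite database stands in for whatever database system a real project would use.

```python
import re
import sqlite3


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    Hypothetical helper: tokenizes on runs of letters and apostrophes,
    lowercases everything, and keeps `width` words on either side.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def load_concordance(rows):
    """Store concordance rows in an in-memory SQLite table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concordance (left_ctx TEXT, headword TEXT, right_ctx TEXT)"
    )
    conn.executemany("INSERT INTO concordance VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Invented sample text, for illustration only.
sample = (
    "A glossary explains a hard word. "
    "A lexicon lists each word and each sense of the word."
)
rows = kwic(sample, "word")
conn = load_concordance(rows)

# A parameterized query of the kind an SQL database lets an editor control.
count = conn.execute(
    "SELECT COUNT(*) FROM concordance WHERE headword = ?", ("word",)
).fetchone()[0]
print(count)
```

Once the contexts are rows in a table, the queries open to a researcher are fixed by the schema, which is the kind of editorial control the paragraph above attributes to SQL databases.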
Our lexical projects use computer technologies to make possible the creation of scholarly resources that would have been impossible with manual methods. Dictionaries, as general-purpose tools for all disciplines, truly belong on the Web, where anyone from a high-school student in Whitehorse to a corpus linguist in Oslo can use them simultaneously. Lexical research infrastructure such as our projects provide, for English and French (the two principal languages of Canada), contributes to the semantic Web.
Ian Lancashire
University of Toronto
August 19, 2005