CsPTools
CsPTools is a package of perl utilities that assist in the processing
of corpus files, and especially of the files output by queries with
Beth Randall's CorpusSearch program (now available for download from
this SourceForge page). It was developed in the course of the project
on auxiliary selection in the history of English that I was a part of
at the University of Stuttgart. The package consists of a perl module
and ten scripts that depend on it. Many of the scripts presuppose a
certain way of organizing searches and thus may not be of interest to
everyone, but the most important ones should be useful for just about
anyone working with CorpusSearch. CsPTools is free, open source
software released under the GNU General Public License. A recurring
feature of the scripts, which was particularly simple to implement in
perl, is that most kinds of user input for which it makes sense can be
specified as regular expressions. Brief descriptions of the scripts
follow. For more detailed information see the manual.
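As a rough illustration of what regular-expression input makes possible, the sketch below selects coding strings by pattern. It is written in Python rather than perl and is not taken from CsPTools: the function name `matching_codes` and the colon-separated sample strings are assumptions made up for this example.

```python
import re

def matching_codes(codes, pattern):
    """Return the coding strings that the regular expression matches in full."""
    regex = re.compile(pattern)
    return [code for code in codes if regex.fullmatch(code)]

# Hypothetical multicolumn coding strings, one column per colon-separated field.
sample = ["be:perf:intrans", "have:perf:trans", "be:pass:trans", "have:perf:intrans"]

# Select every string whose second column is "perf" and whose third is
# "intrans", whatever the first column holds.
selected = matching_codes(sample, r"[^:]+:perf:intrans")
print(selected)
```

A single pattern here stands in for what would otherwise be a list of exact strings, which is the convenience the scripts aim for.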
Documentation
The CsPTools manual
is available in [html] and [pdf] formats, and is included in
the full distribution (also in both formats). It contains full
descriptions of all the scripts along with examples of their use, as
well as more extensive installation instructions.
The components of CsPTools
- analyzer
analyzer processes and reports on the coding strings in a
CorpusSearch (CS) output file. It is designed especially to deal with
multicolumn codes and has several options for filtering and
formatting the output, and can even produce simple tables.
- autocs
autocs is purely a front end to CorpusSearch. It allows you to use
one command to run separate searches on multiple files, enforcing
strong naming conventions so that you can tell from the name of an
output file where it came from and what queries were run to obtain it.
- codefinder
codefinder gives you a quick and flexible way to get information
about the coding strings used in a corpus file. In contrast to
analyzer, it is designed primarily to search for errors and monitor
progress while hand-coding files.
- tagfinder
tagfinder allows the user to search for occurrences of words with a
given part-of-speech tag. The results are listed by form, either
alphabetically or in order of frequency, reporting for each form
either the number of occurrences or the line numbers of the actual
occurrences. It has largely been superseded by the lexicon functions
of CS, but still does a few things differently that might make it
useful under certain circumstances (e.g. if you need regular
expressions).
- editcode
You give editcode a regular expression describing coding strings,
and a list of coding files. It then finds all the examples in those
files with a coding string that the regex matches, and opens them for
editing one by one in emacs.
- integratecodes
integratecodes is a tool for transferring coding strings
between corpus files. This makes it possible, e.g., to use
CorpusSearch to extract a subset of the sentences from some (set of)
coded corpus file(s), hand-edit the coding strings that appear
there, and then integrate the changes back into the original
files. See the manual for some examples demonstrating how this can
be useful.
- ipcoding
ipcoding is a
simple script that converts coding files output by earlier versions
of CorpusSearch to the format used in CS version 2. Specifically, it
moves the CODING node inside the IP
node.
- progress
progress is a
very simple script that uses facilities originally created for
codefinder to measure and report on progress in
hand-editing automatically generated coding strings in a file or set
of files.
- next
next is just a time and typo saver for hand-editing coding files.
Using the same facilities as progress, it determines which file in a
directory is the next to be hand-edited and opens it in emacs. When
the user closes emacs, next runs codefinder on the file, reports the
statistics to the user, offers to clean up backup files, and then
opens the next file.
- mvcodh
mvcodh is a very
specialized tool that renames the files in a given directory in
order to facilitate the use of file-naming conventions developed for
progress and next.
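The kind of listing described for tagfinder above can be sketched roughly as follows. This is a Python illustration, not the script's actual code: it assumes bracketed leaves of the form `(TAG word)`, and the name `count_forms` and the sample text are hypothetical.

```python
import re
from collections import Counter

# A leaf is assumed to look like "(TAG word)", with no nested parentheses.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def count_forms(text, tag_pattern):
    """Count the word forms whose tag fully matches tag_pattern (a regex)."""
    tag_regex = re.compile(tag_pattern)
    counts = Counter()
    for tag, word in LEAF.findall(text):
        if tag_regex.fullmatch(tag):
            counts[word.lower()] += 1
    return counts

# Made-up sample text in the assumed bracketed format.
sample = (
    "(IP-MAT (NP-SBJ (PRO he)) (BEP is) (VBN come)) "
    "(IP-MAT (NP-SBJ (PRO she)) (BEP is) (VBN arrived)) "
    "(IP-MAT (NP-SBJ (PRO they)) (HVP have) (VBN come))"
)

counts = count_forms(sample, r"VB.*")

# List the forms by descending frequency (ties broken alphabetically),
# and then alphabetically.
by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
alphabetical = sorted(counts.items())
print(by_frequency)
print(alphabetical)
```

Because the tag is itself a regular expression, a pattern such as `VB.*` covers a whole family of tags in one search.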
Requirements/dependencies
CsPTools requires perl v5.6 or later. editcode and next require
emacs, as do the editing functions of analyzer, though the basic
functions of the latter are independent of it. The scripts have been
developed and used on Linux and Mac OS X, and should also work on
other Unix-like systems. I would be frankly surprised if everything
worked as is on a DOS/Windows system or on Mac OS 9.x or earlier.
However, I would be interested in making them work on such systems
in the future, so if anybody out there actually tries it, let me know
how it turns out.
Downloads and Installation
CsPTools is available as a zip archive containing the entire
package, which consists of the ten tools, the perl module
CsPTools.pm which all of them depend on, and the manual formatted as
PDF and HTML.
At the moment there is no automated
installation program, but installing by hand should be reasonably
painless, and there are instructions for doing so in the manual.
CsPTools v0.3.1 (July 26th, 2005): [CsPTools.zip]