From a colleague in corpus linguistics
---------- Forwarded message ----------
From: Catherine Smith casmith@morris.umn.edu
Date: Mon, Jan 2, 2012 at 2:27 PM
Subject: Re: [dm-l] Automatically scan two text corpora for quotations
To: Stephen Martin scmartin@morris.umn.edu
Yes, off the top of my head, I can think of three options:
1. MonoConc (http://www.athel.com/mono.html) by Michael Barlow (http://www.michaelbarlow.com/) should be able to handle this kind of task.

2. Any scholar can learn to program in the open-source language R (http://zoonek2.free.fr/UNIX/48_R/02.html), on which Stefan Gries (http://www.linguistics.ucsb.edu/faculty/stgries/) has written the book _Quantitative Corpus Linguistics with R_ (http://books.google.com/books/about/Quantitative_corpus_linguistics_with_R.h...). Of course, other languages work well, too (e.g., Pascal, Perl).

3. Ask an NAU (Northern Arizona University) corpus linguist to write a program for you. I might be able to do it, for example, though there would be a delay, since another scholar's project is occupying me at the moment; the scholar could contact Doug Biber (http://jan.ucc.nau.edu/biber/), or I could do so on his or her behalf, and ask whether anyone already has such a program or can write one. Of course, students or graduates of other computational linguistics programs may be happy to offer assistance, too.
It should be noted that the above approaches would probably rely on either morphemic roots or "wild card" symbols, which would let the analyst search for strings of characters and for variations in those strings. If that approach proved inadequate, there might be a delay while a part-of-speech (POS) tagger was located, and a substantial delay if a POS tagger needed to be written.
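For instance, here is a minimal sketch of such a wild-card search, in Python rather than the R or Perl mentioned above (the file name and the quotation pattern are invented for illustration):

```python
import re

# Hypothetical plain-text file holding one of the corpora.
with open("corpus_a.txt", encoding="utf-8") as f:
    text = f.read()

# A wild-card-style pattern for one quotation: the root "lov" plus any
# ending matches "love", "loved", "loveth", etc., and \s+ tolerates
# line breaks and extra spacing between words.
pattern = re.compile(r"\blov\w*\s+conquers\s+all\b", re.IGNORECASE)

for match in pattern.finditer(text):
    # Print each hit with a little surrounding context, KWIC-style.
    start, end = match.start(), match.end()
    print(text[max(0, start - 30):end + 30].replace("\n", " "))
```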
In addition, if the digital versions of the texts have been left intact, then it is likely that they contain variations in spelling. So, the first step would be to produce a copy of the corpus in which spelling variations were located and replaced with consistent spelling. This could be done using a "search-and-replace" function to capture as many occurrences as possible.
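As a small illustration of that first step, again in Python (the variant table here is invented; a real one would be compiled by inspecting the corpus itself):

```python
import re

# Hypothetical table mapping spelling variants to consistent forms.
VARIANTS = {
    "honour": "honor",
    "shew": "show",
    "musick": "music",
}

def normalize(text):
    """Replace each known variant, as a whole word, with its standard form."""
    for old, new in VARIANTS.items():
        text = re.sub(r"\b" + re.escape(old) + r"\b", new, text, flags=re.IGNORECASE)
    return text

with open("corpus_a.txt", encoding="utf-8") as f:
    raw = f.read()

# Write the normalized copy to a new file, leaving the original intact.
with open("corpus_a_normalized.txt", "w", encoding="utf-8") as f:
    f.write(normalize(raw))
```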
Furthermore, it should be understood that these approaches may capture only a portion of the relevant occurrences (e.g., 85%), which could then be used for further analysis. Depending on the extent of linguistic variation in the original corpus, it may or may not be possible to extract 100% of the occurrences.
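One way to estimate that percentage, sketched in Python under the assumption that a small sample of the corpus has been checked by hand (the counts below are invented):

```python
# Hypothetical counts from a hand-checked sample: a reader found 40
# quotations in the sample, of which the program retrieved 34.
found_by_hand = 40
found_by_program = 34

capture_rate = found_by_program / found_by_hand
print(f"Estimated capture rate: {capture_rate:.0%}")  # prints "Estimated capture rate: 85%"
```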
In any case, I should think that the project is most do-able and most interesting. So, I hope these initial thoughts are helpful, and I would be happy to be of further assistance.
Catherine
Catherine Smith, Ph.D., Applied Linguistics
Lecturer, Discipline of English
Associate Editor, _Second Language Writing News_, TESOL
Personal Web Site: sites.google.com/site/casmith099
Email: casmith@umn.edu
Surface Mail: University of Minnesota, Morris, Division of the Humanities, 600 E 4th Street, Morris, MN 56267 USA