Hi, I have a question somebody here may know the answer to.
A colleague of mine is scanning back issues of a journal she edits for online publication. She is using PDF with OCR to provide full-text searchability a la JSTOR. The issue is that the file sizes are really quite different. A 30pp article from a 1920s issue of Speculum, for example, seems to come in at about 1.5-2.0 MB; 5-6 page articles in my colleague's journal are coming in at about the same size, and other files are well over 4 MB.
I haven't seen the settings used for the scanning or OCR yet, but the JSTOR and her files appear to be about the same resolution (eyeballing the page size when things are set to 100%). They look like they are being scanned in B&W, but I haven't checked (perhaps a colour channel is adding to the bulk?). Any other suggestions for things that might be causing the files to be abnormally large?
-dan
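A rough way to gauge how much a stray greyscale or colour channel could matter is to compare the uncompressed size of one page in each scan mode. The sketch below is arithmetic only; the 6 x 9 inch page and 300 dpi are assumed values, not settings reported in this thread, and the function name is made up for illustration. Final PDF sizes depend heavily on the compression used, so only the ratios between the modes are meaningful.

```python
# Rough arithmetic only: uncompressed size of one scanned page.
# ASSUMPTIONS (not from the thread): a 6 x 9 inch page scanned at 300 dpi.

def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed image size in bytes for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8

# Bits stored per pixel in each common scan mode.
MODES = {
    "bitonal (1-bit B&W)": 1,
    "greyscale (8-bit)": 8,
    "colour (24-bit RGB)": 24,
}

if __name__ == "__main__":
    for name, bits in MODES.items():
        size = raw_page_bytes(6, 9, 300, bits)
        print(f"{name:22s} {size / 1_000_000:6.2f} MB per page before compression")
```

Under these assumptions a greyscale page carries 8 times the raw data of a bitonal page, and a colour page 24 times, so a file that merely looks black and white on screen can still be far larger than a true 1-bit scan.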
JSTOR can do that to you - I had a version of that today, trying to print a 21" x 12" menu on a regular sheet of paper. I finally found the correct "fit to page" box: there were 4 different places that purported to be what I wanted, and each of them had to be changed by hand.
I think it will depend a great deal on what software is being used to create the PDFs, and from what source (is she scanning TIFFs and creating PDFs from them, scanning directly to PDF using Acrobat, or something else?). It appears that JSTOR is using Atypon Systems' PDFPlus software (http://www.atypon.com/solutions/pdfplus/) to generate its PDFs on the fly, which might explain why theirs are smaller (I don't know anything about this software, but smaller sizes seem to be one of its selling points).
________________________________________
From: dm-l-bounces@uleth.ca [dm-l-bounces@uleth.ca] On Behalf Of NORMAN [normanhinton@sbcglobal.net]
Sent: Friday, May 21, 2010 8:40 PM
To: dm-l@uleth.ca
Subject: Re: [dm-l] JSTOR and PDF Sizes (TAN)
Digital Medievalist -- http://www.digitalmedievalist.org/ Journal: http://www.digitalmedievalist.org/journal/ Journal Editors: editors _AT_ digitalmedievalist.org News: http://www.digitalmedievalist.org/news/ Wiki: http://www.digitalmedievalist.org/wiki/ Twitter: http://twitter.com/digitalmedieval Facebook: http://www.facebook.com/group.php?gid=49320313760 Discussion list: dm-l@uleth.ca Change list options: http://listserv.uleth.ca/mailman/listinfo/dm-l
In Adobe Acrobat Professional (I have version 8), there is a panel under the "Advanced" menu called the "PDF Optimizer" which allows you to figure out where the bulk is and add some compression, delete croppings, etc. I've had some luck squeezing files using that.
Jesse
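For anyone without Acrobat Professional, a crude first look at where the bulk might be is to count which compression filters and bit depths a PDF declares. The sketch below is not what the PDF Optimizer does; it is a stdlib-only heuristic that scans the raw bytes of the file, and the function name `census` is made up for illustration. It assumes the image dictionaries are stored as plain text in the file, which should hold for image streams, but the counts also include non-image streams (FlateDecode in particular is used for page content and fonts), so treat the result as a hint only.

```python
import re
import sys
from collections import Counter

def census(pdf_bytes):
    """Count declared stream filters and image bit depths in raw PDF bytes.

    Heuristic only: this is a plain pattern match over the file,
    not a real PDF parser.
    """
    # Filter names such as /CCITTFaxDecode, /JBIG2Decode, /DCTDecode, /FlateDecode.
    filters = Counter(
        name.decode("ascii")
        for name in re.findall(rb"/(\w+Decode)\b", pdf_bytes)
    )
    # /BitsPerComponent 1 suggests bitonal images; 8 suggests greyscale or colour.
    depths = Counter(
        int(value)
        for value in re.findall(rb"/BitsPerComponent\s+(\d+)", pdf_bytes)
    )
    return filters, depths

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        found_filters, found_depths = census(handle.read())
    print("filters:", dict(found_filters))
    print("bits per component:", dict(found_depths))
```

Saved under a name of your choosing and run with a PDF path as its only argument, a file dominated by DCTDecode (JPEG) at 8 bits per component is probably storing greyscale or colour page images, while CCITTFaxDecode or JBIG2Decode at 1 bit per component points to true bitonal scans.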
On Fri, May 21, 2010 at 4:00 PM, O'Donnell, Dan daniel.odonnell@uleth.ca wrote:
-- Daniel Paul O'Donnell Professor of English University of Lethbridge
Chair and CEO, Text Encoding Initiative (http://www.tei-c.org/) Co-Chair, Digital Initiatives Advisory Board, Medieval Academy of America President-elect (English), Society for Digital Humanities/Société pour l'étude des médias interactifs (http://sdh-semi.org/) Founding Director (2003-2009), Digital Medievalist Project (http://www.digitalmedievalist.org/)
Vox: +1 403 329-2377 Fax: +1 403 382-7191 (non-confidential) Home Page: http://people.uleth.ca/~daniel.odonnell/
Fabricando fabri fimus. Sent from my iPad
On May 24, 2010, at 6:23 PM, Jesse Hurlbut jesse_hurlbut@byu.edu wrote:
Apologies for sending the blank.
I wanted to add that I get even better compression now using v9 Pro. I've seen my scans go from 16 MB to a few hundred KB!
Per Jesse's instructions, click Advanced, then PDF Optimizer. You can even create a batch script to automate the optimization.
Keith