stripping logos from scanned PDF files

Posted: Fri Oct 31, 2008 8:17 pm
by kcleung
I have the public-domain (at least in the US) "violin etudes" series published by CDSM. Its logo is printed on every page of the scanned PDF files.

What is the quickest and cleanest way to remove the logos? I am using a Linux system. Can this process be automated (or at least semi-automated)?

Thanks!

Posted: Fri Oct 31, 2008 9:22 pm
by daphnis
PDFSplitandMerge (remove encryption) -> GIMP (remove watermarks).
While you can automate the removal, it doesn't work very well because the images from which their PDFs are compiled aren't the same canvas size, so while the automated removal may work on the first 4, on the 5th it may remove noteheads, etc.
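
The canvas-size problem described here can be checked before any fixed-region erase is attempted. The following is only a sketch: the function name is made up for illustration, and the commented usage assumes ImageMagick's identify is installed and that the pages have already been exported as image files with placeholder names.

```shell
# Reads one "WIDTHxHEIGHT" string per line on stdin.
# Succeeds (exit 0) only if every line is identical, i.e. all pages
# share a single canvas size. Empty input counts as a failure.
same_canvas_size() {
    distinct=$(sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Hypothetical usage (file names are placeholders):
#   identify -format '%wx%h\n' page_*.tiff | same_canvas_size \
#       && echo "uniform canvas: a fixed-region erase is plausible" \
#       || echo "mixed canvas sizes: check the pages by hand"
```

If the check fails, an erase at fixed coordinates risks hitting noteheads on the odd-sized pages, which is exactly the failure mentioned above.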

Posted: Sat Nov 01, 2008 10:52 pm
by kcleung
Good news:

You can just throw the CDSM / Orchestra Musicians / Violin etude series PDFs directly at GIMP 2.6.1 and it will handle everything in one go (read the PDF, import the pages, erase the logos, convert from RGB to 1-bit BW and save as G4-compressed TIFF files).

Then, once all the pages are done, type in that directory:

tiffcp -c g4 *.tiff output.tiff

tiff2pdf output.tiff > foo.pdf

And it is done! I found out that by converting to BW, the new pdf file only takes a fraction of the space of the original file!

I have already tried the timpani part of the Berlioz Roman

You can now say good-bye to the proprietary PDFSplitandMerge software!

The process time is about 20 seconds per page. If a work has 200 pages in total, it would take a bit more than one man-hour (200 x 20 s is about 67 minutes) to process a complete orchestral work (all movements), which is *much* faster than scanning the parts manually.

So Daphnis and others, would you be interested in joining the IMSLP-Orchestral Musician's series subproject? (see the " IMSLP Buying "Orchestra Musician's" and entering i........" thread in the "copyright related" forum) This proposed subproject focuses on cataloguing the copyright status of everything in the Orchestra Musician's collection (all of it would be PD in the USA), recording which libraries stock the series, and then drawing up a to-do list of the parts that are still missing and need to be uploaded.

Posted: Sat Nov 01, 2008 10:59 pm
by daphnis
kcleung wrote:
The process time is about 20 seconds per page.
That's about 15 seconds per page slower than my method (but using Photoshop). I couldn't begin to process the number of images I deal with on a daily basis at that speed.

kcleung wrote:
convert from RGB to 1-bit BW
None of the CDSM sets I've converted (I own 90% of the volumes) use RGB to store the images.

kcleung wrote:
You can now say good-bye to the proprietary PDFSplitandMerge software!
PDFSaM is free, open source, and built on Java.

kcleung wrote:
would you be interested in joining the IMSLP-Orchestral Musician's series subproject?
I am personally much too busy with my own submissions (including my personal scanning projects and cleanup and submission of French CDSM sets) to take on anything else at this time.

Posted: Sat Nov 01, 2008 11:16 pm
by kcleung
daphnis wrote:
The process time is about 20 seconds per page.
That's about 15 seconds per page slower than my method (but using Photoshop). I couldn't begin to process the number of images I deal with on a daily basis at that speed.
WOW. Does that mean it only takes 5 seconds to process one page (including the mouse clicks, the mouse actions to erase the logos, the mouse clicks for saving the image, etc.)? So do you do this kind of thing via a script? (But you can't automate erasing the logos, since they can be placed in different positions.)

When I quoted 20 seconds, it included all the manual work from opening the source file to having a processed PDF file ready. The limiting factor is how fast I can move my mouse, not the processing speed of the software.

Posted: Sat Nov 01, 2008 11:49 pm
by daphnis
Once I decrypt the PDFs I dump out all the images into a folder, load about 50 images at a time into photoshop, then use a combination of keyboard shortcuts and macros (droplets) to do everything from delete a selection, convert the image, rotate the canvas, correct for pixel aspect ratio, and of course to save the file. It takes about 2 mouse clicks and one keyboard stroke per image to edit and save. Once the image is opened it takes me probably at the most 5 seconds to edit. After every image is cleaned and adjusted, I re-compile them all back into a PDF.
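
The "dump out all the images into a folder" step can also be done from a Linux command line. This is a sketch under assumptions, not the method used above: it relies on pdfimages from poppler-utils being installed, the PDF must already be decrypted, and the function name, the "pages" prefix and the DRY_RUN switch are made up for illustration.

```shell
# Dump every embedded image from one PDF into a folder as TIFF files.
# With DRY_RUN=1 the command is only printed, not executed.
dump_pdf_images() {
    pdf=$1
    outdir=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "pdfimages -tiff $pdf $outdir/pages"
        return 0
    fi
    # pdfimages writes numbered files that start with the given prefix.
    mkdir -p "$outdir" && pdfimages -tiff "$pdf" "$outdir/pages"
}

# Hypothetical usage:
#   dump_pdf_images foo.pdf images
```

Because pdfimages extracts the images as they are stored, the pages keep whatever canvas sizes the original files had.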

Posted: Sun Nov 02, 2008 1:30 am
by Yagan Kiely
kcleung wrote:
So Daphnis and others, would you be interested in joining the IMSLP-Orchestral Musician's series subproject?
It isn't a project yet! I am interested in getting it made into a project, but it isn't one! (And if it were to become one, something of this size would be a 'project', not a 'subproject'.)

Daphnis, is any of that (or similar) possible with gimp (do you know)?

Posted: Sun Nov 02, 2008 1:34 am
by daphnis
Yagan Kiely wrote:
Daphnis, is any of that (or similar) possible with gimp (do you know)?
Yes, it certainly is. It takes longer with Gimp (especially in saving), but is certainly possible. In fact, if I ever get around to finishing it, I'm going to publish here on IMSLP an all-new scanning guide that details the process and methods of scanning, image editing and submission with free software.

Posted: Sun Nov 02, 2008 2:23 am
by Yagan Kiely
Sounds (very) interesting!
Can't wait.

Posted: Sun Nov 02, 2008 5:48 am
by kcleung
I agree. Due to its sheer size, "incorporating OM's orchestral parts" would be a project of its own. However, should this project get support from the Ottawa law school team, to prevent something like UE's cease-and-desist from happening again? Although the legal status of OM is probably much clearer than in the UE case, we had better play it safe on a project of this size.

What do you think?

Posted: Sun Nov 02, 2008 7:45 am
by Yagan Kiely
Legal matters cost a lot.

Posted: Sun Nov 02, 2008 8:09 am
by kcleung
The Ottawa law school team has pledged to support IMSLP's reopening.

What I mean is that we inform the Ottawa law school team about our plan, that's all. Since they are already supporting us, informing them about the plan for the new project would not cost any more (or anything at all).

Posted: Sun Nov 02, 2008 9:06 am
by Yagan Kiely
We can get to that later, anyway.

Posted: Sun Nov 02, 2008 9:06 pm
by Carolus
This is very interesting. I happen to own the "Digital Beethoven Edition", the "Digital Bach Edition" and numerous other CDSM collections. Obviously, we wouldn't need much from the Bach edition, apart from replacing some bad files. We are missing a fair amount of the old Breitkopf Beethoven edition, though.

I wonder if we should consider setting up an FTP server where unlocked score files with logos present could be stored for processing. That way, with several people working together, a considerable number of titles could be processed using some of the methods outlined above and ultimately added to the collection.

guide to process OM files in quickest way w/o proprietary sw

Posted: Sun Nov 02, 2008 9:06 pm
by kcleung
Here is the quickest and the simplest way to process files (strip logo from every page) from the Orchestra Musician's series under Linux:

1. Put each parts file in its own subdirectory and change to that subdirectory

2. Run as one line:

gs -sDEVICE=tiffg4 -dNOPAUSE -r300 -dBATCH -sPAPERSIZE=a4 -sOutputFile=output_%04d.tiff foo.pdf

This would open up the pdf files and perform all the necessary conversions (300dpi 1-bit BW A4 sized tiff images)

3. Run GIMP. In the "open" window, go to the subdirectory, highlight 5-10 tiff files at a time and click "open". This reduces the number of mouse actions required (the limiting factor for processing speed).

4. Erase the logo with the eraser and close the file. Make sure you click "save" when it asks whether you want to save.

You only have to set up the eraser once. From then on, it takes *three clicks* (including the eraser action) to process each page, which cuts the processing time for each file to 5 seconds! :)

5. At the command prompt, change to the subdirectory and concatenate all processed images by running:
tiffcp -c g4 *.tiff output.tiff

6. Finally, convert output.tiff back to pdf and send the pdf file back to the parent directory with:
tiff2pdf output.tiff > ../foo.pdf
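
The command-line parts of the steps above (2, 5 and 6) could be wrapped in a small script along these lines. Treat it as a sketch: the gs and tiffcp options are the ones quoted in the steps, but the function names and the DRY_RUN switch are made up for illustration, tiff2pdf's -o option is swapped in for the shell redirect, and the merge matches output_*.tiff rather than *.tiff so that a second run does not feed the previous output.tiff back into itself.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: rasterize every page of the given PDF to a 300 dpi,
# 1-bit, G4-compressed TIFF named output_0001.tiff, output_0002.tiff, ...
rasterize() {
    run gs -sDEVICE=tiffg4 -dNOPAUSE -r300 -dBATCH -sPAPERSIZE=a4 \
        -sOutputFile=output_%04d.tiff "$1"
}

# Steps 5 and 6: merge the cleaned pages in file-name order, then
# write the finished PDF (name given as $1) into the parent directory.
rebuild() {
    run tiffcp -c g4 output_*.tiff output.tiff &&
    run tiff2pdf -o "../$1" output.tiff
}

# Hypothetical usage, from inside the part's subdirectory:
#   rasterize foo.pdf      # then erase the logos in GIMP (steps 3 and 4)
#   rebuild foo.pdf
```

One caveat to verify locally: Ghostscript normally follows the page size stored in each PDF page, so -sPAPERSIZE=a4 on its own may not force every TIFF to A4. Its documentation describes -dFIXEDMEDIA and -dPDFFitPage for that purpose.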

Now it is all done! For a work with 200 pages of parts altogether, it would take 20 minutes to process (including setting up directories etc.), and if it also has a full score, then the whole thing (incl. full score) should take no longer than 30 minutes.

Could you please review this plan and, if it works for you too, incorporate this guide into the new logo-stripping guide?
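
The time estimate is just page count times per-page time. As a quick worked check (the function name is made up for illustration), 200 pages at 5 seconds each comes to about 17 minutes of pure editing, which leaves a few minutes of the quoted 20 for setting up directories:

```shell
# Whole minutes (rounded up) for a page count and a per-page time in seconds.
estimate_minutes() {
    pages=$1
    seconds_per_page=$2
    echo $(( (pages * seconds_per_page + 59) / 60 ))
}

estimate_minutes 200 5   # prints 17
```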