stripping logos from scanned PDF files
Moderator: kcleung
I have the public-domain (at least in the US) "violin etudes" series published by CDSM. Its logo is printed on every page of the scanned PDF files.
What is the quickest and cleanest way to remove the logos? I am using a Linux system. Can this process be automated (or at least semi-automated)?
Thanks!
PDFSplitandMerge (remove encryption) -> GIMP (remove watermarks).
While you can automate the removal, it doesn't work very well because the images from which their PDFs are compiled aren't the same canvas size, so while the automated removal may work on the first 4, on the 5th it may remove noteheads, etc.
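One way to see in advance whether a fixed erase region is safe is to check that every page has the same pixel dimensions. The following is only a sketch: it assumes the pages have already been exported as TIFF files and that `tiffinfo` (from the libtiff tools) is installed; the function name `distinct_page_sizes` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print how many distinct WIDTHxHEIGHT values occur
# among the TIFF files given as arguments.
distinct_page_sizes() {
    for f in "$@"; do
        # tiffinfo prints a line of the form
        #   "Image Width: W Image Length: H"
        # so fields 3 and 6 are the width and height of the first page.
        tiffinfo "$f" 2>/dev/null |
            awk '/Image Width:/ { print $3 "x" $6; exit }'
    done | sort -u | wc -l | tr -d ' '
}
```

Calling `distinct_page_sizes *.tiff` and getting anything other than 1 suggests that a fixed rectangle may land on music on some pages, so those pages are better handled by hand.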
Good news:
You can just throw CDSM / Orchestra Musician's / violin etude series PDFs directly at GIMP 2.6.1 and do everything there (read the PDF, import the pages, erase the logos, convert from RGB to 1-bit BW and save as G4-compressed TIFF files) in one go.
Then, once all the pages are done, type the following in that directory:
tiffcp -c g4 *.tiff output.tiff
tiff2pdf output.tiff > foo.pdf
And it is done! I found out that by converting to BW, the new pdf file only takes a fraction of the space of the original file!
I have already tried the timpani part of the Berlioz Roman
You can now say good-bye to the proprietary PDFSplitandMerge software!
The processing time is about 20 seconds per page. If a work has 200 pages in total, it would take a bit more than one man-hour (200 × 20 s ≈ 67 minutes) to process a complete orchestral work (all movements), which is *much* faster than scanning the parts manually.
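As a rough check of that estimate, the arithmetic can be done in the shell. This is just a sketch; the function name `estimate_minutes` is invented here, and it rounds up to whole minutes.

```shell
#!/bin/sh
# Hypothetical helper: minutes of manual work for a set of pages.
#   $1 = number of pages, $2 = seconds spent per page
estimate_minutes() {
    # Integer division, rounded up to the next whole minute.
    echo $(( ($1 * $2 + 59) / 60 ))
}

estimate_minutes 200 20   # prints 67
```

So 200 pages at 20 seconds each comes to a little over one hour of hands-on time.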
So Daphnis and others, would you be interested in joining the IMSLP-Orchestral Musician's series subproject? (See the " IMSLP Buying "Orchestra Musician's" and entering i........" thread in the "copyright related" forum.) This proposed subproject focuses on cataloging the copyright status of everything in the Orchestra Musician's collection (all of it would be PD in the USA), cataloging which libraries stock the series, and then drawing up a to-do list of the parts that are still missing and need to be uploaded.
Quote: The process time is about 20 seconds per page.
That's about 15 seconds per page slower than my method (but using Photoshop). I couldn't begin to process the number of images I deal with on a daily basis at that speed.
Quote: convert from RGB to 1-bit BW
None of the CDSM sets I've converted (I own 90% of the volumes) use RGB to store the images.
Quote: You can now say good-bye to the proprietary PDFSplitandMerge software!
PDFSaM is free, open source, and built on Java.
Quote: would you be interested in joining the IMSLP-Orchestral Musician's series subproject?
I am personally much too busy with my own submissions (including my personal scanning projects and cleanup and submission of French CDSM sets) to take on anything else at this time.
daphnis wrote:
Quote: The process time is about 20 seconds per page.
That's about 15 seconds per page slower than my method (but using Photoshop). I couldn't begin to process the number of images I deal with on a daily basis at that speed.
Wow. Does that mean it only takes 5 seconds to process one page (including mouse clicks, mouse actions to erase the logos, mouse clicks for saving the image, etc.)? So do you do this kind of thing via a script? (But you can't automate erasing the logos, since they can be stored at different places.)
When I quote 20 seconds, that includes all manual work from opening the source file to having a processed PDF file ready. The limiting factor is how fast I can move my mouse, not the processing speed of the software.
Once I decrypt the PDFs, I dump out all the images into a folder, load about 50 images at a time into Photoshop, then use a combination of keyboard shortcuts and macros (droplets) to do everything from deleting a selection and converting the image to rotating the canvas, correcting for pixel aspect ratio, and of course saving the file. It takes about 2 mouse clicks and one keystroke per image to edit and save. Once the image is opened, it takes me probably 5 seconds at most to edit. After every image is cleaned and adjusted, I re-compile them all back into a PDF.
-
- Site Admin
- Posts: 1139
- Joined: Sun Jan 14, 2007 8:16 am
- notabot: YES
- notabot2: Bot
- Location: Perth, Australia
- Contact:
Quote: So Daphnis and others, would you be interested in joining the IMSLP-Orchestral Musician's series subproject?
It isn't a project yet! I am interested in getting it made into a project, but it isn't one yet! (And if it were to become one, something of this size would be a 'project', not a 'subproject'.)
Daphnis, is any of that (or similar) possible with GIMP (do you know)?
Quote: Daphnis, is any of that (or similar) possible with gimp (do you know)?
Yes, it certainly is. It takes longer with GIMP (especially in saving), but is certainly possible. In fact, if I ever get around to finishing it, I'm going to publish here on IMSLP an all-new scanning guide that details the process and methods of scanning, image editing and submission with free software.
I agree. Due to its sheer size, "incorporating OM's orchestral parts" would be a project of its own. However, should this project get support from the Ottawa law school team, to prevent something like UE's cease-and-desist from happening again? Although the legal status of OM is probably much clearer than in the UE case, we had better play it safe on a project of this size.
What do you think?
When IMSLP reopened, the Ottawa law school team pledged to support its re-opening.
What I mean is that we inform the Ottawa law school about our plan, that's all. Since they are already supporting us, it would not cost any more (or anything at all) by informing them about the plan of the new project.
-
- Site Admin
- Posts: 2249
- Joined: Sun Dec 10, 2006 11:18 pm
- notabot: 42
- notabot2: Human
- Contact:
This is very interesting. I happen to own the "Digital Beethoven Edition", the "Digital Bach Edition" and numerous other CDSM collections. Obviously, we wouldn't need much from the Bach edition, apart from replacing some bad files. We are missing a fair amount of the old Breitkopf Beethoven edition, though.
I wonder if we should consider setting up an FTP server where unlocked score files with logos present could be stored for processing. That way, with several people working together, a considerable number of titles could be processed using some of the methods outlined above and ultimately added to the collection.
guide to process OM files in quickest way w/o proprietary sw
Here is the quickest and simplest way to process files (strip the logo from every page) from the Orchestra Musician's series under Linux:
1. Put each parts file in its own subdirectory and change to that subdirectory.
2. Run as one line:
gs -sDEVICE=tiffg4 -dNOPAUSE -r300 -dBATCH -sPAPERSIZE=a4 -sOutputFile=output_%04d.tiff foo.pdf
This opens the PDF file and performs all the necessary conversions (to 300 dpi, 1-bit BW, A4-sized TIFF images).
3. Run GIMP. In the "Open" window, go to the subdirectory, highlight 5-10 TIFF files at a time and click "Open". This reduces the amount of mouse activity required (the limiting factor for processing speed).
4. Erase the logo with the eraser and close the file, making sure you click "Save" when asked whether you want to save.
You only have to set up the eraser once; from then on, it takes *three clicks* (including the eraser action) to process each page, which cuts the processing time per file to 5 seconds!
5. At the command prompt, change to the subdirectory and concatenate all processed images by running:
tiffcp -c g4 output_*.tiff output.tiff
6. Finally, convert output.tiff back to PDF and send the PDF file to the parent directory with:
tiff2pdf output.tiff > ../foo.pdf
Now it is all done! For a work that has 200 pages of parts altogether, it would take about 20 minutes to process (including setting up directories, etc.), and if it also has a full score, the whole thing (including the full score) should take no longer than 30 minutes.
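The non-interactive steps above (2, 5 and 6) could be wrapped in two small shell functions, roughly as sketched below. The function names `rasterize_pdf` and `merge_pages` are made up for illustration; the commands and flags are the ones from the guide. The merge deliberately matches only the output_NNNN.tiff pages, so that an output.tiff left over from an earlier run is never fed back in as input. The GIMP step in between stays manual.

```shell
#!/bin/sh
# Sketch only: wraps the Ghostscript and libtiff commands from the guide.

# Step 2: rasterize every page of the PDF given as $1 into
# output_0001.tiff, output_0002.tiff, ... (300 dpi, 1-bit, G4-compressed).
rasterize_pdf() {
    gs -sDEVICE=tiffg4 -dNOPAUSE -dBATCH -r300 -sPAPERSIZE=a4 \
       -sOutputFile=output_%04d.tiff "$1"
}

# Steps 5 and 6: join the cleaned pages into one multi-page TIFF,
# then convert it to the PDF named by $1.
merge_pages() {
    tiffcp -c g4 output_*.tiff output.tiff &&
        tiff2pdf output.tiff > "$1"
}
```

Inside a part's subdirectory, usage would be `rasterize_pdf foo.pdf`, then erasing the logos in GIMP, then `merge_pages ../foo.pdf`.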
Could you please review this plan and, if it works out, incorporate it into the new logo-stripping guide?
Last edited by kcleung on Mon Nov 03, 2008 6:53 pm, edited 3 times in total.