Program to strip CDSM logos

Posted: Sat Aug 08, 2009 12:21 am
by Mazin
A colleague and I cooked up a program to strip CDSM logos only from the CDSM scans.
Written in C++, it can detect and white out two logo sizes: the standard logo and a slightly smaller version. It's decently fast. Using a quad-core processor, I was able to process 61 pages in 23 secs.
Source code is provided under the GPL. Download from
http://files.aztekera.com/music/magic20090807.zip

Compare
ftp://imslp.org/logo_infested/CDSM/Cell ... 0cello.pdf

to the automated result
http://files.aztekera.com/music/bach.tiff

But what's that you say? My program can't even glob filenames? That's why I also cooked up parallelize.pl, a perl script that can parallelize any terminal command!

parallelize.pl

Code:

#!/usr/bin/perl
#
# I, Eric Jiang, hereby release this script into the public domain.
#
# This script requires the Parallel::ForkManager module to run.
#
use strict;
use warnings;
use Parallel::ForkManager;

if (@ARGV < 3) {
	print <<USAGE;
USAGE: perl parallelize.pl num cmd glob

"GLOB" is substituted with the name of each file

Example:
	parallelize.pl 4 "mogrify -negate GLOB" "lol*.pbm"
USAGE
	exit;
}

my ($workers, $template, $pattern) = @ARGV;

# Expand the quoted pattern here; the calling shell never sees it.
my @files = glob($pattern);
my $pm = Parallel::ForkManager->new($workers);

for my $file (@files) {
	# In the parent, start() returns the child's PID, so skip to the next file.
	$pm->start and next;

	# Child process: put the file name into the command and run it.
	(my $command = $template) =~ s/GLOB/$file/g;

	print "$command\n";
	system($command);

	$pm->finish;
}
$pm->wait_all_children;
print "Finished " . scalar(@files) . " tasks.\n";
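
For what it's worth, a similar fan-out is possible without Perl on systems whose xargs supports -0 and -P (GNU and BSD xargs both do). This is only a sketch and not part of the posted tools: parallel_each is a made-up helper name, and the command gets each file name appended as its last argument rather than substituted for GLOB.

```shell
# parallel_each JOBS CMD [ARG...]
# Reads NUL-separated file names on stdin and runs "CMD ARG... FILE"
# once per file, with at most JOBS processes running at a time.
parallel_each() {
    njobs=$1
    shift
    xargs -0 -n 1 -P "$njobs" "$@"
}

# Example (cdsmrm is the logo remover from this thread):
#   printf '%s\0' bse-*.pbm | parallel_each 6 cdsmrm
```

Because the names travel NUL-separated and are never pasted into a shell command line, file names containing spaces survive intact.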

Re: Program to strip CDSM logos

Posted: Sat Aug 08, 2009 2:10 am
by Mazin
Oh, and some usage notes for my and others' future reference.

Given cdsmrm and parallelize.pl in $PATH, the workflow is something like:

Extract images from PDF: (caution: some PDFs are not so simple, such as where CDSM retypeset pages!)

Code:

pdfimages BestSongEver.pdf bse
This extracts all images to bse-000.pbm, bse-001.pbm, bse-002.pbm, etc. Then remove the CDSM logo from each page, running six jobs at once on a six-core machine:

Code:

parallelize.pl 6 "cdsmrm GLOB" "bse-*.pbm"
Convert images to TIFF in parallel, and compress in the process:

Code:

parallelize.pl 6 "mogrify -format tiff -compress Group4 GLOB" "bse-*.pbm"
Compile the TIFFs into one TIFF:

Code:

tiffcp bse-*.tiff BestSongEver_cleaned.tiff
Optionally convert to PDF:

Code:

tiff2pdf BestSongEver_cleaned.tiff > BestSongEver_cleaned.pdf
Notice that there's no human intervention (although I suggest proofreading!), so theoretically, you could put all this in a batch script and run the batch script against a glob, say, *.pdf, come back in the morning, and proofread to your heart's content.
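
The steps above could be collected into a script along these lines. It's a rough sketch: clean_pdf, clean_all and the JOBS variable are names made up for illustration, it assumes every tool is in $PATH, and it assumes file names without spaces, since parallelize.pl pastes them into a shell command unquoted.

```shell
#!/bin/sh
# Sketch of the workflow above as one script (hypothetical, lightly tested).
# Assumes pdfimages, parallelize.pl, cdsmrm, mogrify, tiffcp and tiff2pdf
# are all in $PATH.

JOBS=${JOBS:-4}   # number of parallel workers; 4 is an arbitrary default

# clean_pdf FILE.pdf -> writes FILE_cleaned.tiff and FILE_cleaned.pdf
clean_pdf() {
    src=$1
    base=${src%.pdf}
    pdfimages "$src" "$base" || return 1
    parallelize.pl "$JOBS" "cdsmrm GLOB" "$base-*.pbm" || return 1
    parallelize.pl "$JOBS" "mogrify -format tiff -compress Group4 GLOB" "$base-*.pbm" || return 1
    tiffcp "$base"-*.tiff "${base}_cleaned.tiff" || return 1
    tiff2pdf "${base}_cleaned.tiff" > "${base}_cleaned.pdf"
}

# clean_all FILE.pdf... -> processes each PDF in turn, reporting failures
clean_all() {
    for pdf in "$@"; do
        case $pdf in *_cleaned.pdf) continue ;; esac   # skip our own output
        clean_pdf "$pdf" || echo "failed: $pdf" >&2
    done
}

clean_all "$@"
```

Saved as, say, clean_cdsm.sh (a hypothetical name), it would be run as `sh clean_cdsm.sh *.pdf`. Skipping `*_cleaned.pdf` keeps a second run from reprocessing the output of the first.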

BTW, there's no guarantee that my program won't detect the logo to be in the center of the page and create a large white rectangle in the middle of a system. Hasn't happened to me yet, but just sayin'...

Re: Program to strip CDSM logos

Posted: Sat Aug 08, 2009 3:10 am
by Mazin
Booted into Windows, so now I have a win32 build:
http://files.aztekera.com/music/cdsmrm20090807win32.zip

Should run without any additional magic. However, you cannot drag'n'drop multiple files onto the exe on account of my failure to include globbing. For globbing, you can use parallelize.pl from my first post, or use the linear-only cdsmrm.bat (same directory as cdsmrm.exe, and you can drag'n'drop multiple PBMs onto it):

cdsmrm.bat

Code:

for %%i in (%*) do cdsmrm.exe %%i

Re: Program to strip CDSM logos

Posted: Sat Aug 08, 2009 7:56 pm
by daphnis
Can you describe (in prose) how its detection method works?

Re: Program to strip CDSM logos

Posted: Sat Aug 08, 2009 9:11 pm
by Mazin
daphnis wrote: Can you describe (in prose) how its detection method works?
Naively. It looks for a hardcoded linear pattern that should only match the logo. Since CDSM added the logos after scanning the pages, the logo pixels should be the same on every page, so exact matching works (as far as I can tell).

Re: Program to strip CDSM logos

Posted: Sun Aug 09, 2009 12:20 am
by Carolus
Outstanding. Have you done any complete scores of those resident on the FTP server?

Re: Program to strip CDSM logos

Posted: Sun Aug 09, 2009 5:15 am
by Mazin
Blah. Rewrote the search algorithm to be less dumb, because apparently I wasn't paying attention in my algorithms course or we didn't cover string searching. It's now three times faster.

New version is at http://files.aztekera.com/music/magic20090809.zip
Carolus wrote: Outstanding. Have you done any complete scores of those resident on the FTP server?
Only two right now (one is linked to in my first post), since I'm still testing. Umm... bursting and then creating a new PDF loses all of the bookmarks and metadata. What should I do about those? Is there a way to export bookmarks and metadata from the original PDF and then import them into the new PDF?

Re: Program to strip CDSM logos

Posted: Sun Aug 09, 2009 5:41 am
by Mazin
And what's the system for keeping track of which ones need to be processed and which ones are cleaned? Is there a "cleaned" directory I need to copy/move them to or something like that?

Re: Program to strip CDSM logos

Posted: Sun Aug 09, 2009 7:24 pm
by Mazin
So using a script that basically does the procedure outlined in my second post, my computer processed the contents of "CelloCD Works" (about 135 PDFs... there is going to be a very happy cellist soon :lol:) in about 30 minutes. However, it's not problem-free. Once in a while, a lone page will be exported as a PPM instead of a PBM and thus be left out of the cleaned PDF (usually the first page; this happened 2 or 3 times), but the presence of a PPM file is a dead giveaway. A bigger issue is in cases where CDSM retypeset entire pages to put the cello part by itself instead of with piano:

ftp://imslp.org/logo_infested/CDSM/Cell ... 0piano.pdf
ftp://imslp.org/logo_infested/CDSM/Cell ... 0Piano.pdf

Again, the presence of PPM files indicates that things were omitted.
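
A check like the following could flag those leftovers automatically. It's a minimal sketch: report_skipped is a hypothetical helper, and it assumes the images were extracted by pdfimages with the given prefix.

```shell
# report_skipped PREFIX
# Print every PREFIX-*.ppm file; return 1 if any exist, 0 otherwise.
report_skipped() {
    skipped=0
    for f in "$1"-*.ppm; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "not cleaned (PPM): $f"
        skipped=1
    done
    return "$skipped"
}

# Example: report_skipped bse || echo "proofread BestSongEver by hand"
```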

Sorry for all the questions, but I still don't know what to do with these files now.

Re: Program to strip CDSM logos

Posted: Tue Aug 11, 2009 2:02 am
by Mazin
buuuuuuuuuump...

Re: Program to strip CDSM logos

Posted: Tue Aug 11, 2009 2:12 pm
by imslp
If there isn't a logo_cleaned or similar directory, you can create one. :-)

Re: Program to strip CDSM logos

Posted: Tue Aug 11, 2009 2:53 pm
by Mazin
Any guidelines or best practices to keeping bookmarks and/or metadata? Or do we not care?

Re: Program to strip CDSM logos

Posted: Wed Aug 12, 2009 6:36 am
by Carolus
We don't like to keep metadata. The bookmarks are OK as long as there's nothing proprietary included. BTW, I downloaded your cleaned files and they came up as "damaged and unreadable". I use Acrobat Pro version 8 (Mac OS).

Re: Program to strip CDSM logos

Posted: Wed Aug 12, 2009 3:21 pm
by Mazin
Carolus wrote: We don't like to keep metadata. The bookmarks are OK as long as there's nothing proprietary included. BTW, I downloaded your cleaned files and they came up as "damaged and unreadable". I use Acrobat Pro version 8 (Mac OS).
Try again. I had an issue where my WiFi adapter would overheat from transmitting data and disconnect, so quite a few files were only partially uploaded. I then started the upload again from a hardwired computer set to replace any file for which the server-side and local file differed in size, which should have replaced all the damaged files. Let me know if any are still damaged!

Re: Program to strip CDSM logos

Posted: Thu Aug 13, 2009 3:09 am
by Carolus
Just downloaded the Bach titles from the cleaned folder. Wonderful - fantastic job!! Now you'll have to try your hand at the volumes of the Digital Bach Edition which are there. There's the DBE logo to contend with, possibly some CDSM logos too, as well as the usual metatags, etc. If you like a real challenge, try taking on some Google scores. We have a number of those on the wiki already, with the Google logos removed. The problem is with those annoying random sections where a page, or even a portion of a page, is done at 150 dpi grayscale while the rest of the document is 600 dpi monochrome, which is much better for printing. The mixture of grayscale and monochrome on the same page causes some printers to crash when attempting to print these.