[FEAT] Count PDF Pages

Posted: Wed Sep 19, 2007 1:36 am
by horndude77
It would be nice not to have to manually enter the number of pages in a PDF file, since that information is already stored in the PDF itself. Here's a command to get the page count:

Code:

pdftk sc01.pdf dump_data | grep NumberOfPages | awk '{print $2}'
Not a huge deal, but it would save a bit of work, especially on those pages which don't yet have a page count.
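
If pdftk isn't available, poppler's pdfinfo should report the same field; the awk filter below is just one way to pull the number out:

Code:

pdfinfo sc01.pdf | awk '/^Pages:/ {print $2}'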

Posted: Wed Sep 19, 2007 2:41 am
by imslp
Thanks for reminding me about this! I actually thought about this a long, long time ago, but forgot about it somewhere along the way. The current problem is CPU usage (think of how long it would take to run that command 10 times to generate a single page). Of course, the solution is to use a caching system, but I will have to write it :)
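
Just to sketch the idea (a rough illustration only, not the actual implementation; the script and cache location are made up for the example), the count could be cached keyed on the file's checksum, so pdftk only has to run when a file is new or has been replaced:

Code:

#!/bin/sh
# Sketch: cache each PDF's page count, keyed by its md5 checksum,
# so pdftk only runs when the file is new or has changed.
pdf="$1"
cachedir="${HOME}/.pdf_pagecount_cache"   # hypothetical cache location
mkdir -p "$cachedir"

key=$(md5sum "$pdf" | awk '{print $1}')
cachefile="$cachedir/$key"

if [ -f "$cachefile" ]; then
    # Cache hit: the file has not changed, so reuse the stored count.
    cat "$cachefile"
else
    # Cache miss: ask pdftk once and remember the answer.
    pdftk "$pdf" dump_data | grep NumberOfPages | awk '{print $2}' | tee "$cachefile"
fi

The real thing would of course live in the wiki's database rather than in loose files, but the principle is the same.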

I may end up bundling this change with other PDF plans that I have... ;)

Posted: Wed Sep 19, 2007 4:20 am
by horndude77
Does MediaWiki not handle caching of pages already? I'm not too familiar with its internals. If not, I remember reading a while back about Wikipedia using Squid proxies to do some caching for them. Perhaps that's overkill, though.

Another option is to only do it once when the file is added. This, however, still has the disadvantage that if the file is replaced, the pages field can get out of sync. Or worse, someone could change the page count field for fun. How are you handling the file size? Perhaps the page count could be handled the same way (stored off in a database, I assume).

Thanks again for all the work you do!

Posted: Wed Sep 19, 2007 5:14 am
by imslp
horndude77 wrote:Does MediaWiki not handle caching of pages already? I'm not too familiar with its internals. If not, I remember reading a while back about Wikipedia using Squid proxies to do some caching for them. Perhaps that's overkill, though.
Squid caching is way too grand for IMSLP right now ;) Usually there are dedicated Squid caching servers (with a ton of memory), but right now everything about IMSLP is on one server, and the first step towards multi-server is usually not in the direction of Squid caching (usually it is the separation of MySQL).

MediaWiki does handle caching, but caching works most efficiently for logged-out users. And even with caching working optimally, everything needs to be reparsed every three days, which would mean rather long wait times if you are unlucky enough to be the one to trigger a reparse of a page with 10+ files on it ;) There is also the chance of timing out during the re-parse, and in the worst-case scenario the page would time out every time, resulting in an inaccessible page. Of course this is rather extreme, but still...
horndude77 wrote:Another option is to only do it once when the file is added. This, however, still has the disadvantage that if the file is replaced, the pages field can get out of sync. Or worse, someone could change the page count field for fun. How are you handling the file size? Perhaps the page count could be handled the same way (stored off in a database, I assume).
Indeed... the difference is that MediaWiki handles file size caching, but of course not file page number caching... so this will have to be implemented by me :) And yes, this will be the way to do caching if it is to be done; and yes, there will be a *lot* of headaches along the way :/ Especially since I'm not particularly familiar with the MediaWiki image processing code (and even less so with the huge changes to it in MediaWiki 1.11)... but we'll see.

Basically, this will be a feature that I will try to push in when I do other PDF file functions (there are a few I'm planning). :)