Group: FSF:office volunteers/scanning old files
Revision as of 14:38, 18 May 2012
Scanning old files is one of the ways to help out at the FSF office. This page contains the instructions for the task.
Scanning Old Files
The files the FSF is scanning are copyright assignments and employer/other disclaimers. These are the legal documents that people sign when they contribute to the GNU project. A disclaimer is permission given by an employer (or school) to let an employee assign their work to the FSF. Right now, the FSF has a nice big stack of them that need to be scanned into .pdf format, so that they can be found quickly when the FSF is contacted about a possible issue. The FSF hired a company to scan them in all at once, but the company did it in a disorganized fashion, and without splitting or naming files in a helpful way.
Strategy
The strategy is to scan in a stack of the presorted documents using the automated scanning machine in the back office. Then use a program called 'pdftk' to split up the digital scan into separate .pdf files named after each signer and the targeted GNU project.
Important
It is important to maintain the order of files at all times during this scanning, checking and refiling process. Some documents were originally stapled, but to scan them, they have their staples removed. Not all of these assignments list the person's name on every page, so if the documents get mixed up, that would pose an issue. In any case, documents should be in a partially alphabetic order, which means they do not need to be resorted, unless you are asked to.
Where to Start
This effort started with the oldest documents first, going from 'A' to 'Z' within each year. Ask Don where the previous volunteer has left off. If 1992 'R' was the last batch, then do 1992 'S'. The files are in black metal file drawers in the room after the FSF store. They are mostly clustered by year, and sub-sorted by letter. However, the order of years is not sorted in an obvious way. If the drawer you need to access is locked, ask Don or Jasimin for help.
The Scanning Process
- It is important that you remove every last staple from documents you put in the scanner.
- Also, remove any tape holding two sheets together.
- Remove or tape down any unimportant post-it notes.
- Tape small documents (at the corners) onto individual 8.5 x 11 pieces of paper.
If you don't do these four things, the scanner will likely jam and tear the pages. If you think a post-it note has important information not already contained in the document, you can tape the non-sticky side down with a piece of scotch tape so the note doesn't stick out. Please move the note so it isn't over an area of the page that already contains text. There may be room on the back for it. Make sure the naturally sticky side is secure.
This is how to operate the scanner:
- Neatly stack the staple- and tape-removed pages face up with the top edge of the page pointing away from you. Put them in the autoloader at the top of the machine and adjust the plastic guides so that they sit firmly against the stack. If there are some slightly larger pages, you should adjust the guides a bit so that they fit without bending. Vertically center the smaller pages, then slide the stack so that the left side is firmly against the machine entrance.
- Next, enter your email address and scan like this:
- Press the 'Fax/Scan' button. Press '2'.
- Touch 'Com. Mode', 'PC', then 'Email' and 'Enter'.
- Type in your email address.
- If you mistype, touch the left arrow, and then the 'delete' button to delete the selected character.
- Press the green 'Start' button.
You should watch the scanning process at least the first few times you do this, so that you can quickly interrupt it by pressing the pink 'Stop' button if there is a jam in the machine. Otherwise, stay close by.
The scanner will email you with the result. Save the file as temp.pdf and put it in the directory you will be using splitpdf in.
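Before splitting, it can help to confirm that the scan has the number of pages you expect. The following is a minimal sketch, not part of the original instructions: it assumes pdftk is installed and relies on its `dump_data` operation, which prints a `NumberOfPages:` line. The helper names `page_count` and `sheets_from_pages` are made up for illustration, and the even-count check assumes every sheet was scanned on both sides.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the FSF instructions.

# Read `pdftk FILE dump_data` output on stdin and print the page count.
page_count() {
    awk -F': ' '$1 == "NumberOfPages" { print $2 }'
}

# Both sides of every sheet are scanned, so the total should be even.
# Prints the number of sheets, or a warning when the count is odd.
sheets_from_pages() {
    if [ $(( $1 % 2 )) -eq 0 ]; then
        echo "$(( $1 / 2 )) sheets"
    else
        echo "odd page count ($1): check for a jam or a missed sheet"
    fi
}
```

With temp.pdf in the current directory, you could run `sheets_from_pages "$(pdftk temp.pdf dump_data | page_count)"` and compare the result with the stack you scanned.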
In Order to Split
Next, you need to split the pdf file with pdftk. Here is the source of the script to use when splitting files:
#!/bin/bash
# Please make the file name the last name of the contributor (first letter capitalized), then a period, then the name of the package in all caps.
# For employer disclaimers, add the word 'disclaimer' after the package name
pdftk A=temp.pdf cat A1-2W output Lastname.PACKAGENAME.pdf
# make sure to add the last scanned document as a comment at the end of this file
Copy and paste it into a new file named splitpdf. After you save it, run the following command in the same directory as the file:
chmod +x splitpdf
Editing ./splitpdf
Once you've done that, edit the file and change the arguments to the first command, instructing pdftk to create a single file for the pages that form a single document.
- The 'A=temp.pdf' part can be left unchanged. The 'A' is a handle: a shorthand used in the rest of the command to refer to temp.pdf.
- 'A1-2W' tells pdftk to select pages 1 through 2 and to rotate them 'West' (90 degrees counterclockwise). Since you will be scanning both sides of each sheet and not all documents use the back side, you will usually need to skip the back sides, after checking to make sure there is nothing written on them. In that case, the first two-sheet document would be selected as 'A1W A3W', without the dash. (Odd-numbered pages are the fronts, and even-numbered pages are the backs. Pages usually need to be rotated because the scanner saves them rotated 90 degrees.)
- If document pages are out of order, then change the order of numbers you use in the script. You may re-order the physical pages that are part of that single document.
- Change the output name, as instructed in the script above. If a document references multiple projects, you can list them like 'Smith.GIMP-GLIB-EMACS.pdf'. Make sure to end the file name with '.disclaimer.pdf' if it is a disclaimer. Also, sometimes a company will be the entity that assigns its own copyright, instead of assigning the copyright of its employees. In that case, the document is not a disclaimer. Put the company name in place of "Lastname", replacing spaces with dashes.
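The naming rules above can be sketched as a small shell function. This is only an illustration: `output_name` and its argument order are hypothetical, and "Example Corp" is a made-up company. It upper-cases the package names, joins them with dashes, replaces spaces in the signer's name with dashes, and adds 'disclaimer' when asked. It does not capitalize the signer's name for you.

```shell
#!/bin/bash
# Hypothetical helper, not part of the FSF instructions.
# Usage: output_name NAME KIND PACKAGE...
#   NAME  last name of the signer, or a company name
#   KIND  "disclaimer" for employer disclaimers, anything else otherwise
output_name() (
    who=$(printf '%s' "$1" | tr ' ' '-')   # company names: spaces become dashes
    kind=$2
    shift 2
    packages=""
    for p in "$@"; do
        p=$(printf '%s' "$p" | tr '[:lower:]' '[:upper:]')
        packages="${packages:+$packages-}$p"   # join packages with dashes
    done
    case $kind in
        disclaimer) printf '%s.%s.disclaimer.pdf\n' "$who" "$packages" ;;
        *)          printf '%s.%s.pdf\n' "$who" "$packages" ;;
    esac
)
```

For example, `output_name Smith assignment gimp glib emacs` prints `Smith.GIMP-GLIB-EMACS.pdf`, and `output_name "Example Corp" assignment emacs` prints `Example-Corp.EMACS.pdf`.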
Then copy the pdftk line and paste it on the next line, change the details for the next document, and repeat.
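If you would rather generate each line than copy and edit it by hand, something like the following could work. It is a sketch under assumptions: `split_line` is a hypothetical helper, it always rotates every page West, and it expects you to list the page numbers you want to keep (for example only the odd-numbered fronts).

```shell
#!/bin/bash
# Hypothetical helper, not part of the FSF instructions.
# Usage: split_line OUTPUT.pdf PAGE...
# Prints one line for splitpdf that selects the given pages of temp.pdf,
# each rotated West, and writes them to OUTPUT.pdf.
split_line() (
    out=$1
    shift
    spec=""
    for n in "$@"; do
        spec="${spec:+$spec }A${n}W"   # e.g. pages 1 and 3 become "A1W A3W"
    done
    printf 'pdftk A=temp.pdf cat %s output %s\n' "$spec" "$out"
)
```

Running `split_line Smith.GIMP.pdf 1 3 >> splitpdf` would append the line for a two-sheet document whose back sides are blank. Always check the back sides before leaving them out.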
Splitting the File
./splitpdf
Once you've covered all of the documents, run the script and check the resulting pdfs against the stack you scanned. Check your work and make any fixes necessary. Delete any pdf files with incorrect names, since re-running the script will not overwrite them, then re-run the script. If some pages have a few letters cut off at the side of the page, that is fine. If there's a lot chopped off, you may want to re-scan that file without losing its place in the stack. Once everything is correct, make a comment at the end of your splitpdf file saying where you left off scanning. Since you will be emailing that file too, it's a way for Don to keep track of the last scanned section.
# I've finished scanning up through 1992 S
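One way to spot leftover pdfs with incorrect names is to compare the files in the directory with the output names listed in splitpdf. The sketch below is not part of the original workflow. The function names `expected_outputs` and `stale_pdfs` are hypothetical, and it assumes every split is written as a single `pdftk ... output NAME.pdf` line with no spaces in the file name.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the FSF instructions.

# Print the output file name of every pdftk line in the given script.
expected_outputs() {
    awk '$1 == "pdftk" { for (i = 1; i < NF; i++) if ($i == "output") print $(i + 1) }' "$1"
}

# Print each pdf in the current directory that no line of the script produces.
# temp.pdf is skipped because it is the raw scan.
stale_pdfs() (
    script=${1:-splitpdf}
    outputs=$(expected_outputs "$script")
    for f in *.pdf; do
        [ -e "$f" ] || continue              # no pdfs at all
        [ "$f" = temp.pdf ] && continue
        printf '%s\n' "$outputs" | grep -xF -- "$f" > /dev/null || printf '%s\n' "$f"
    done
)
```

Running `stale_pdfs` in the working directory lists candidates for deletion. Review the list by hand before removing anything.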
Zipping it Up
Zip up the pdfs and splitpdf together.
tar czf 1992S.tgz *.pdf splitpdf
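Before emailing, you may want to list what actually went into the archive. The following sketch wraps the command above in a hypothetical function, `archive_batch`, which takes the batch label (such as 1992S) and then prints the archive's contents with `tar tzf`. Note that `*.pdf` also matches temp.pdf, so the raw scan is included.

```shell
#!/bin/bash
# Hypothetical helper, not part of the FSF instructions.
# Usage: archive_batch LABEL     (for example: archive_batch 1992S)
# Creates LABEL.tgz from all pdfs plus splitpdf, then lists its contents.
archive_batch() {
    tar czf "$1.tgz" *.pdf splitpdf && tar tzf "$1.tgz"
}
```

The listing should show every split pdf and the splitpdf file itself.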
Then email the .tgz file to Donald. His address is his name AT fsf.org.
Putting the Files Back
Please put files back where you found them. Do not lock the cabinets unless instructed. In other words, don't push in the gray metal projection at the top right of the cabinet. That way, you and other volunteers can access them without John searching for the right key.