Group: FSF:office volunteers/scanning old files

From LibrePlanet
Revision as of 12:27, 27 August 2019 by Craigt (talk | contribs) (Single scan Strategy (in progress 2019-05-01))

Scanning old files is one of the ways to help out at the FSF office. This page contains the instructions for the task.

Scanning Old Files

The files the FSF is scanning are copyright assignments and employer/other disclaimers. These are the legal documents that people sign when they contribute to the GNU project. A disclaimer is permission given by an employer (or school) to let an employee assign their work to the FSF. Right now, the FSF has a nice big stack of them that need to be scanned into .pdf format, so that they can be found quickly when the FSF is contacted about a possible issue. The FSF hired a company to scan them in all at once, but they did it in a disorganized fashion, and without splitting or naming files in a helpful way.

Single scan Strategy (in progress 2019-05-01)

    CURRENTLY UNDER CONSTRUCTION


1. Remove any tape, staples, or other fasteners from the pages.

2. Try to make a batch scan with as many documents as possible.

 A. Take note of the paper size. Is it A4 or 8.5 x 11 (US Letter)?
 B. Is the paper suitable for the feeder? (no tears, creases, stickers, etc.)

3. Make each page as flat as possible.

4. Scan the document to the network share, using duplex scanning as needed.

5. Rename the document as <contributor last name>.RT#(if applicable).GNU.PROJECT.pdf.

6. Move it to the appropriate year in the archive.

7. Stack the documents as neatly as possible to be returned to the vault.
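
The rename and move steps above could be sketched as a small shell helper. This is only an illustration: the function name file_scan, the argument order, and the archive/<year>/ directory layout are assumptions, since this page only says that documents go to the appropriate year in the archive.

```shell
# Hypothetical helper (not part of the FSF workflow): rename one scanned PDF
# to Lastname[.RT#].PROJECT.pdf and move it into <root>/<year>/.
# Usage: file_scan SCAN YEAR LASTNAME PROJECT [RT_NUMBER] [ARCHIVE_ROOT]
file_scan() (
    scan="$1"; year="$2"; last="$3"; project="$4"
    rt="$5"; root="${6:-archive}"    # "archive" is an assumed default location

    name="$last"
    if [ -n "$rt" ]; then name="$name.$rt"; fi    # the RT# is optional
    dest="$root/$year/$name.$project.pdf"

    # Never silently replace a document that was already filed.
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        exit 1
    fi
    mkdir -p "$root/$year" || exit 1
    mv -- "$scan" "$dest" || exit 1
    printf '%s\n' "$dest"
)
```

For example, file_scan scan.pdf 1992 Hacker GNU.EMACS 134242 would file the scan as archive/1992/Hacker.134242.GNU.EMACS.pdf.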

Bulk scanning Strategy

The strategy is to scan in a stack of the presorted documents using the automated scanning machine in the back office. Then use a program called 'pdftk' to split up the digital scan into separate .pdf files named after each signer and the targeted GNU project.

Important

It is important to maintain the order of files at all times during this scanning, checking and refiling process. Some documents were originally stapled, but to scan them, they have their staples removed. Not all of these assignments list the person's name on every page, so if the documents get mixed up, it can be hard to tell which pages belong together. In any case, the documents should already be in partially alphabetical order, so they do not need to be re-sorted unless you are asked to.

Where to Start

This effort started with the oldest documents first, going from 'A' to 'Z' within each year. There is a vertical piece of paper that marks which position the previous volunteer left off at. It should also have the next letter written down for redundancy. The year that needs to be worked on should be written on a post-it note on one of the file drawers. If 1992 'R' was the last batch, then do 1992 'S'.

The files are in black metal file drawers in the room after the FSF store. They are mostly clustered by year, and sub-sorted by letter. However, the order of years is not sorted in an obvious way. The year you are looking for should be the one with the special post-it note attached. If the drawer you need to access is locked, ask craigt for help.

The Scanning Process

  • It is important that you remove every last staple from documents you put in the scanner.
  • Also, remove any tape holding two sheets together.
  • Remove or tape down any unimportant post-it notes.
  • Tape small documents (at the corners) onto individual 8.5 x 11 pieces of paper.

If you skip any of these four steps, the scanner will likely jam and tear the pages. If you think a post-it note has important information not already contained in the document, tape its non-sticky edge down with a piece of scotch tape so the note doesn't stick out. Please move the note so it isn't covering an area of the page that already contains text. There may be room on the back for it. Make sure the naturally sticky side is secure.

This is how to operate the scanner:

  • Neatly stack the staple- and tape-removed pages face up with the top edge of the page pointing away from you, assuming the page is normally oriented. Put them in the autoloader at the top of the machine and adjust the plastic guides so that they sit firmly against the stack. If there are some slightly larger pages, you should adjust the guides a bit so that they fit without bending. Vertically center the smaller pages, then slide the stack so that the left side is firmly against the machine entrance.
  • Next, enter your email address and scan like this:
    • Press 'Fax/Scan' button. Press '2'.
    • Touch 'Com. Mode', 'PC', then 'Email' and 'Enter'.
    • Type in your email address.
      • If you mistype, touch the left arrow, and then the 'delete' button to delete the selected character.
    • Press the green 'Start' button.

You should watch the scanning process at least the first few times you do this, so that you can quickly interrupt it by pressing the pink 'Stop' button if there is a jam in the machine. Otherwise, stay close by.

The scanner will email you with the result. Save the file as temp.pdf and put it in the directory you will be using splitpdf in.
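
Before splitting, it can help to confirm that the scan contains the number of pages you expect. The sketch below assumes a duplex scan, so every sheet contributes two pages. It reads the NumberOfPages line that pdftk's dump_data operation prints; the function names count_pages and sheets_from_pages are made up for this example.

```shell
# Read `pdftk <file> dump_data` output on stdin and print the page count.
count_pages() {
    awk '/^NumberOfPages:/ { print $2 }'
}

# Assuming a duplex scan, each sheet yields two pages (front and back),
# so an odd page count suggests that something went wrong.
sheets_from_pages() (
    pages="$1"
    if [ $((pages % 2)) -ne 0 ]; then
        echo "odd page count ($pages): was this a duplex scan?" >&2
        exit 1
    fi
    echo $((pages / 2))
)

# Usage (requires pdftk, with temp.pdf in the current directory):
#   pages=$(pdftk temp.pdf dump_data | count_pages)
#   sheets_from_pages "$pages"    # compare with the sheets in your stack
```

If the sheet count does not match the stack you fed into the scanner, a page may have been skipped or double-fed.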

In Order to Split

Next, you need to split the pdf file with pdftk. Here is the source of the script to use when splitting files:

#!/bin/bash
# Please make the file name the last name of the contributor (first letter capitalized), then the RT# if available, then the name of the package in all caps including 'GNU' if necessary. Example: Hacker.134242.GNU.EMACS.pdf
# For employer disclaimers, add the word 'DISCLAIMER' after the package name. Example: Hacker.134242.GNU.EMACS.DISCLAIMER.pdf

pdftk A=temp.pdf cat A1-2W output Lastname.PACKAGENAME.pdf


# make sure to add the last scanned document as a comment at the end of this file

  • Copy and paste it into a new file named splitpdf. After you save it, run the following command in the same directory as the file:
chmod +x splitpdf

Editing ./splitpdf

Once you've done that, edit the file and change the arguments to the first command, instructing pdftk to create a single file for the pages that form a single document.

  • The 'A=temp.pdf' part can be left unchanged. The 'A' is a handle that is used as shorthand for temp.pdf in the rest of the command.
  • 'A1-2W' tells pdftk to select pages 1 through 2 and to rotate them 'West' (90 degrees CCW). Since you will be scanning both sides of every sheet and not all documents use the back side, you will usually need to skip the back, after checking to make sure there is nothing written there. In that case, the first two-sheet document would be selected as 'A1W 3W', without the dash. (Odd-numbered pages are fronts and even-numbered pages are backs. They usually need to be rotated because the scanner saves pages rotated 90 degrees.)
  • If document pages are out of order, list the page numbers in the script in the correct order. You may also re-order the physical pages that are part of that single document.
  • If two documents appear to be duplicates, put them both in the same file. If they seem different but the last name and program name are the same, end the file names with -1.pdf, -2.pdf, etc. so that one file doesn't overwrite the other.
  • Change the output name, as instructed in the script above. If a document references multiple projects you can list them like 'Smith.GIMP-GLIB-EMACS.pdf'. Make sure the file name ends in '.DISCLAIMER.pdf' if it is a disclaimer. Also, sometimes a company will be the entity that assigns its own copyright, instead of assigning the copyright of its employees. In that case, the document is not a disclaimer. Put the company name in place of "Lastname", replacing spaces with dashes.
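
The naming rules above can be sketched as a shell function. This is only an illustration: the name doc_name and its argument order are hypothetical, and the upper-case 'DISCLAIMER' suffix follows the comments in the script rather than any confirmed convention.

```shell
# Hypothetical helper: build an output file name from its parts.
# Usage: doc_name SIGNER PROJECTS [RT_NUMBER] [disclaimer]
#   SIGNER   last name, or a company name (spaces become dashes)
#   PROJECTS one or more project names separated by spaces
doc_name() (
    signer=$(printf '%s' "$1" | tr ' ' '-')
    # Several projects are joined with dashes and written in capitals.
    projects=$(printf '%s' "$2" | tr ' ' '-' | tr '[:lower:]' '[:upper:]')
    rt="$3"
    kind="$4"

    name="$signer"
    if [ -n "$rt" ]; then name="$name.$rt"; fi
    name="$name.$projects"
    # Assumption: disclaimers are marked with an upper-case suffix.
    if [ "$kind" = "disclaimer" ]; then name="$name.DISCLAIMER"; fi
    printf '%s.pdf\n' "$name"
)
```

For example, doc_name Smith "gimp glib emacs" prints Smith.GIMP-GLIB-EMACS.pdf, and doc_name "Example Corp" EMACS prints Example-Corp.EMACS.pdf.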

Then copy the pdftk line and paste it on the next line, change the page selection and output name for the next document, and repeat until every document in the stack has its own line.
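
If most documents use only the front of each sheet, the page selections follow a simple pattern: sheet n of the stack is page 2n-1 of temp.pdf. The sketch below generates pdftk lines that way. The helper names front_pages and split_line are invented for this example, and it writes the 'A' handle on every page for clarity.

```shell
# Print the front-side page selection for sheets FIRST..LAST of the stack,
# each rotated West, e.g. sheets 1-2 -> "A1W A3W".
front_pages() (
    sheet="$1"; last="$2"; out=""
    while [ "$sheet" -le "$last" ]; do
        # In a duplex scan, the front of sheet n is page 2n-1.
        out="${out:+$out }A$((2 * sheet - 1))W"
        sheet=$((sheet + 1))
    done
    printf '%s\n' "$out"
)

# Print one pdftk line for splitpdf; it is printed, not executed.
split_line() {
    printf 'pdftk A=temp.pdf cat %s output %s\n' "$1" "$2"
}

# Usage: append a line for a two-sheet document to the script.
#   split_line "$(front_pages 1 2)" Hacker.134242.GNU.EMACS.pdf >> splitpdf
```

Documents that do have writing on the back still need their selection written by hand, so check each sheet before relying on generated lines.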

Splitting the File

./splitpdf

Once you've covered all of the documents, run the script and check the resulting pdfs against the stack you scanned. Check your work, make any fixes necessary, remove pdf files with incorrect names that will not be overwritten, then re-run the script. If some pages have a few letters cut off of the side of the page, that is fine. If there's a lot chopped off, you may want to re-scan that file without losing its place in the stack. Once everything is correct, make a comment at the end of your splitpdf file saying where you left off scanning. Since you will be emailing that file too, it's a way for Tedt to keep track of the last scanned section.

# I've finished scanning up through 1992 S - check this, I think we're up to 1994
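
One mistake that is easy to miss is two pdftk lines that write to the same output name, because the second file silently replaces the first. A quick check along these lines could catch that before the script is run. The function name duplicate_outputs is hypothetical, and the check assumes each pdftk command sits on a single line.

```shell
# Print every output file name that appears on more than one pdftk line
# of the given script. Comment lines are ignored.
duplicate_outputs() {
    awk '$1 == "pdftk" {
             for (i = 2; i < NF; i++)
                 if ($i == "output") print $(i + 1)
         }' "$1" | sort | uniq -d
}

# Usage:
#   duplicate_outputs splitpdf
# No output means every document gets its own file.
```

Any name it prints should get a -1.pdf, -2.pdf suffix, or the two documents should be merged into one file if they are duplicates.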

Zipping it Up

Zip up the pdfs and splitpdf together. Delete temp.pdf first so that it isn't included in the archive.

rm temp.pdf
tar czf 1992S.tgz *.pdf splitpdf

Then email the .tgz file to Tedt. His address is his name AT fsf.org.

If you are going to work on another stack, you should move the files you were working on into a subdirectory, so that you don't mix all of the files together.

mkdir 1992S
mv *.pdf 1992S
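
The commands in this section could be combined into one helper so that no step is forgotten. This is a sketch under the same assumptions as the commands above: it runs in the working directory, the batch is named after the year and letter, and splitpdf stays in place for the next batch. The function name archive_batch is made up.

```shell
# Hypothetical helper: archive the split PDFs plus splitpdf as BATCH.tgz,
# then move the PDFs into a BATCH/ subdirectory.
# Usage: archive_batch 1992S
archive_batch() (
    batch="$1"
    rm -f temp.pdf            # the raw scan must not end up in the archive
    set -- *.pdf              # every remaining PDF in this directory
    if [ ! -e "$1" ]; then
        echo "no split PDFs found in $(pwd)" >&2
        exit 1
    fi
    tar czf "$batch.tgz" "$@" splitpdf || exit 1
    mkdir -p "$batch" || exit 1
    mv -- "$@" "$batch"/
)
```

Listing the archive afterwards with tar tzf 1992S.tgz shows exactly which files will be emailed.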

Putting the Files Back

Please put files back where you found them, but on the other side of the paper divider that marked where the previous volunteer left off. Also cross the completed letter off the paper and write the letter for the next batch. If there are two or more people working on scanning at the same time, then make sure that the files go back in the correct order. (It can be kind of tricky.) Also, coordinate so there isn't any unnecessary duplication of work.

If you finished the last batch of documents from a given year, then change the listed year on the post-it note sitting on the face of the file cabinet. If the next year is in a different drawer, put the note on that drawer. Also move the paper partition (used to mark progress within a year) to the head of the next year, and leave a note that no documents have been scanned from that year.

Please don't lock the cabinet. In other words, don't push in the gray metal projection at the top right of the cabinet. That way, you and other volunteers can access them without John searching for the right key.