So I was presented once again with a PDF of a spreadsheet printout WITH grid lines… This little gifty came from the city development office: it’s an inventory of historic houses in town done in 2007. The city has since lost not only the original spreadsheet version but also the PDF version. They did find the paper version and faxed it to me, which turned it back into a PDF. So far so good… now how do you get it BACK into a spreadsheet?
I have had this problem before… if you try running OCR against it, all the grid lines get interpreted as letters, lines, underscores, etc. Basically it’s a train wreck. Transcription is actually faster, but there are 500+ lines in this document so I’m not that enthusiastic…
Strangely, because I have been doing so much document manipulation for transcription, the solution came to me in my sleep… PHOTOSHOP. (I recommend Photoshop Elements: it’s cheap… hell, it’s free if you buy the right scanner.)
1. Open the multi-page document in Photoshop; it will let you choose which page.
2. Rotate the page, if needed.
3. Enlarge until you see a grid line clearly.
4. Using the magic wand, SELECT the grid line. This will automatically select all the connected lines.
5. DELETE. Yes, you will also lose any letters that are touching the lines… I lost small ‘g’s and some ‘p’s.
6. You will probably have ungrabbable ghost lines. Use Menu -> Enhance -> Adjust Lighting -> Brightness/Contrast, and crank the Brightness until the lines disappear and the Contrast until the letters darken up. Relax, you just want the letters to be readable by the OCR.
7. If needed, SAVE AS a TIFF file… Microsoft’s Imaging program likes .tiff files for OCR. I like FreeOCR (correction: I LOVE FreeOCR), which will open nearly anything, including TIFs and JPGs…
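For the curious, here is roughly what steps 4 to 6 are doing under the hood, as a minimal Python sketch. It is not how Photoshop is actually implemented: it assumes the page has already been loaded as a grid of grey values (0 = black, 255 = white), and the names `binarize`, `erase_connected`, `INK` and `BLANK` are made up for this illustration. The "magic wand" becomes a flood fill that follows connected ink, and the brightness trick becomes a simple cutoff.

```python
from collections import deque

INK, BLANK = 1, 0  # hypothetical pixel codes used only in this sketch


def binarize(gray, cutoff=128):
    """Map 0-255 grey values to INK/BLANK; faint pixels lighter than cutoff vanish."""
    return [[INK if value < cutoff else BLANK for value in row] for row in gray]


def erase_connected(pixels, start):
    """Blank every INK pixel 4-connected to start; return how many were erased."""
    rows, cols = len(pixels), len(pixels[0])
    r0, c0 = start
    if pixels[r0][c0] != INK:
        return 0  # clicked on blank paper, nothing to select
    pixels[r0][c0] = BLANK
    queue = deque([(r0, c0)])
    erased = 1
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and pixels[nr][nc] == INK:
                pixels[nr][nc] = BLANK
                erased += 1
                queue.append((nr, nc))
    return erased


# Tiny made-up "page": 0 = black ink, 255 = white paper, 200 = faint ghost line.
gray_page = [
    [0,   0,   0,   0,   0,   0,   0],
    [0, 255, 255, 255, 255, 255,   0],
    [0, 255,   0, 200,   0, 255,   0],
    [0, 255, 255, 255,   0, 255,   0],
    [0,   0,   0,   0,   0,   0,   0],
]
page = binarize(gray_page)               # the ghost pixel (200) disappears here
erased = erase_connected(page, (0, 0))   # "click" on the grid line at the corner
for row in page:
    print("".join("#" if p else "." for p in row))
```

In this toy page the lone dark pixel in the middle survives, while the two-pixel stroke that touches the bottom line is erased along with the grid. That is the same thing that happened to my ‘g’s and ‘p’s.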
Results vary according to your scanning area.
Ideally you want it back in spreadsheet-ready, FIELD-delimited form… but since I am missing letters and will have to do a lot of proofing, this format may suffice. It lets me insert commas where I want the fields to end:
3 Annis 716-3-25 1910 Tri Ie Decker Arlin ton x 2
4 Annis 716-2-4 1900 Queen Anne Vern Arlin ton x 3
6 Annis 716-2-5 1895 Arlin ton x Double worker housing 2
7 Annis 716-3-24 1895 Queen Anne Vern Arlin ton x 3
9 Annis 716-3-23 1895 Queen Anne Vern Arlin ton x 3
10 Annis 716-2-61895 1895 Arlin ton x Double worker housin 2
11 Annis 716-3-22 1895 Queen Anne Vern Arlin ton x 3
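If you would rather not type every comma by hand, something like the following Python sketch could put in the first few for you. It is only a guess at the layout based on the lines above: it assumes each line starts with a number, a street name, a dash-separated ID and a four-digit year, and it leaves everything after the year as one chunk for manual proofing. The names `LINE_RE`, `split_fields` and `to_csv` are made up for this example.

```python
import csv
import io
import re

# Assumed layout (a guess from the sample lines): number, street, dashed ID, year, rest.
LINE_RE = re.compile(r"^(\d+)\s+(.+?)\s+(\d+-\d+-\S+)\s+(\d{4})\s+(.*)$")


def split_fields(line):
    """Split one OCR line into fields; return it whole if it does not fit the pattern."""
    text = line.strip()
    match = LINE_RE.match(text)
    if match is None:
        return [text]  # leave odd lines untouched for proofing by hand
    return list(match.groups())


def to_csv(lines):
    """Return comma-delimited text for all non-empty OCR lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        if line.strip():
            writer.writerow(split_fields(line))
    return buffer.getvalue()


ocr_lines = [
    "3 Annis 716-3-25 1910 Tri Ie Decker Arlin ton x 2",
    "10 Annis 716-2-61895 1895 Arlin ton x Double worker housin 2",
]
print(to_csv(ocr_lines), end="")
```

The OCR mistakes (missing letters, the mangled ID on line 10) come through unchanged, so the proofing still has to happen. This only saves the comma typing for the columns that are regular enough to recognize.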
You can of course SELECT columns to scan, and end up with something you can paste into a spreadsheet in sections.
Here endeth the lesson