Wednesday, April 15, 2015

Data Mining a 213,272-Record Tax Roll

SIMPLE PROBLEM?  In March 2015 -- an oil & gas client asked for targeted real estate analysis of a small community development in east Texas.  The Goal?  Simple enough to state: Construct a "community directory" for some local landowner outreach and potential permit acquisition.

FOCUS? REGION OF INTEREST (ROI) -- My client is generally good at area / distance / point measurements via Google Earth -- and sent a KMZ outline of the area of interest.  
Screenshot of Google Earth Startup Screen

Site drill-down -- and visual inspection of the KMZ in Google Earth -- indicated only a few hundred lots and parcels were in play.  The task seemed simple -- just match up the Parcel IDs with the county appraisal Tax Roll records -- do a GIS JOIN (GeoLink) of shapefile parcels and tax roll records -- and spatially extract the landowner names and addresses inside the ROI.

Well surprise! Surprise! -- as Gomer might say.  Not so simple!  The data mining adventure details follow ...


Map of East Texas

DATA POLITICAL SETTING -- The east Texas county in question apparently hired a new GIS crew to "bring in-house" the parcel quilt and appraisal tax roll maintenance -- to claw back the data from an external contractor.  From phone discussions with the new county appraisal GIS alpha-hero, it was clear that the parcel quilt and tax roll were not exactly "ready for prime time" -- and would require some fuzzy logic to match shapefile Parcel IDs to tax roll records.

PARCEL and TAX ROLL DATA -- Website downloads of "refreshed" county tax appraisal data ran over 109 megabytes for the ZIP compressed tax roll (Wow!) -- and 87 megabytes for the ZIP compressed parcel shapefiles (again, Wow!).  Why "Wow"?  ZIP compression can have a 20:1 "squash" ratio -- so it was possible I was looking at some (109+87)*20 => 3920 megabytes of data to "mine" for just a couple of hundred landowners.  If the data be clean -- all would be well.  If the data not be clean -- lots of low grade data mining ahead.  Powerful processing tools required.  Sorting the economically worthless "Gangue" from the valuable "Ore" -- as the two are known in hard rock mining.


Hard Rock Mining in PA circa 1908
(from Wikipedia)

MALICIOUS COMPLIANCE or Just Complex from a Mixed History? -- Sometimes, when an external contractor is taken out of the work loop by a sponsor -- the contractor will manifest resentment -- and will meticulously "dump all data" -- with the intent to "show them" how much the contractor is really needed.  And sometimes the data sets are really "just that complex" -- because of past tax records analysis & reference demands -- developed over many years -- by many folks -- with differing goals.

It is almost impossible -- as an outside data miner and seeker of data value -- to determine if a data suite is "malicious compliance" -- or just complicated because of a long development history.  Either way -- there is much data forensics, statistical inspection -- and general bulk familiarization ahead -- when a large GIS data suite is downloaded -- more or less -- cold.

TAX ROLL METADATA -- Archive listing of the tax roll (via the command line "unzip -l tax_roll_2015.zip") indicated there were 17 "flat text" ASCII files -- for a total of 649.1 megabytes.  Additionally -- the flat text files were supplemented with a "metadata" XLSX spreadsheet -- that seemed to lay out the data schema and structures of the 17 flat text files.  If the XLSX schema was accurate -- this would save a great deal of time and cost pain in "pulling out" the few hundred landowners desired by my client -- for his target KMZ area.


Map Geometry in shapefiles
(from Wikipedia)

COUNTY TAX PARCEL SHAPEFILES -- were "standard" ESRI shapefiles -- and were generally well behaved in various GIS software systems ( ArcGIS 9.x / ArcView 3.x / QGIS / GlobalMapper / etc. ).  And the magic GeoLink column [PID] -- Parcel ID -- was well populated -- and [PID] histogram counts indicated a strong uniqueness across the entire county parcel quilt of 200K+ parcels.  Very Good.
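A quick sanity check on that uniqueness -- assuming the parcel attribute table has been dumped to a comma delimited text file with the PID in the first column (file name and layout here are hypothetical) -- is a short AWK histogram script of the sort discussed later in this post (run as: gawk -f pid_check.awk parcels_attr.csv):

# pid_check.awk -- count how many times each Parcel ID appears
BEGIN { FS = "," }                  # attribute dump assumed comma delimited
      { count[$1]++ }               # PID assumed to sit in column 1
END   {
        dupes = 0
        for (pid in count)
          if (count[pid] > 1) { dupes++ ; print "duplicate:", pid, count[pid] }
        print "unique PIDs:", length(count), "  duplicated PIDs:", dupes
      }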


IBM Punch Card (record) w/ Implicit Hollerith 
Columns (from Wikipedia)

TAX ROLL with Landowner Names -- More problematic -- 213,272 "Rows" by 8602 "Columns" ??  The flat ASCII text file that held the highest odds of "real" landowner data was called "APPR_INFO.TXT" -- which contained 213,272 UNIX type newline delimited records -- and each record was a rigorous fixed width 8602 bytes long.

The XLSX metadata and layout matrix indicated that the columns were NOT DELIMITED  -- no tabs / no spaces / no commas / no semi-colons / no other CSV column indicator tokens -- between columns.   Just implicit column splits -- at character (byte) boundaries -- indicated thus (for brevity, only 17 key columns illustrated):

FieldName     Start    End   Length  Notes
Prop_ID           1     12       12  Property ID as Int(12)
Prop_CD          13     17        5  Property Code as Char(5)
Prop_YR          18     22        5  Property Year as Numeric(5)
<skip>          ...    ...      ...  <ignore a bunch of cols>
Owner_Name      609    678       70  Landowner name, Char(70)
<skip>          ...    ...      ...  <ignore some more cols>
Addr01          694    753       60  Tax addr line01, Char(60)
Addr02          754    813       60  Tax addr line02, Char(60)
Addr03          814    873       60  Tax addr line03, Char(60)
Adr_City        874    923       50  Tax addr City, Char(50)
Adr_State       924    973       50  Tax addr State, Char(50)
Adr_Country     974    978        5  Tax addr Country, Char(5)
Addr_Zip5       979    983        5  Tax Zip5 Base, Char(5)
Addr_Zip4       984    987        4  Tax Zip4 Suffix, Char(4)
<skip>          ...    ...      ...  <ignore some more cols>
Legal          1150   1404      255  Legal Description, Char(255)
<skip>          ...    ...      ...  <ignore some more cols>
Legal_Acres    1660   1675       16  Legal Tax Acreage
<skip>          ...    ...      ...  <ignore some more cols>
Assessed       1946   1960       15  Tax Value, Numeric(15)
<skip>          ...    ...      ...  <ignore some more cols>
Deed Num       5358   5407       50  Deed Number, Char(50)
<skip>          ...    ...      ...  <ignore some more cols>
Intra_Alloc    8588   8602       15  IntraState Allocation, Char(15)

DIFFICULT DATA IMPORT?  Fixed width, implicit column data import is very much a FORTRAN Hollerith "program card" style machine reading problem.  This means that no "spreadsheet like" software -- or database software -- could easily "swallow" the flat text file "whole" -- in one import gulp.  Other tools were required.


"Program Card" used to delimit fixed width data field
via Hollerith formats for keypunch machines, 
circa 1949 (from wikipedia)

CHALLENGE -- Hmm ... how to move the fixed width -- and Hollerith like -- tax roll into a format understood by the various GIS software systems?  It would be best to move the flat text file to DBF / ODB / MDB / XLSX / CSV -- some file format "well known" to GIS software systems.

Olde Fortran Malt Liquor
Logo from Futurama

FORTRAN HOLLERITH RECORDS?  Yes -- one of the advantages of being in the data mining and data value extraction biz for many years is that I have seen some of the really old stuff -- the stuff that "set the tone" -- and "established a trajectory" -- for many modern datasets and software systems.  Fortran data processing was one of these early trend setters.

Depending upon the computer system supporting Ye-Old-Fortran -- Binary / ASCII / EBCDIC / whatever early Character Encoding -- the Fortran data was often stored "on disk" (often tape or shoe boxes of punch cards) -- with implicit column boundaries for the "punch card" record.  Sort of like a spreadsheet with no column splits the eyeball can see.  

Why implicit columns?  Why no delimiters?  Because memory and disk storage was so ghastly expensive in the 1960s and 1970s -- equivalent to millions of dollars per megabyte in 2015 dollars -- no data "space" was "wasted" on column or field delimitation -- no commas / spaces / tabs / semi-colons / whatever.  Your program and software just needed to "count columns carefully" as data was read in (the program card) -- and break out the columns by counts.  Old Fortran had a powerful format tool for implicit column handling -- called the HOLLERITH format (see the Wikipedia page).

MIMIC OLD FORTRAN HOLLERITH IN MODERN SOFTWARE?  

This was a key question.  Most "modern" programming languages -- which could be used to convert the fixed width flat ASCII file to a "GIS format" -- are not "naturally" tuned for Hollerith type conversions.  Fortran was not an option (no compiler on hand) -- and would be a big step backward.  The trick was how to mimic Hollerith "program cards" in a "modern" programming language -- like C, Python, Java -- or a database tool like SQLite.



SQLITE as Conversion Tool?  Yes, it is possible -- yet tricky -- to mimic Hollerith format in SQLite -- using the SQLite SUBSTR() function.  Not graceful -- difficult to debug -- yet possible.  And it would "sort of" dead-end in an SQLite DB file -- and require "re-export" into CSV format.  Ouch!  Lots of places for the data conversion to "go off the rails."


C, Python, Java as Conversion Tools?  Yes, also possible -- more control than SQLite -- yet the prospect of "byte slicing" and re-formatting many fixed width columns -- and doing all the column index accounting in "N" vs "N-1" count space -- was not very appealing.  Besides, my client was budget limited -- and writing a bunch of code becomes time and labor expensive -- fast.

Can You GAWK?

GNU AWK?  Could the GNU version of the AWK programming language handle rigorously fixed width flat ASCII text files?  The idea was appealing.

Back in the mid 1990s -- I had grown very tired of writing yet-another-FORTRAN-data-converter for yet-another-context-switch-of-field-geophysics-data -- and in the process of learning UNIX and C -- I stumbled across the most excellent and still relevant 1984 book The UNIX Programming Environment -- which contained a very short and sweet intro to the AWK programming language.

1984 Book from Kernighan & Pike
with AWK Intro

As a mathematician and physics guy -- I was thunderstruck by the simplicity of AWK -- how AWK assumes -- as a matter of program invocation (start-up) -- that a record based file will be processed -- line by line -- and herds of file statistics / parameters / metrics are automagically "just there" for the programmer to reference by variable name.  No programmer set-up funny biz like (a) opening a file, (b) setting up loops to scan every file record, (c) splitting records by some delimiter, (d) collecting stats & counts like record numbers and column limits, (e) etc.

In AWK -- all of this routine funny biz was INCLUDED -- built into AWK -- just by invocation of AWK -- and opening a record based file (generally delimited by newlines).  Amazing.  
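For example -- this complete one-line AWK program prints the record number (NR), the field count (NF), and the first field ($1) of every line -- with no file opening, no loop, and no splitting code in sight (the file name is just a placeholder):

awk '{ print NR, NF, $1 }' some_data_file.txt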

Learning to program with AWK revolutionized my data mining productivity.  And allowed me to tackle geophysics files with 300K to 900K records (remember, this was the mid 1990s -- when a 600 megabyte disk drive was "huge").

So the question for my client -- and the east Texas tax roll extraction -- was, could I "slice and dice" an 8602-bytes-per-record file -- a file with over 200K records -- with AWK -- or the GNU version called GAWK?  And do so without a bunch of GAWK substr() function slicing and dicing?
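For a taste of the substr() route I wanted to avoid -- here is a sketch that pulls just the owner name and the first address line by byte position (start positions and lengths taken from the layout table above -- no trimming of the space padding shown).  Multiply this by dozens of columns and the index bookkeeping gets old -- fast:

{
 # -- manual byte-count slicing -- every column needs its own start / length --
 owner  = substr($0, 609, 70)     # Owner_Name : bytes 609 - 678
 addr01 = substr($0, 694, 60)     # Addr01     : bytes 694 - 753
 print owner "," addr01
}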

Use "The Force" of a Search Engine, Luke!

Good News!  It has been my experience that as soon as some question pops into my mind -- probably 10,000 other folks have had the same problem -- and at least 2-3 of those folks made a web page / blog.  After some web research -- and drilling thru differing problem statements and nomenclatures -- I was thrilled to discover that GAWK contains a well defined mechanism to slice-and-dice fixed width column files.  Enter the GAWK "built-in" FIELDWIDTHS variable.

How does the FIELDWIDTHS variable work in GAWK?  Surprisingly much like the old Fortran Hollerith "Program Card."
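A tiny made-up taste (made-up widths, made-up record -- the real mechanics are unpacked below).  Tell GAWK the column widths once -- and every record is automatically split into $1, $2, $3 ... at those byte counts:

BEGIN { FIELDWIDTHS = "4 2 8" }    # 4-byte ID, 2-byte state, 8-byte date
      { print "id=" $1, "state=" $2, "date=" $3 }

Fed the 14 character line "1234TX20150415" -- this prints:  id=1234 state=TX date=20150415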

First some short details regarding AWK / GAWK, then how the FIELDWIDTHS variable "blows away" fixed width files -- and empowers more rapid data mining.

Every AWK / GAWK program has one to three basic "sections" -- BEGIN Block -- DO Every Line Block -- END Block -- each of which is optional.  Program structure is roughly as follows -- and this "real" example will count the records contained in an input file (note that a "hashtag" makes everything after it on the line a non-executable comment):


BEGIN {
       # --- do a bunch of stuff before reading the 1st record ---
       recknt = 0   # -- set record count to zero
      }
{
 # --- do a bunch of stuff for every record of the input file ---
 recknt = recknt + 1  # -- update record count
}
END { 
     # --- summarize what was found across all records ---
     print recknt
    }
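Saved to a file -- say recount.awk (the name is arbitrary) -- the whole thing runs as:  gawk -f recount.awk APPR_INFO.TXT  -- and prints a single number, which should match the 213,272 records promised by the metadata.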

FOR the 8602-byte tax roll file -- the FIELDWIDTHS variable almost exactly tracked the XLSX metadata schema -- specifically the [Length] indicator column:

BEGIN { 
  FIELDWIDTHS = "12 5 5 12 2 10 500 50 12 70 1 12 2 60 60 60 50 50 5 5 4 2 1 1 20 1 27 10 50 10 30 10 255 255 16 10 10 50 50 15 15 15 15 15 15 15 15 15 15 15 20 1 12 20 20 25 12 70 50"

  } 

Large?  Yes.  Cumbersome?  Seems so -- yet NO!  The GAWK FIELDWIDTHS "template" variable is much less grief than doing all the character index accounting in C or Python or Java.  And none of that N vs N-1 difference between char counts that begin with "0" (C / Java / Python) versus those that start with "1" (Fortran and GAWK).
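To give a flavor of the whole filter -- here is a hedged sketch of the extraction script (NOT the exact production script -- the header line, the trimming, and the column choices are illustrative).  With the width template above, Prop_ID lands in $1, Owner_Name in $10, the three address lines in $14 thru $16, and city / state / Zip5 / Zip4 in $17, $18, $20 and $21:

# tax_slice.awk -- sketch of the fixed width to CSV filter
BEGIN {
  # -- same width template as above --
  FIELDWIDTHS = "12 5 5 12 2 10 500 50 12 70 1 12 2 60 60 60 50 50 5 5 4 2 1 1 20 1 27 10 50 10 30 10 255 255 16 10 10 50 50 15 15 15 15 15 15 15 15 15 15 15 20 1 12 20 20 25 12 70 50"
  OFS = ","
  print "Prop_ID,Owner_Name,Addr01,Addr02,Addr03,City,State,Zip5,Zip4"
}
{
  # -- trim the leading / trailing space padding every fixed width field carries --
  for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)
  # -- $1 Prop_ID / $10 Owner_Name / $14-$16 Addr01-03 / $17 City / $18 State / $20 Zip5 / $21 Zip4 --
  # -- (owner names with embedded commas would need quoting in a real production run) --
  print $1, $10, $14, $15, $16, $17, $18, $20, $21
}

Run as:  gawk -f tax_slice.awk APPR_INFO.TXT > owners.csv  (file names illustrative).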

SUCCESS!! -- Almost on the 1st try.  With two or three edit updates to the GAWK filter script -- the very fat appraisal tax roll with 8602 "columns" -- and 213,272 "rows" -- was gracefully sliced and diced into a CSV file for GIS and database import.  And cross platform tests on Win32 and Linux generated identical results.  The GAWK script with FIELDWIDTHS was up and working in less than four hours -- cross platform!  Outstanding!  And it saved my client lots of time and expense drilling down to the desired answer -- a contact list of landowners.

SUMMARY? Upshot? -- Next time you need to "slice-and-dice" a large fixed width flat text file -- please do consider GAWK.  It works on Windows / UNIX / MacOS -- many platforms -- and will dramatically increase your data mining and analysis productivity.