Python Fingerprint Example
==========================

Python is an easy-to-use language for running data analysis. To
demonstrate this, we will implement one of the NIST Big Data Working
Group case studies: matching fingerprints between sets of probe and
gallery images.

In order for this to run, you'll need to have installed the NIST
Biometric Image Software (NBIS) and SQLite3. You'll also need the
Python libraries ``numpy``, ``scipy``, and ``matplotlib``.

The general application works like so:

#. Download the dataset and unpack it
#. Define the sets of probe and gallery images
#. Preprocess the images with the ``mindtct`` command from NBIS
#. Use the NBIS command ``bozorth3`` to match the gallery images to
   each probe image, obtaining a matching score
#. Write the results to an SQLite database

To begin with, we'll define our imports. First off, we use the print
function to be compatible with Python 3:

.. code:: python

    from __future__ import print_function

Next, we'll be downloading the dataset from NIST, so we need these
libraries to make that easier:

.. code:: python

    import urllib
    import zipfile
    import hashlib

We'll be interacting with the operating system and manipulating files
and their pathnames:

.. code:: python

    import os.path
    import os
    import sys
    import shutil
    import tempfile

Some generally useful utilities:

.. code:: python

    import itertools
    import functools
    import types

The ``attrs`` library provides some nice shortcuts for defining
feature-rich objects:

.. code:: python

    import attr

We'll be randomly dividing the entire dataset, based on user input,
into the probe and gallery sets:

.. code:: python

    import random

We'll need these to call out to the NBIS software. We'll also be using
multiple processes to take advantage of all the cores on our machine:

.. code:: python

    import subprocess
    import multiprocessing

As for plotting, we'll use ``matplotlib``, though there are many other
alternatives you may choose from:

.. code:: python

    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np

Finally, we'll write the results to a database:

.. code:: python

    import sqlite3

Utility functions
-----------------

Next we'll define some utility functions:

.. code:: python

    def take(n, iterable):
        "Returns a generator of the first **n** elements of an iterable"
        return itertools.islice(iterable, n)

    def zipWith(function, *iterables):
        "Zip a set of **iterables** together and apply **function** to each tuple"
        for group in itertools.izip(*iterables):
            yield function(*group)

    def uncurry(function):
        "Transforms an N-ary **function** so that it accepts a single parameter of an N-tuple"
        @functools.wraps(function)
        def wrapper(args):
            return function(*args)
        return wrapper
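To make the behavior of these helpers concrete, here is a small sanity
check. This snippet is not part of the pipeline; it just assumes the
imports and helpers above are in scope:

.. code:: python

    # take: lazily grab the first n elements of an (infinite) iterable
    print(list(take(3, itertools.count())))                      # [0, 1, 2]

    # zipWith: combine two iterables element-wise with a function
    print(list(zipWith(lambda a, b: a + b, [1, 2], [10, 20])))   # [11, 22]

    # uncurry: adapt a binary function to accept a single 2-tuple,
    # which is handy when mapping over sequences of tuples
    add = uncurry(lambda a, b: a + b)
    print(add((1, 2)))                                           # 3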
The ``fetch_url`` helper downloads a file, using a SHA-256 checksum to
decide whether a previous download can be reused:

.. code:: python

    def fetch_url(url, sha256, prefix='.', checksum_blocksize=2**20, dryRun=False):
        """Download a url.

        :param url: the url to the file on the web
        :param sha256: the SHA-256 checksum. Used to determine if the file
                       was previously downloaded.
        :param prefix: directory to save the file
        :param checksum_blocksize: blocksize to use when computing the checksum
        :param dryRun: boolean indicating that calling this function should do nothing
        :returns: the local path to the downloaded file
        :rtype: str
        """

        if not os.path.exists(prefix):
            os.makedirs(prefix)

        local = os.path.join(prefix, os.path.basename(url))

        if dryRun:
            return local

        if os.path.exists(local):
            print('Verifying checksum')
            chk = hashlib.sha256()
            with open(local, 'rb') as fd:
                while True:
                    bits = fd.read(checksum_blocksize)
                    if not bits:
                        break
                    chk.update(bits)
            if sha256 == chk.hexdigest():
                return local

        print('Downloading', url)

        def report(sofar, blocksize, totalsize):
            msg = '{}%\r'.format(100 * sofar * blocksize / totalsize)
            sys.stderr.write(msg)

        urllib.urlretrieve(url, local, report)
        return local

Dataset
-------

We'll now define some global parameters. First, the fingerprint
dataset:

.. code:: python

    DATASET_URL = 'https://s3.amazonaws.com/nist-srd/SD4/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip'
    DATASET_SHA256 = '4db6a8f3f9dc14c504180cbf67cdf35167a109280f121c901be37a80ac13c449'

We'll define how to download the dataset. This function is general
enough that it could be used to retrieve most files, but we'll default
it to use the values from above:

.. code:: python

    def prepare_dataset(url=None, sha256=None, prefix='.', skip=False):
        url = url or DATASET_URL
        sha256 = sha256 or DATASET_SHA256
        local = fetch_url(url, sha256=sha256, prefix=prefix, dryRun=skip)

        if not skip:
            print('Extracting', local, 'to', prefix)
            with zipfile.ZipFile(local, 'r') as zipfd:
                zipfd.extractall(prefix)

        name, _ = os.path.splitext(local)
        return name

    def locate_paths(path_md5list, prefix):
        with open(path_md5list) as fd:
            for line in itertools.imap(str.strip, fd):
                parts = line.split()
                if len(parts) != 2:
                    continue
                md5sum, path = parts
                chksum = Checksum(value=md5sum, kind='md5')
                filepath = os.path.join(prefix, path)
                yield Path(checksum=chksum, filepath=filepath)

    def locate_images(paths):

        def predicate(path):
            _, ext = os.path.splitext(path.filepath)
            return ext in ['.png']

        for path in itertools.ifilter(predicate, paths):
            yield image(id=path.checksum.value, path=path)

Data Model
----------

We'll define some classes so we have a nice API for working with the
dataflow. We set ``slots=True`` so that the resulting objects will be
more space-efficient.

Utilities
^^^^^^^^^

Checksum
~~~~~~~~

The checksum consists of the actual hash value (``value``) as well as
a string representing the hashing algorithm (``kind``). The validator
enforces that the algorithm can only be one of the listed acceptable
methods.

.. code:: python

    @attr.s(slots=True)
    class Checksum(object):
        value = attr.ib()
        kind = attr.ib(validator=attr.validators.in_(
            'md5 sha1 sha224 sha256 sha384 sha512'.split()))

Path
~~~~

A ``Path`` refers to an image's filepath and its associated
``Checksum``. We get the checksum "for free" since the MD5 hash is
provided for each image in the dataset.

.. code:: python

    @attr.s(slots=True)
    class Path(object):
        checksum = attr.ib()
        filepath = attr.ib()

Image
^^^^^

The start of the data pipeline is the image. An ``image`` has an id
(the MD5 hash) and the path to the image:

.. code:: python

    @attr.s(slots=True)
    class image(object):
        id = attr.ib()
        path = attr.ib()
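As a quick illustration of the data model, here is a hypothetical
snippet; the hash value and filepath below are invented for the
example:

.. code:: python

    chk = Checksum(value='d41d8cd98f00b204e9800998ecf8427e', kind='md5')
    p = Path(checksum=chk, filepath='sd04/png_txt/figs_0/f0001_01.png')
    img = image(id=p.checksum.value, path=p)
    print(img.id)

    # The validator rejects unknown hash kinds, e.g.:
    # Checksum(value='deadbeef', kind='crc32')  # raises ValueError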
Mindtct
^^^^^^^

The next step in the pipeline is to apply ``mindtct`` from NBIS. A
``mindtct`` object therefore represents the result of running
``mindtct`` on an ``image``. The ``xyt`` output is needed for the next
step, and the ``image`` attribute holds the image id.

.. code:: python

    @attr.s(slots=True)
    class mindtct(object):
        image = attr.ib()
        xyt = attr.ib()

We need a way to construct a ``mindtct`` object from an ``image``
object. A straightforward way of doing this would be to have a
``from_image`` ``@staticmethod`` or ``@classmethod``, but that doesn't
work well with ``multiprocessing``, as top-level functions work best
(they need to be serialized).

.. code:: python

    def mindtct_from_image(image):
        imgpath = os.path.abspath(image.path.filepath)
        tempdir = tempfile.mkdtemp()
        oroot = os.path.join(tempdir, 'result')

        cmd = ['mindtct', imgpath, oroot]

        try:
            subprocess.check_call(cmd)
            with open(oroot + '.xyt') as fd:
                xyt = fd.read()
            result = mindtct(image=image.id, xyt=xyt)
            return result
        finally:
            shutil.rmtree(tempdir)

Bozorth3
^^^^^^^^

The final step in the pipeline calls out to the ``bozorth3`` program
from NBIS. The ``bozorth3`` class represents a completed match,
tracking the ids of the probe and gallery images as well as the match
score. Since we'll be writing these instances out to a database, we
provide some static methods for the SQL statements. While there are
many object-relational mapping (ORM) libraries available for Python,
we wanted to keep this implementation simple.

.. code:: python

    @attr.s(slots=True)
    class bozorth3(object):
        probe = attr.ib()
        gallery = attr.ib()
        score = attr.ib()

        @staticmethod
        def sql_stmt_create_table():
            return 'CREATE TABLE IF NOT EXISTS bozorth3 (probe TEXT, gallery TEXT, score NUMERIC)'

        @staticmethod
        def sql_prepared_stmt_insert():
            return 'INSERT INTO bozorth3 VALUES (?, ?, ?)'

        def sql_insert_values(self):
            return self.probe, self.gallery, self.score
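To see how these SQL helpers fit together, here is a minimal round
trip against an in-memory database; the ids and the score are invented
for the example:

.. code:: python

    conn = sqlite3.connect(':memory:')
    cur = conn.cursor()
    cur.execute(bozorth3.sql_stmt_create_table())

    # A fabricated match result, inserted via the prepared statement
    match = bozorth3(probe='abc123', gallery='def456', score=42)
    cur.execute(bozorth3.sql_prepared_stmt_insert(), match.sql_insert_values())
    conn.commit()

    print(cur.execute('SELECT * FROM bozorth3').fetchall())
    # -> [('abc123', 'def456', 42)]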
In order to work well with ``multiprocessing``, we define a class
representing the input parameters to ``bozorth3`` and a helper
function to run it. This way the pipeline definition stays simple: one
``map`` to create the inputs and another ``map`` to run the program.
As NBIS ``bozorth3`` can be called to compare one-to-one or
one-to-many, we'll also dynamically choose between these approaches
depending on whether the gallery is a list or a single object.

.. code:: python

    @attr.s(slots=True)
    class bozorth3_input(object):
        probe = attr.ib()
        gallery = attr.ib()

        def run(self):
            if isinstance(self.gallery, mindtct):
                return bozorth3_from_group(self.probe, self.gallery)
            elif isinstance(self.gallery, types.ListType):
                return bozorth3_from_one_to_many(self.probe, self.gallery)
            else:
                raise ValueError('Unhandled type for gallery: {}'.format(type(self.gallery)))

    def run_bozorth3(input):
        return input.run()

One-to-one
~~~~~~~~~~

Here, we define how to run NBIS ``bozorth3`` on a one-to-one input:

.. code:: python

    def bozorth3_from_group(probe, gallery):
        tempdir = tempfile.mkdtemp()
        probeFile = os.path.join(tempdir, 'probe.xyt')
        galleryFile = os.path.join(tempdir, 'gallery.xyt')

        with open(probeFile, 'wb') as fd:
            fd.write(probe.xyt)
        with open(galleryFile, 'wb') as fd:
            fd.write(gallery.xyt)

        cmd = ['bozorth3', probeFile, galleryFile]

        try:
            result = subprocess.check_output(cmd)
            score = int(result.strip())
            return bozorth3(probe=probe.image, gallery=gallery.image, score=score)
        finally:
            shutil.rmtree(tempdir)

One-to-many
~~~~~~~~~~~

Calling ``bozorth3`` in one-to-many mode avoids the overhead of
starting a separate process for each pair, which turns out to be more
efficient:

.. code:: python

    def bozorth3_from_one_to_many(probe, galleryset):
        tempdir = tempfile.mkdtemp()
        probeFile = os.path.join(tempdir, 'probe.xyt')
        galleryFiles = [os.path.join(tempdir, 'gallery%d.xyt' % i)
                        for i, _ in enumerate(galleryset)]

        with open(probeFile, 'wb') as fd:
            fd.write(probe.xyt)
        for galleryFile, gallery in itertools.izip(galleryFiles, galleryset):
            with open(galleryFile, 'wb') as fd:
                fd.write(gallery.xyt)

        cmd = ['bozorth3', '-p', probeFile] + galleryFiles

        try:
            result = subprocess.check_output(cmd).strip()
            scores = map(int, result.split('\n'))
            return [bozorth3(probe=probe.image, gallery=gallery.image, score=score)
                    for score, gallery in zip(scores, galleryset)]
        finally:
            shutil.rmtree(tempdir)

Plotting
--------

For plotting we'll operate only on the database. We'll choose a small
number of probe images and plot the score between them and the rest of
the gallery images:

.. code:: python

    def plot(dbfile, nprobes=10, outfile='figure.png'):
        conn = sqlite3.connect(dbfile)
        results = pd.read_sql('SELECT probe FROM bozorth3 ORDER BY score LIMIT %d' % nprobes,
                              con=conn)
        shortlabels = mk_short_labels(results.probe)
        plt.figure()

        for i, probe in results.probe.iteritems():
            stmt = 'SELECT gallery, score FROM bozorth3 WHERE probe = ? ORDER BY gallery DESC'
            matches = pd.read_sql(stmt, params=(probe,), con=conn)
            xs = np.arange(len(matches), dtype=np.int)
            plt.plot(xs, matches.score, label='probe %s' % shortlabels[i])

        plt.ylabel('Score')
        plt.xlabel('Gallery')
        plt.legend()
        plt.savefig(outfile)

The image ids are long hash strings. To minimize the amount of space
the labels take up on the figure, we provide a helper function that
creates a short label which still uniquely identifies each probe image
in the selected sample:

.. code:: python

    def mk_short_labels(series, start=7):
        for size in xrange(start, len(series[0])):
            if len(series) == len(set(map(lambda s: s[:size], series))):
                break
        return map(lambda s: s[:size], series)

Main Entry Point
----------------

Putting it all together:

.. code:: python

    if __name__ == '__main__':
        prefix = sys.argv[1]
        DBFILE = os.path.join(prefix, 'scores.db')
        PLOTFILE = os.path.join(prefix, 'plot.png')
        md5listpath = sys.argv[2]
        perc_probe = float(sys.argv[3])
        perc_gallery = float(sys.argv[4])

        pool = multiprocessing.Pool()
        conn = sqlite3.connect(DBFILE)
        cursor = conn.cursor()
        cursor.execute(bozorth3.sql_stmt_create_table())

        dataprefix = prepare_dataset(prefix=prefix, skip=True)

        print('Loading images')
        paths = locate_paths(md5listpath, dataprefix)
        images = locate_images(paths)
        mindtcts = pool.map(mindtct_from_image, images)
        mindtcts = list(mindtcts)

        print('Generating samples')
        probes = random.sample(mindtcts, int(perc_probe * len(mindtcts)))
        gallery = random.sample(mindtcts, int(perc_gallery * len(mindtcts)))

        inputs = [bozorth3_input(probe=probe, gallery=gallery)
                  for probe in probes]

        print('Matching')
        bozorth3s = pool.map(run_bozorth3, inputs)

        for group in bozorth3s:
            vals = map(bozorth3.sql_insert_values, group)
            cursor.executemany(bozorth3.sql_prepared_stmt_insert(), vals)
            conn.commit()
            map(print, group)

        conn.close()

        plot(DBFILE, nprobes=5, outfile=PLOTFILE)

Running
-------

You can run the code like so:

.. code:: bash

    time python python_lesson1.py \
        python_lesson1 \
        NISTSpecialDatabase4GrayScaleImagesofFIGS/sd04/sd04_md5.lst \
        0.001 \
        0.1

This will result in a figure like the following:

.. figure:: ./python_lesson1/plot.png
   :alt: pyl1

   Fingerprint match scores
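Since all of the scores end up in the database, you can also inspect
them after the fact without rerunning the pipeline. This is a quick
inspection sketch, assuming the default paths from the run above:

.. code:: python

    import sqlite3

    # Print the ten best matches recorded in the scores database
    conn = sqlite3.connect('python_lesson1/scores.db')
    for row in conn.execute('SELECT probe, gallery, score FROM bozorth3 '
                            'ORDER BY score DESC LIMIT 10'):
        print(row)
    conn.close()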