.. _projects:

Software Projects
=================

.. sidebar:: Contents

   .. contents::
      :local:

Please read the information in the overview page at

*
http://bdaafall2016.readthedocs.io/en/latest/overview.html#software-project

After doing so please return to this page. Identify a project suitable
for this class, propose it and work on it.

There are several categories of software projects, which are detailed in
lower sections:

#. Deployment
#. Analytics

You may propose a project in one of these categories, if you are doing
a software projects.

.. warning::

   These are non-trivial project and involve substantial work.  Many
   students vastly underestimate the difficulty and the amount of time
   required. This is the reason why the project assignment is early on
   in the semester so you have ample time to propose and work on
   it. If you start the project 2 weeks before December (Note the
   early due data) We assume you may not finish.


Common Requirements
-------------------

All software projects must:

#. Be submitted via gitlab (a repository will be created for you)
#. Be reproducibly deployed

   Assume you are given a username and a set of IP addresses.  From
   this starting point, you should be able to deploy everything in a
   single command line invocation.

   .. warning::

      Do not assume that the username or IP address will be the ones
      you use during development and testing.

#. Provide a report in the ``report`` directory

   LaTeX or Word may be used. Include the original sources as well as a PDF called ``report.pdf``
   (See :ref:`overview-software-project` for additional details on the
   report format. You will be using 2 column ACM format we have used before.)

#. Provide a properly formatted ``README.rst`` in the root directory

   The README should have the following sections:

   - Authors: list the authors
   - Project Type: one of "Deployment", "Analytics"
   - Problem: describe the task and/or problem
   - Requirements: describe your assumptions and requirements for deployment/running.
     This should include any software requirements with a link to their webpage.
     Also indicate which versions you have developed/tested with.

   - Running: describe the steps needed to deploy and run
   - Acknowledgements: provide proper attribution to any websites, or
     code you may have used or adapted

   .. warning:: in the past we got projects that had 10 pages
		installation instructions. Certainly that is not good
		and you will get point deductions. The installation
		should be possible in a couple of lines. A nice
		example is the installation of the development software
		in the ubuntu vm. Naturally you can use other
		technologies, other than ansible. Shell scrips,
		makefiles, python scripts are all acceptable.
     
#. A ``LICENSE`` file (this should be the ``LICENSE`` for Apache License Version 2.0)
#. All figures should include labels with the following format: ``label (units)``.

   For example:

   - ``distance (meters)``
   - ``volume (liters)``
   - ``cost (USD)``

#. All figures should have a caption describing what the measurement
   is, and a summary of the conclusions drawn.

   For example:

     This shows how A changes with regards to B, indicating that under
     conditions X, Y, Z, Alpha is 42 times better than otherwise.

Deployment Projects
-------------------

Deployment projects focuses on automated software deployments on
multiple nodes using automation tools such as Ansible, Chef, Puppet,
Salt, or Juju. You are also allowed to use shell scripts, pdsh,
vagrant, or fabric. For example, you could work on deploying Hadoop to
a cluster of several machines. Use of Ansible is recommended and
supported. Other tools such as Chef, Puppet, etc, will not be
supported.

Note that it is not sufficient to merely deploy the software on the
cluster. You must also demonstrate the use of the cluster by running
some program on it and show the utilization of your entire cluster.
You should also benchmark the deployment and running of your
demonstration on several sizes of a cluster (eg 1, 3, 6, 10 nodes)
(Note that these numbers are for example only).

We expect to see figures showing times for each (deployment, running)
pair on for each cluster size, with error bars.  This means that you
need to run each benchmark multiple times (at least three times) in
order to get the error bars. You should also demonstrate cluster
utilization for each cluster size.

The program used for demonstration can be simple and straightforward.
This is not the focus of this type of project.

IaaS
----

It is allowable to use

* virtualbox
* chameleon cloud
* futuresystems
* AWS (your own cost)
* Azure (your own cost)

for your projects. Note that on powerful desktop machines even
virtualbox can run multiple vms.  Use of docker is allowed, but you
must make sure to use docker properly. In the past we had students
that used docker but did not use it in the way it was designed
for. Use of docker swarm is allowed.
  
Requirements
~~~~~~~~~~~~

.. todo:: list requirements as differing from "Common Requirements"


Example projects
~~~~~~~~~~~~~~~~

- deploy Apache Spark on top of Hadoop
- deploy Apache Pig on top of Hadoop
- deploy Apache Storm
- deploy Apache Flink
- deploy a Tensorflow cluster
- deploy a PostgreSQL cluster
- deploy a MongoDB cluster
- deploy a CouchDB cluster
- deploy a Memcached cluster
- deploy a MySQL cluster
- deploy a Redis cluster
- deploy a Mesos cluster
- deploy a Hadoop cluster
- deploy a docker swarm cluster
- deploy NIST Fingerprint Matching
- deploy NIST Human Detection and Face Detection
- deploy NIST Live Twitter Analysis
- deploy NIST Big Data Analytics for Healthcare Data and Health Informatics
- deploy NIST Data Warehousing and Data mining

Deployment projects must have EASY installation setup just as we
demonstrated in the ubuntu image.

A command to manage the deployment must be written using python
docopts that than starts your deployment and allows management of it.
You can than from within this command call whatever other framework
you use to manage it. The docopts manual page should be designed first
and discussed in the team for completeness.

Using argparse and other python commandline interface environments is
not allowed.

Deployment project will not only deply the farmewor, but either
provide a sophisticated benchmark while doing a simple analysis using
the deployed software.


Analytics Projects
------------------

Analytics projects focus on data exploration.  For this type of
projects, you should focus on analysis of a dataset (see
:doc:`datasets` for starting points).  The key here is to take a
dataset and extract some meaningful information from in using tools
such as ``scikit-learn``, ``mllib``, or others.  You should be able to
provide graphs, descriptions for your graphs, and argue for
conclusions drawn from your analysis.

Your deployment should handle the process of downloading and
installing the required datasets and pushing the analysis code to the
remote node.  You should provide instructions on how to run and
interpret your analysis code in your README.


Requirements
~~~~~~~~~~~~

.. todo:: list requirements as differing from "Common Requirements"


Example projects
~~~~~~~~~~~~~~~~

- analysis of US Census data
- analysis of Uber ride sharing GPS data
- analysis of Health Care data
- analysis of images for Human Face detection
- analysis of streaming Twitter data
- analysis of airline prices, flights, etc
- analysis of network graphs (social networks, disease networks, protein networks, etc)
- analysis of music files for recommender engines
- analysis of NIST Fingerprint Matching
- analysis of NIST Human Detection and Face Detection
- analysis of NIST Live Twitter Analysis
- analysis of NIST Big Data Analytics for Healthcare Data and Health Informatics
- analysis of NIST Data Warehousing and Data mining
- author disambiguity problem in academic papers
- application of a k-means algorithm
- application of a MDS 


Project Idea: World wide road kill
-------------------------------------

This project can also be executed as bonus project to gather
information about the feasability of existing databases.
 
It would be important to identify also how to potentially merge these
databases into a single world map and derive statistics from
them. This project can be done on your local machines. Not more than 6
people can work on this.
 
Identify someone that has experience with android and/or iphone
programming Design an application that preferably works on iphone and
android that allows a user while driving to
 
* call a number to report roadkill via voice and submitting the gps coordinates
*  have a button on the phone that allows the gps coordinates to be collected and allow upload either live, or when the user presses another butten.
*  have provisions in the application that allow you to augment the data
*  have an html page that displays the data 
*  test it out within users of this class (remember we have world wide audience) 
Make sure the app is ready early so others can test and use it and you can collect data.
 
Before starting the project identify if such an application already exists.
 
If more than 6 people sign up we may build a second group doing something similar, maybe potholes ..

Gregor would like to get this project or at least the database search
query staffed.


Project Idea: Author disambiguty problem
----------------------------------------------------------------------

Given millions of publications how do we identify if an author of
paper a with the name Will Smith is the sam as the author of paper 2
with the name Will Smith, or William Smith, or W. Smith. AUthor
databases are either provided in bibtex format, or a database that can
not be shared outside of this class. YOu may have to add additional
information from IEEE explorer, rsearch gate, ISI, or other online databases.

Identify further issues and discuss solutions to them. Example, an
author name changes, the author changes the institution. 

Do a comprehensive literature review

Some ideas:

* Develop a graph view application in JS that showcases
dependencies between coauthors, institutions 

* Derive probabilities for the publications written by an auther given
  they are the same

* Utilize dependency graphs as given by online databases

* Utilize the and or topic/abstarct/full text to identify similarity

* Utilize keywords in the title

* Utilize refernces of the paper

* Prepare some vizualization of your result

* Prepare som interactive vizualization

A possible good start is a previous project published at

* https://github.com/scienceimpact/bibliometric

There are also some screenshots available:

*
https://github.com/scienceimpact/bibliometric/blob/master/Project%20Screenshots/Relationship_Authors_Publications.PNG

*
https://github.com/scienceimpact/bibliometric/blob/master/Project%20Screenshots/Relationship_Authors_Publications2_Clusters.PNG

  
..
   .. _sampleprojects:

   Sample Project suggestions
   ===========================


   Example Projects
   ------------------

   These are projects that will be supported on FutureSystems resources.
   Certain projects, such as NIST Fingerprint, may be accomplished by
   running a subset of 1 or more of the software packages.


   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | **Title**                                             | **Data set**                   | **Software**                                          |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | **Category: Batch Data Analytics**                  |                                |                                                       |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | NIST_Fingerprint_ (a subset of):                    | | NISTDatabase27A_ [4GB]       | | NISTBiometric_                                      |
   | | NFIQ                                                |                                | | Image Software (NBIS) v5.0 Userguide_              |
   | | PCASYS                                              |                                | |                                                     |
   | | MINDTCT                                             |                                | |                                                     |
   | | BOZORTH3                                            |                                | |                                                     |
   | | NFSEG                                               |                                | |                                                     |
   | | SIVV                                                |                                | |                                                     |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | Hadoop Benchmark                                    |                                |                                                       |
   | | TeraSort Suite                                      | | Teragen                      | hadoop-examples.jar                                   |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | Hadoop Benchmark                                    |                                |                                                       |
   | | DFSIO (HDFS Performance)                            |                                | hadoop-mapreduce-client-jobclient                     |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | Hadoop Benchmark                                    |                                |                                                       |
   | | NNBench (NameNode Perf.)                            |                                | hadoop-mapreduce-client-jobclient                     |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | Hadoop Benchmark                                    |                                |                                                       |
   | | MRBench (MapReduce Perf.)                           |                                | src/test/org/apache/hadoop/mapred/MRBench.java        |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+
   | | Stock Data Analysis with MPI                        | | CRSP_ Stock Analysis         | | Streaming Data Analytics                            |
   | |                                                     | | e.g. Trading Symbol,         | |                                                     |
   | |                                                     | | Price                        | |                                                     |
   | |                                                     | | Number of Shares Outstanding | |                                                     |
   | |                                                     | | Factor to adjust price       | |                                                     |
   | |                                                     | | Factor to adjust shares      | |                                                     |
   +-------------------------------------------------------+--------------------------------+-------------------------------------------------------+

   Note: 
   * TeraSort: hadoop-examples.jar is included in hadoop package.

   * MRBench, NNBench, DFSIO: hadoop-mapreduce-client-jobclient-2.7.1.jar is included as well. If not, it can be downloaded directly from
     `*here* <https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-mapreduce-client-jobclient/2.7.1/hadoop-mapreduce-client-jobclient-2.7.1.jar>`__.

    Brief guidelines for these benchmark tools from last year:

   -  `TeraSort Hadoop
      Benchmark <http://bdaafall2015.readthedocs.io/en/latest/terasort.html#terasort>`__

   -  `DFSIO Distributed I/O
      Benchmark <http://bdaafall2015.readthedocs.io/en/latest/dfsio.html#dfsio>`__

   -  `MRBench MapReduce
      Benchmark <http://bdaafall2015.readthedocs.io/en/latest/mrbench.html#mrbench>`__

   `NNBench NameNode
   Benchmark <http://bdaafall2015.readthedocs.io/en/latest/nnbench.html#nnbench>`__


   .. _NISTFIngerprint: http://www.nist.gov/itl/iad/ig/nbis.cfm

   .. _NISTDataset27A: http://www.nist.gov/itl/iad/ig/sd27a.cfm

   .. _NISTBiometric: http://nigos.nist.gov:8080/nist/nbis/nbis_v5_0_0.zip

   .. _Userguide: https://soic.scholargrid.org/courses/course-v1:iudatascience+I523-I423-ENG599+FALL_2016/info

   .. _CRSP: https://wrds-web.wharton.upenn.edu/wrds/

   Other Possible Projects
   -----------------------

   These are projects for which there may be tentative, or no, direct
   support on FutureSystems resources.


   +--------------------------------------+------------------------------------------------+------------------+
   | **Title**                            | **Data set**                                   | **Software**     |
   +--------------------------------------+------------------------------------------------+------------------+
   | **Category: Batch Data Analytics**                                                                       |
   +--------------------------------------+------------------------------------------------+------------------+
   | Census                               | | Data1_ csv files downloadable                | | n/a            |
   |                                      | | click "Internet tables" to select subsets)   | |                |
   +--------------------------------------+------------------------------------------------+------------------+
   | Amazon Movie Reviews (1997-2012)     | Data3_ 3GB (compressed)                        |                  |
   +--------------------------------------+------------------------------------------------+------------------+
   | Medicare Part-B (2000-2013)          | Data4_ <30 MB, CSV ('00-'09), Excel ('10-'13)  | n/a              |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - sort                | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - wordcount           | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - terasort            | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - scan/join/aggregate | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - pagerank            | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - netchindexing       | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - bayes               | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - kmeans              | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | HiBench        - dfsio               | n/a                                            | HibenchSuite_    |
   +--------------------------------------+------------------------------------------------+------------------+
   | Movie Reviews using IPython          | Data from Rottentomatoes.com                   | IPython1_        |
   +--------------------------------------+------------------------------------------------+------------------+
   | Red Wine Quality using IPython       | REDWINE_                                       | IPython2_        |
   +--------------------------------------+------------------------------------------------+------------------+
   | Airline Delays with Hadoop           | AIRLINE                                        | IPython3_        |
   +--------------------------------------+------------------------------------------------+------------------+
   | BigBench                             | n/a                                            | BDBench_         |
   +--------------------------------------+------------------------------------------------+------------------+
   | Genome sequence data                 | .cfa sample data (unstructured)                | SANDDATA_        |
   +--------------------------------------+------------------------------------------------+------------------+
   | **Category: Streaming Data Analytics**                                                                   |
   +--------------------------------------+------------------------------------------------+------------------+
   | Face Detection                       | Data2_ images from INRIA dataset (< 1GB)       | OpenCV           |
   +--------------------------------------+------------------------------------------------+------------------+
   | Live Twitter Feed analysis           | Live Twitter feed                              |                  |
   +--------------------------------------+------------------------------------------------+------------------+
   | Drug-Drug interactions on Twitter    | Live Twitter Data                              | DRUG_            |
   +--------------------------------------+------------------------------------------------+------------------+


   .. _Data1: http://www.census.gov/population/www/cen2010/glance/

   .. _Data2: http://pascal.inrialpes.fr/data/human/

   .. _Data3: http://snap.stanford.edu/data/web-Movies.html

   .. _Data4: https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/Part-B-National-Summary-Data-File/Overview.html

   .. _HibenchSuite: https://github.com/intel-hadoop/HiBench

   .. _iPython1: http://nbviewer.ipython.org/github/cs109/content/blob/master/HW3_solutions.ipynb

   .. _iPython2: http://nbviewer.ipython.org/github/cs109/2014/blob/master/homework-solutions/HW5-solutions.ipynb

   .. _iPython3: http://nbviewer.ipython.org/github/ofermend/IPython-notebooks/blob/master/blog-part-1.ipynb

   .. _BDBench: https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench

   .. _DRUG:  https://github.com/cloud-class-projects/drug-drug-interaction

   .. _SAND: http://ccl.cse.nd.edu/software/sand/

   .. _SANDDATA: http://ccl.cse.nd.edu/software/sand/

   .. _REDWINE:  https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/

   .. _AIRLINE:  http://stat-computing.org/dataexpo/2009/the-data.html


   Your Own Projects
   -----------------

   You have an option to create your own project with your idea. You can
   use Python, Java, R, or other languages that you prefer. The size or the
   domain of your datasets is open as long as they can be handled and
   reproduced by course instructors.

   Non-Software Projects
   ---------------------

   If you have selected non-software projects, you or your team can develop
   your project without software development or applications.

   Use examples given below to choose a project. You can follow one of
   these examples or choose your own.


   * Survey HPC-ABDS; Several topics such as review level 17 (orchestration),
     Compare level 6 (DevOps) and level 15B (PaaS Frameworks) and level 17;
     KALEIDOSCOPE_

   * Review of Recommender Systems: Technology & Applications ; Define
     classification of information filtering system with current technologies
     and applications ; RECOMENDER_

   * Review of Big Data in Bioinformatics; Find current challenges and
     understand state of bioinformatics solutions for big data including
     analytics, security and privacy.

   * Review of Data visualization including high dimensional data; Explore
     data mining methods for knowledge discovery with data visualization
     tools e.g. D3.js, matplotlib

   * Design of a NoSQL database for a specialized application; Explore
     design of databases for big data including HBase, MongoDB, etc.

   .. _KALEIDOSCOPE: http://hpc-abds.org/kaleidoscope
   .. _RECOMENDER: http://bdaafall2015.readthedocs.org/en/latest/tp1-recommender.html#tp1-recommender


   NIST Examples
   ----------------------------------------------------

   -  **NIST**

      -  **NFIQ**: `NIST Fingerprint Image Quality (NFIQ) <http://biometrics.nist.gov/cs_links/standard/archived/workshops/workshop1/presentations/Tabassi-Image-Quality.pdf>`__,
             Tabassi, Elham,
             C. Wilson, and C. Watson. "Nist fingerprint image
             quality." NIST Res. Rep. NISTIR7151 (2004).
      -  **PCASYS**: `Fingerprint Pattern Classification <http://www.nist.gov/manuscript-publication-search.cfm?pub_id=900754>`__,
             Candela, G. T., et al. "PCASYS-A pattern-level classification automation system
             for fingerprints." *NIST technical report NISTIR* 5647 (1995).

      -  MINDTCT

      -  BOZORTH3

      -  NFSEG

      -  SIVV: `pdf <http://www.nist.gov/manuscript-publication-search.cfm?pub_id=903078>`__