Overview

This page may be updated throughout Fall 2016; we recommend reviewing it weekly for changes.

About the Course

The Big Data Applications and Analytics course is an overview course in Data Science and covers the applications and technologies (data analytics and clouds) needed to process the application data. It is organized around the rallying cry: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics.

Course Numbers

This course is offered for graduate and undergraduate students at Indiana University and as an online course. To register for university credit, please go to:

Please select the course that is most suitable for your program:

  • INFO-I 423 - BIG DATA APPLS & ANALYTICS
    • 34954 online undergraduate students
    • 34955 Discussion Friday 9:30 - 10:45AM Informatics East (I2) 150
  • INFO-I 523 - BIG DATA APPLS & ANALYTICS
    • 32863 online graduate students
    • 32864 Discussion Friday 9:30 - 10:45AM Informatics East (I2) 150
    • 32866 Data Science majors only
  • ENGR-E 599 - TOPICS IN INTELLIGENT SYSTEMS ENGINEERING
    • 36362 online graduate engineering students
    • 36363 Discussion Friday 9:30 - 10:45AM Informatics East (I2) 150

Warning

Please note that all discussion sections for residential students have been merged to:

  • Friday 9:30 - 10:45AM Informatics East (I2) 150

Please ignore postings in CANVAS and the REGISTRAR about this.

From Registrar (however with updated meeting times and location):

INFO-I 523  BIG DATA APPLS & ANALYTICS (3 CR)
     CLSD *****          ARR             ARR    ARR       Von Laszewski G          50    0    2
             Above class open to graduates only
             Above class taught online
             Discussion (DIS)
     CLSD 32864          09:30A-10:45A   F      I2 150    Von Laszewski G          50    0    2
             Above class meets with INFO-I 423

INFO-I 523  BIG DATA APPLS & ANALYTICS (3 CR)
    I 523 : P - Data Science majors only
          32866 RSTR     ARR             ARR    ARR       Von Laszewski G          90   72    0
             This is a 100% online class taught by IU Bloomington. No
             on-campus class meetings are required. A distance education
             fee may apply; check your campus bursar website for more
             information
             Above class for students not in residence on the Bloomington
             campus

INFO-I 423  BIG DATA APPLS & ANALYTICS (3 CR)
     CLSD ***** RSTR     ARR             ARR    ARR       Von Laszewski G          10    0    6
             Above class open to undergraduates only
             Above class taught online
             Discussion (DIS)
     CLSD 34955 RSTR     09:30A-10:45A   F      I2 150    Von Laszewski G          10    0    6
             Above class meets with INFO-I 523

ENGR-E 599  TOPICS IN INTELL SYS ENGINEER (3 CR)
       VT: BG DATA APPLICATNS & ANLYTCS ISE
          ***** RSTR     ARR             ARR    ARR       Von Laszewski G          25   25    0
             Above class open to graduate engineering students only
             Above class taught online
             Discussion (DIS)
       VT: BG DATA APPLICATNS & ANLYTCS ISE
          36363 RSTR     01:00P-02:15P   F      HD TBA    Von Laszewski G          25   25    0
              Above class meets with INFO-I 523

Meeting Times

The classes are published online. Residential students at Indiana University will participate in a discussion taking place at the following time:

  • Fridays 09:30am - 10:45am EST, I2 150

For the 100% online students see the office hours.

Office Hours

Office hours will be held every week:

  • Tue 10-11am EST, typically Gregor
  • Thu 6-7pm EST, typically Gregor
  • Sun 4-6pm EST, either Jerome or Prashanth
  • Tue 7-8pm, either Jerome or Prashanth
  • Wed 7-8pm, either Jerome or Prashanth

These are live sessions that will allow you to interact in a group or one-on-one with either an instructor or a TA. Office hours sessions may be recorded. All important FAQs will be posted either on the Web page or in Piazza ASAP. During these times, we can be reached via Zoom with the following information for the call:

Join from PC, Mac, Linux, iOS or Android:

Or Telephone:

  • Or a H.323/SIP room system:
    • H.323: 162.255.37.11 (US West) or 162.255.36.11 (US East)
    • Meeting ID: 195 576 919
    • SIP: 195576919@zoomcrc.com

Please use a headset with a microphone to improve sound quality.

Discussions and communication

Online discussions and communication will be conducted in piazza at the following URL:

https://piazza.com/iu/fall2016/infoi523/home

Discussions are conducted in clearly marked folders/topics. For example, “Discussion d1” will be conducted in the piazza folder “d1”. Students are responsible for posting their content to the right folder. No credit will be given if a post has been filed in the wrong folder.

Please note that communications to instructors can be seen by all instructors. In matters that are sensitive, please use gvonlasz@indiana.edu. Please never share your university ID number, your social security number, or any other sensitive information with us, either in e-mail or in the discussion lists.

Calendar

All sessions refer to Sections, Discussions and Units

  • This document supersedes any assignment dates and comments regarding assignments made in videos or stated elsewhere
  • Official and additional announcements will be sent via CANVAS
  • All lectures are assigned on Fridays
  • All discussions and homework are due 3 weeks after the assignment plus the following weekend, that is, on Monday morning, unless specified otherwise. Precise dates will be published in CANVAS
  • Note that the calendar and content may change
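
The default due-date rule above can be sketched in a few lines of Python. This is only an illustration under one reading of the rule: three weeks are added to the assignment date, and the due date is the first Monday strictly after that day. The function name default_due_date is hypothetical, and the dates published in CANVAS always take precedence.

```python
from datetime import date, timedelta


def default_due_date(assigned: date) -> date:
    """Return the Monday following the weekend that comes three weeks after `assigned`.

    Assumption: "3 weeks after the assignment plus the following weekend"
    means the first Monday strictly after (assigned + 21 days).
    """
    after_three_weeks = assigned + timedelta(weeks=3)
    # date.weekday() returns 0 for Monday through 6 for Sunday, so this is
    # the number of days to the next Monday (7 if the day is itself a Monday).
    days_to_next_monday = 7 - after_three_weeks.weekday()
    return after_three_weeks + timedelta(days=days_to_next_monday)


# Example: a lecture assigned on Friday, 09/02/2016.
print(default_due_date(date(2016, 9, 2)))  # 2016-09-26
```

Under this reading, a lecture assigned on Friday 09/02/2016 would be due on Monday 09/26/2016.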

Assigned                Wk  Week         Descriptions
08/22/2016              1   W1
08/26/2016              2   W2
09/02/2016              3   W3
09/05/2016              3   Holiday      Labor Day
09/09/2016              4   W4
09/16/2016              5   W5           Work on Project; Learn enough Python (2)
09/23/2016              6   W6           Work on Project; Learn enough Python (2)
09/30/2016              7   W7           Project or paper proposal due!
10/07/2016              7   No Lectures  No Lectures (1)
10/08/2016              7   No Lectures  No Lectures (1)
10/09/2016              7   No Lectures  No Lectures (1)
10/07/2016              8   W8
10/14/2016              9   W9           Programming prg1: Python recommended due date for students with projects; Work on Project
10/21/2016              10  W10
10/28/2016              11  W11
11/04/2016              12  W12
11/09/2016, 11/11/2016  13  W12, W13     with term paper (due Dec 9); Work on Project
11/20/2016              14  No Lectures  Thanksgiving break Starts (1)
11/27/2016              14  No Lectures  Thanksgiving break Ends (1)
12/02/2016              15  Due Date     Due Date for papers and projects; Work on Project
12/09/2016              15  Due Date     PRG-GEO: Geolocation students with term paper
12/12/2016              16  Last Class   Last chance overdue homework due; Improve Project (5)
12/16/2016              17  Last Day     End Date of Semester

  • ( 1 ) Use lecture free time wisely
  • ( 2 ) Improve your python knowledge while you do your project
  • ( 3 ) If you cannot do PRG by 10/14 or have difficulties with it, we recommend that you do a paper
  • ( 4 ) We will not do PRG2 and PRG3 in this class
  • ( 5 ) If you have homework that is late past Dec 2nd, you run the risk of obtaining an incomplete in the class, as grading may need time and will be conducted in January.
  • ( 6 ) Paper p11 has been canceled so you can focus on your project

The following sections will be replaced:

Common Mistakes

  • Starting the Project late.
  • Not using gitlab for homework submission
  • Not using the 2 column ACM report template
  • Not using jabref or endnote for References
  • Not understanding plagiarism
  • Being in a team where one team member does not perform
  • Violating university policy by doing another student's work
  • Not checking in to gitlab frequently and pushing the commits

Systems Usage

Projects may be executed on your local computer, a cloud or other resources you may have access to. This may include:

  • chameleoncloud.org
  • futuresystems.org
  • AWS (you will be responsible for charges)
  • Azure (you will be responsible for charges)
  • virtualbox if you have a powerful computer and like to prototype
  • other clouds

Term Paper or Project

You can choose to write a term paper or do a software project. This will constitute 50% of your class grade.

If you choose a project, your maximum grade could be an A+. However, an A+ project must be truly outstanding and include an exceptional project report. Such a project and report should be of a quality that could be published in a conference.

If you choose a term paper, your maximum grade for the entire class will be an A-.

Please note that a project also includes writing a project report/paper. However, the required length is shorter than for a term paper.

Software Project

In case of a software project, we encourage a group project with up to three members. You can use the discussion forum in the folder project to form project teams, or just communicate privately with other class members to form a team. The following artifacts are part of the deliverables for a project:

Code:
You must deliver the code in gitlab. The code must be compilable, and a TA may try to run your code to replicate your results. You MUST avoid lengthy install descriptions, and everything must be installable from the command line. We will check submissions. Each team member must be responsible for one part of the project.
Project Report:

A report must be produced while using the format discussed in the Report Format section. The following length is required:

  • 4 pages, one student in the project
  • 6 pages, two students in the project
  • 8 pages, three students in the project
Work Breakdown:

This document is only needed for team projects. It is a one page PDF document describing who did what. It includes pointers to the git history and to statistics demonstrating that more than one student has worked on the project.

In addition, the graders will go into gitlab, which provides a history of checkins, to verify that each team member has used gitlab to check in their contributions frequently. For example, if we find that one of the students has not checked in code or documentation at all, it will be questioned.

License:
All projects are developed under an open source license such as the Apache 2.0 License, or similar. You will be required to add a LICENCE.txt file and, if you use other software, identify how it can be reused in your project. If your project uses packages with different licenses, please list in a README.rst file which packages are used and which license each of them has.
Additional links:

Term Paper

Teams:
Up to three people. You can use the discussion forum in the folder term-project to build teams.
Term Report:

If you choose the term paper, you or your team will pick a topic relevant to the class and write a high quality scholarly paper about it. The following artifacts are part of the deliverables for a term paper. A report must be produced using the format discussed in the Report Format section. The following length is required:

  • 6 pages, one student in the project
  • 9 pages, two students in the project
  • 12 pages, three students in the project
Work Breakdown:
This document is only needed for team projects. A one page PDF document describing who did what.
Grading:
As stated above, the maximum grade for the entire class will be A- if you deliver a very good paper. However, exceptional term papers are possible and could result in higher grades. They must contain significant contributions and novel ideas so that the paper could be published in a conference or journal. A comprehensive survey would be an example. The page limit will most likely be exceeded by such work. The number of pages does not reflect quality. References must be outstanding.
Additional links:

Report Format

All reports must use the ACM proceedings format. The MSWord template can be found here:

A LaTeX version can be found at

however you have to remove the ACM copyright notice in the LaTeX version.

There will be NO EXCEPTION to this format. If you are in a team, you can either use gitlab to collaboratively develop the LaTeX document or use Microsoft OneDrive, which offers collaborative editing features. All bibliographical entries must be put into a bibliography manager such as jabref, endnote, or Mendeley. This helps ensure that you follow proper citation styles. You can use either ACM or IEEE reference styles. Your final submission will include the bibliography file as a separate document.

Documents that do not follow the ACM format and are not accompanied by references managed with jabref or endnote or are not spell checked will be returned without review.

Report Checklist:

  • [ ] Have you written the report in Word or LaTeX in the specified format?
  • [ ] In case of LaTeX, have you removed the ACM copyright information?
  • [ ] Have you included the report in gitlab?
  • [ ] Have you specified the names and e-mails of all team members in your report (e.g. the username in Canvas)?
  • [ ] Have you included all images in native and PDF format in gitlab in the images folder?
  • [ ] Have you added the bibliography file (such as an endnote or bibtex file, e.g. from jabref) in a directory bib?
  • [ ] Have you submitted an additional page that describes who did what in the project or report?
  • [ ] Have you spellchecked the paper?
  • [ ] Have you made sure you do not plagiarize?
  • [ ] Have you structured your directory as given in the sample at https://gitlab.com/cloudmesh/project-000/tree/master
  • [ ] Have you followed the guidelines given in report/README.rst of project-000, do you have a report/report.pdf, and have you submitted all images in an image folder?

Code Repositories Deliverables

Code repositories are for code. If your project needs additional libraries, you need to develop a script or use a DevOps framework to install such software. Thus zip files and .class or .o files are not permissible in the project. Each project must be reproducible with a simple script. An example is:

git clone ....
make install
make run
make view

This example would use a simple makefile to install, run, and view the results. Naturally, you can also use ansible or shell scripts. It is not permissible to use GUI based, preinstalled DevOps frameworks. Everything must be installable from the command line.
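
Before pushing to gitlab, it may help to check your working copy for the file types that are not permissible. The following is a minimal sketch and not an official course tool; the function name find_forbidden_files and the decision to skip the .git directory are assumptions made for illustration.

```python
from pathlib import Path

# File types the section above lists as not permissible in a project.
FORBIDDEN_SUFFIXES = {".zip", ".class", ".o"}


def find_forbidden_files(repo_root):
    """Return sorted relative paths of files whose suffix is not permissible.

    Assumption: anything inside the .git directory is ignored, because it
    holds git's own bookkeeping rather than project content.
    """
    root = Path(repo_root)
    offenders = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue
        if path.is_file() and path.suffix.lower() in FORBIDDEN_SUFFIXES:
            offenders.append(relative.as_posix())
    return sorted(offenders)
```

Calling find_forbidden_files(".") from the top of a checkout returns an empty list when none of these file types are present.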

Prerequisites

Python or Java experience is expected. The programming load is modest.

If you elect a programming project, we will assume that you are familiar with the programming languages required as part of the project you suggest. We will limit the languages to Python, plus JavaScript if you would like to do interactive visualization. If you do not know the required technologies, we will expect you to learn them outside of class. For example, Python has a reputation for being easy to learn, and those with a strong programming background in another general-purpose programming language (like C/C++, Java, Ruby, etc.) can learn it within a few hours to days, depending on experience level. Please consult the instructor if you have concerns about your programming background. In addition, we may encounter math of various kinds, including linear algebra, probability theory, and basic calculus. We expect that you know them on an elementary level. Students with limited math backgrounds may need to do additional reading outside of class.

If you are interested in further development of cloudmesh for big data, strong Python or JavaScript experience is needed.

You will also need a sufficiently modern and powerful computer to do the class work. If you expect to do the course only on your cell phone, iPad, or Windows 98 computer, this will not work. We recommend that you have a relatively new and updated computer with sufficient memory. In some cases it is easier not to use Windows and, for example, to use Linux via virtualbox, so your machine should have sufficient memory to run it comfortably. If you do not have such a machine, we are at this time trying to get virtual machines that you can use on our cloud. However, the runtime of these VMs is limited to 6 hours, and they will be terminated after that. Naturally, you can start new VMs. This is done in order to avoid resource “hogging” by idle VMs. In contrast to AWS, you are not paying for our VMs, so we enforce this rule to encourage proper community spirit and to avoid occupying resources that could be used by others. You can also use AWS or other clouds where you can run virtual machines, but in that case you need to pay for the usage yourself.

Please remember that this course does not have a required textbook, and the money you save on this can be used to buy a new computer or upgrade your current one if needed.

Learning Outcomes

Students will gain a broad understanding of Big Data application areas and the approaches used. This course is good preparation for any student likely to be involved with Big Data in the future.

Grading

Homework submitted by the due date will be graded within a week of the due date. Homework submitted after the due date will be graded within 2-3 weeks of submission and will receive a 10% grade reduction. Some homework cannot be delivered late; it will be clearly marked, and 0 points will be given if it is late. Such homework is mostly related to setting up your account and communicating your account names to us.

It is the student’s responsibility to upload submissions well ahead of the deadline to avoid last minute problems with network connectivity, browser crashes, cloud issues, etc. It is a very good idea to make early submissions and then upload updates as the deadline approaches; we will grade the last submission received before the deadline.

Note that the paper or project will take a considerable amount of time, and proper time management is a must for this class. Avoid starting your project late. Procrastination does not pay off. Late projects or term papers will receive a 10% grade reduction.

  • 40% Homework
  • 50% Term Paper or Project
  • 10% Participation/Discussion

Details about the assignments can be found in the Homework section.
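
As a worked example of the weighting above, the following sketch combines the three components into one course percentage. It is an illustration only: the names are hypothetical, scores are assumed to be percentages from 0 to 100, and the 10% late reduction is assumed to be applied multiplicatively to the score of the late item, which this page does not specify. The mapping from percentages to letter grades is not stated here and is therefore not modeled.

```python
# Weights as listed in the Grading section.
WEIGHTS = {"homework": 0.40, "project": 0.50, "participation": 0.10}

# Assumption: the 10% reduction scales the score of the late item.
LATE_PENALTY = 0.10


def apply_late_penalty(score, late):
    """Reduce a 0-100 score by 10% if the item was submitted late."""
    return score * (1 - LATE_PENALTY) if late else score


def course_score(homework, project, participation, project_late=False):
    """Combine 0-100 component scores into a weighted course percentage."""
    project = apply_late_penalty(project, project_late)
    return (
        WEIGHTS["homework"] * homework
        + WEIGHTS["project"] * project
        + WEIGHTS["participation"] * participation
    )


on_time = course_score(homework=90, project=80, participation=100)
late = course_score(homework=90, project=80, participation=100, project_late=True)
print(round(on_time, 2), round(late, 2))
```

With these example scores, the on-time case works out to 36 + 40 + 10 = 86, and a late project lowers the project contribution from 40 to 36, for a total of 82.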

Academic Integrity Policy

We take academic integrity very seriously. You are required to abide by the Indiana University policy on academic integrity, as described in the Code of Student Rights, Responsibilities, and Conduct, as well as the Computer Science Statement on Academic Integrity (http://www.soic.indiana.edu/doc/graduate/graduate-forms/Academic-Integrity-Guideline-FINAL-2015.pdf). It is your responsibility to understand these policies. Briefly summarized, the work you submit for course assignments, projects, quizzes, and exams must be your own or that of your group, if group work is permitted. You may use the ideas of others but you must give proper credit. You may discuss assignments with other students but you must acknowledge them in the reference section according to scholarly citation rules. Please also make sure that you know how to not plagiarize text from other sources while reviewing citation rules.

We will respond to acts of plagiarism and academic misconduct according to university policy. Sanctions typically involve a grade of 0 for the assignment in question and/or a grade of F in the course. In addition, University policy requires us to report the incident to the Dean of Students, who may apply additional sanctions, including expulsion from the university.

Students agree that by taking this course, papers and source code submitted to us may be subject to textual similarity review, for example by Turnitin.com. These submissions may be included as source documents in reference databases for the purpose of detecting plagiarism of such papers or codes.

Instructors

The course presents lectures in online form given by the instructors listed below. Many others have helped make this material available and may not be listed here.

For this class, support is provided by:

  • Gregor von Laszewski (PhD)
  • Badi’ Abdul-Wahid (PhD)
  • Jerome Mitchell (Teaching Assistant)
  • Prashanth Balasubramani (Teaching Assistant)
  • Hyungro Lee (Teaching Assistant)

Dr. Gregor von Laszewski

Gregor von Laszewski is an Assistant Director of Cloud Computing in the DSC. He held a position at Argonne National Laboratory from Nov. 1996 to Aug. 2009, where he was last a scientist and a fellow of the Computation Institute at the University of Chicago. During the last two years of that appointment he was on sabbatical and held a position as Associate Professor and Director of a lab at Rochester Institute of Technology focusing on Cyberinfrastructure. He received a Master's degree in 1990 from the University of Bonn, Germany, and a Ph.D. in computer science in 1996 from Syracuse University. He has been involved in Grid computing since the term was coined. He was the lead of the Java Commodity Grid Kit (http://www.cogkit.org), which to this day provides a basis for many Grid related projects, including the Globus toolkit. His current research interests are in the areas of Cloud computing. He is leading the effort to develop a simple IaaS client, available as an open source project at http://cloudmesh.github.io/client/

His Web page is located at http://gregor.cyberaide.org. To contact him, please send mail to laszewski@gmail.com. For class related e-mail, please use Piazza for this class.

In his free time he teaches Lego Robotics to high school students. In 2015 the team won the 2nd prize in programming design in Indiana. If you would like to volunteer to help with this effort, please contact him.

He also offers the opportunity to work with him on interesting independent studies. Current topics include but are not limited to:

  • cloudmesh
  • big data benchmarking
  • scientific impact of supercomputers and data centers
  • STEM and other educational activities while using robotics or big data

Please contact him if you are interested in this.

Dr. Geoffrey Fox

Fox received a Ph.D. in Theoretical Physics from Cambridge University and is now a distinguished professor of Informatics and Computing, and Physics at Indiana University, where he is director of the Digital Science Center, Chair of the Department of Intelligent Systems Engineering, and Director of the Data Science program at the School of Informatics and Computing. He previously held positions at Caltech, Syracuse University and Florida State University after being a postdoc at the Institute of Advanced Study at Princeton, Lawrence Berkeley Laboratory and Peterhouse College Cambridge. He has supervised the PhDs of 68 students and published around 1200 papers in physics and computer science with an index of 70 and over 26000 citations. He currently works on applying computer science, from infrastructure to analytics, in Biology, Pathology, Sensor Clouds, Earthquake and Ice-sheet Science, Image processing, Deep Learning, Manufacturing, Network Science and Particle Physics. The infrastructure work is built around Software Defined Systems on Clouds and Clusters. The analytics focuses on scalable parallelism.

He is involved in several projects to enhance the capabilities of Minority Serving Institutions. He has experience in online education and its use in MOOCs for areas like Data and Computational Science. He is a Fellow of APS (Physics) and ACM (Computing).

Dr. Badi’ Abdul-Wahid

Badi’ received a Ph.D. in Computer Science at the University of Notre Dame under Professor Jesus Izaguirre. The primary focus of his graduate work was the development of scalable, fault-tolerant, elastic distributed applications for running Molecular Dynamics simulations.

At Indiana University, Badi’ works with the FutureSystems project on a NIST-funded study whose goal is to understand patterns in the development and usage of Big Data Analysis pipelines.

Teaching Assistants

Hyungro Lee

Hyungro Lee is a PhD candidate in Computer Science at Indiana University working with Dr. Geoffrey C. Fox. Prior to beginning the PhD program, Hyungro worked as a software engineer in the Cyworld Group (a social networking platform in South Korea) at SK Communications, developing communications platforms including email, text and messaging services at large scale to support over 40 million users. From this work he developed an interest in how distributed systems achieve scalability and high availability while managing resources efficiently. He is currently working on the FutureSystems project to support Big Data Analysis Software Stacks in Virtual Clusters. He also worked on the FutureGrid project, an NSF funded experimental computing grid and cloud test-bed for the research community, where he also provided user support. His research interests are parallel and distributed systems, and cloud computing.

Jerome Mitchell

Jerome Mitchell is a Ph.D. candidate in computer science at Indiana University and is interested in coupling the fields of computer and polar science. He has participated in the United States Antarctic Program (USAP), where he collaborated with a multidisciplinary team of engineers and scientists to design a mobile robot for harsh polar environments to autonomously collect ice sheet data, decrease the human footprint of polar expeditions, and enhance measurement precision. His current work includes using machine learning techniques to help polar scientists identify bedrock and internal layers in radar imagery. He has also been involved in facilitating workshops to educate faculty and students on the importance of parallel and distributed computing at minority-serving institutions.

Prashanth Balasubramani

Prashanth Balasubramani is an MS student in Computer Science at Indiana University working with Gregor von Laszewski, Assistant Director of Cloud Computing at DSC. He has been working under Gregor von Laszewski and Dr. Geoffrey Fox for the past year as an Associate Instructor for the Big Data Applications and Analytics course during the Fall 2015 and Spring 2016 semesters. Before joining Indiana University, he worked as an ETL developer for the Capital One banking firm (Wipro Technologies, Bangalore), developing Hadoop MR and Spark jobs for real time migration of historical data into virtual clusters on the Cloud. He is currently working as a Teaching Assistant for the Big Data Applications and Analytics course for the Fall 2016 semester. He is also working on a NIST benchmarking project for recording benchmarks on different cloud platforms. His research interests include Big Data applications, Cloud computing and Data Warehousing.

Updates

This page is conveniently managed with git. The location for the changes can be found at

The repository is at

Issues can be submitted at

Better yet, use Piazza so that you notify us in our discussion lists. If you detect errors, you could also create a merge request at