IDS 572: Data Mining for Business

 

Fall 2006, Call 24370, Wednesday 3:00-5:30 PM, DH210

Yair M. Babad, UH 2403, Phone 312-996-8094, Cell 310-431-6729, Fax 312-413-0385

e-mail: ybabad@uic.edu, URL: http://www.uic.edu/~ybabad

Office Hours Wednesday 2:00-3:00 PM

 

Updated: 8/29/2006 20:09:12

 

 

COURSE OBJECTIVE & PHILOSOPHY

 

One of the most profound results of the information technology revolution is the explosion in data and information availability. Effective use of this information, including its operational use and its use for prediction, planning and control, is a critical need for many organizations. This course is devoted to discovering meaningful patterns in data so that the data can be used effectively in business intelligence. Data Mining (DM) is a user-centric, interactive process that leverages analytical and statistical technologies and computing power. It is widely used in business, e.g., for Customer Relationship Management (CRM), market research, and credit scoring, as well as for industrial quality assurance.

 

This course is not a course in statistics, nor is it a course in information technology. It is a survey course of the techniques and tools used to extract meaning and glean useful patterns from available data. The objective is to increase your awareness of these techniques and tools and make them an indispensable element of your professional “personality”.

 

 

TEXTBOOKS AND OTHER SUPPORT MATERIAL

 

Two Wiley books by Michael J. A. Berry and Gordon S. Linoff are the required texts for the course: Data Mining Techniques for Marketing, Sales and Customer Support, 2nd Edition (2004, ISBN 0-471-47064-3) [DMT], and Mining the Web: Transforming Customer Data into Customer Value (2001, ISBN 0-471-41609-6) [Web].

 

A recommended text is Data Mining: Concepts and Techniques, 2nd Edition by J. Han and M. Kamber (2006, Morgan Kaufmann Publishers, ISBN 1-55860-901-6) [Han].

 

To demonstrate some of the discussed techniques, we will use Clementine, a data mining package by SPSS. A personal copy, the “Graduate Pack”, is also available from SPSS for less than $200. It is not a required resource; a copy of Clementine is available either in the PC Lab or in UH 2401. The Clementine Users Guide, the Clementine Node Reference, and the related CRISP-DM manual are available on my website.

 

Other worthwhile resources, should you be interested, include:

 

For additional resources, look at the DMT web site at www.data-miners.com/companion, and at http://www.data-miners.com/resources/suggested.html.

 

My web page has PowerPoint presentations for all the material that I will introduce in class. These summarize the contents of the textbook, in addition to other material that will be discussed in class. You are advised to print these presentations (probably with 3 or 6 slides per page, framed, in black and white printing format) prior to class, so that you can use them in class in lieu of notes. You are responsible for knowing the contents of these presentations as well as the textbook’s material (and of course whatever is discussed in class).

 

 

COMMUNICATIONS & PREREQUISITES

 

I believe that open communications channels between all of us add significantly to the value of the class. You are welcome to contact me – preferably via e-mail. In particular, ALL questions and comments are welcome. All communications between us will use electronic mail. The assignments and other course materials can be printed out from the World Wide Web, at my URL given above.

 

All assignments and other submissions sent to me will have a filename in the format 572_AssignmentDescription_LastName_MMDDYY.extension, where “MMDDYY” is the submission date. Similarly, all e-mail messages to me should have as the subject line 572_LastName_SubjectDescription.
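For those who like to automate such things, the naming conventions above can be sketched in a few lines of Python. This is only an illustration, not a required tool: the function names, the sample last name "Smith", and the descriptions "RiskHomework" and "ProjectQuestion" are hypothetical, and the sketch assumes the description and last name contain no spaces.

```python
from datetime import date


def submission_filename(description: str, last_name: str,
                        submitted: date, extension: str) -> str:
    """Build 572_AssignmentDescription_LastName_MMDDYY.extension."""
    stamp = submitted.strftime("%m%d%y")  # MMDDYY, e.g. 103006
    # Accept the extension with or without a leading dot.
    return f"572_{description}_{last_name}_{stamp}.{extension.lstrip('.')}"


def email_subject(last_name: str, subject_description: str) -> str:
    """Build the subject line 572_LastName_SubjectDescription."""
    return f"572_{last_name}_{subject_description}"


if __name__ == "__main__":
    # Hypothetical student and assignment, for illustration only.
    print(submission_filename("RiskHomework", "Smith", date(2006, 10, 30), "doc"))
    print(email_subject("Smith", "ProjectQuestion"))
```

For a submission dated October 30, 2006, the first call yields 572_RiskHomework_Smith_103006.doc.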

 

The approach taken in this course is pragmatic, rather than theoretical or technical, with the objective of increasing your familiarity with the course topics on the one hand, and your critical understanding of the material on the other. I do not intend to "read the text in class". Rather, I will emphasize certain issues, and will respond to your questions. You must read on your own and be familiar IN ADVANCE OF EACH CLASS with the assigned material as given in the schedule, and with the class notes available on my web page. The course will be discussion oriented, with emphasis on discussions geared to the case studies at the end of each chapter.

 

A common theme in my courses is the development of your communications skills and your use of available computer technology and common software tools. All assignments should be typed (using computerized office tools) and be professionally presentable; hand-written assignments will not be graded. Your work must follow the standards specified in the PRESHINT.DOC file on my web site. You are expected to submit your work using word-processing and spreadsheet tools.

 

All homework will be submitted electronically via e-mail. It must be in my mail reader by midnight on the Monday preceding the class for which it is due, at the latest. Assignment due dates as given in the schedule or in class will be strictly adhered to, and late assignments will not be accepted unless prearranged with me. Virus-infected submissions will be deleted and not graded, with no opportunity for resubmission.

 

I maintain a web page for this class. To reach it, go to my URL listed above and select this class; you will find yourself in an "announcement file" for this course. This file includes references to related documents, such as this syllabus, homework, and PowerPoint presentations of class material, in addition to the latest announcements related to the class.

 

The course assumes that different students have different levels of understanding of, and background in, the course's topics, yet we will present the topics at an advanced level. Students with little familiarity with the material are expected to prepare themselves to fully understand the material and contribute to course work and discussions. You are always welcome to discuss this (and all other issues) with me.

 

 

ASSIGNMENTS, QUIZZES AND EXAMS

 

Assignments will usually be based on the case studies at the end of the text's chapters, and will be announced in class. Homework solutions will be discussed in class on the date they are due; therefore, late submissions of homework assignments will not be accepted. Note that homework will be based, to a large extent, on material you are supposed to read for the next class, and will be discussed in class only after you submit the homework, in order to let you exercise your own judgment and understanding.

 

There will be a team-oriented data mining course project. The project will include data collection and scrubbing, model building, and data analysis and presentation. The project and its various segments will also be discussed in class. Each team will provide a final project report, in addition to intermediate reports at the end of various segments of the project. The last class will be devoted to presentations of the projects. Following that, a presentation will be made to the public and to members of the Center for Research in Information Management (CRIM); this presentation is a required element of the course.

 

There will be no exams in this course. Rather, each class session (except the first one) may include a brief open book quiz, which stresses understanding of the required reading material and the material covered in the last class. This system allows timely feedback on your grade progress, and motivates you to prepare for each session (and thus increases the probability of quality participation and of getting the most from the class sessions).

 

 

CLASS ATTENDANCE

 

You are expected to attend all classes, and are responsible for all announcements made in class or in the announcement file. A makeup of a quiz or report will be given only with approval PRIOR to the quiz or report, except in extreme circumstances. Punctuality is highly regarded; no student arriving late will be given any extra time to complete a quiz, nor will a makeup quiz be offered for late arrival.

 

The university's honor code will be adhered to. Submitted reports and homework may from time to time be checked for plagiarism. Cheating, plagiarism or copying will result in an automatic failing grade for the problem, quiz, exam or project for all those participating in the cheating or copying, and may lead to a failing grade in the course for all those students who are deemed to have consciously contributed to the cheating. To help you comply with the anti-plagiarism policy, you will be required to submit all your homework and reports to TurnItIn, a plagiarism assessment program, from which I will download your homework and reports. Note also that since I will be downloading this material only once a week, you must adhere to the submission timing requirements.

 

 

GRADING

 

Grades will be based on homework assignments and quizzes (equally weighted, and possibly dropping the worst assignment and/or quiz), as well as the project. The homework and quizzes together will count for 50% of the final grade, and the project for the other 50%. No single homework or quiz will count for more than 5% of the final grade, except for the Risk homework, which will count for 15%. If there is an insufficient number of homework assignments and quizzes, the allocation of percentages between the project and the homework will be adjusted. Final grades will be assigned on a curve, and I will exercise my judgment as to the cut points, as well as to the grading of students who miss or come late to many of the classes.

 

Don't nitpick about the grading. Persons who complain will not be rewarded for it; those who have the decency not to complain deserve the same break. A request to look at one problem leads to re-grading of the whole paper, which often leads to a lower grade.

 

No "extra credit" opportunities will be offered or assigned to specific individuals under any circumstances; all students' grades will be based on the same components - this is an equal opportunity course.

TENTATIVE & APPROXIMATE COURSE SCHEDULE

(actual schedule will be determined by class advancement, and changes will be announced)

 

Class Number | Class Date | Topic | Chapter (in week topic started)
1 | Aug 30 | Introduction, DM Applications | DMT 1, 4; Web 1; Han 1
2 | Sep 6 | DM Methodology, CRISP-DM, Course Projects | DMT 2, 3
3 | Sep 13 | Cont. |
4 | Sep 20 | Preparing Data for Mining, Clementine Overview | DMT 17; Han 2
5 | Sep 27 | Clementine Overview – cont.; Project – Initial Understanding Presentation |
6 | Oct 4 | DM and Statistics, Hazard Functions, Survival Analysis | DMT 5, 12; Web 8
7 | Oct 11 | Hazard Functions, Survival Analysis – cont.; Project – Final Understanding Presentation; Risk Homework (submission Monday midnight before class 10) |
8 | Oct 18 | Memory Based Reasoning, Market Basket Analysis, Link Analysis | DMT 8-10; Han 5
9 | Oct 25 | Decision Trees, Clustering; Project – Model Presentation | DMT 6, 11; Han 6-7
10 | Nov 1 | Cont.; Risk – Presentation |
11 | Nov 8 | Neural Networks, Genetic Algorithms | DMT 7, 13
12 | Nov 15 | Mining Stream, Data Series and Sequences; Project – Evaluation Presentation | Han 8
13 | Nov 22 | DM and CRM | DMT 14; Web 7
14 | Nov 29 | Privacy and Societal Issues, the DM Environment, Putting DM to Work; Project – Final Report submission | DMT 15, 16, 18; Han 11
15 | Dec 6 | Project Presentations in Class |
 | Thu, Dec 7 | Project presentation in the CRIM Students Projects Exposition |
*** | | *** Exams Week - No Final Exam *** |