Mgt 4322 - Home   Spring 2010   Email to Dr. Lyons     PatLyons Home
[ Calendar10:10 | PortWebsite | StuWebs  | Port1 | 2 | 3 | 4 | 5 | 6 | Showcase | Presentation ]
[ Ch 1 || 3 || 5 || 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 || AppDatabase | Ex1 || 3 | 4 | 5 | 6 | 7 ]
[ HW 1| 2 | 3 | 4 |
| Career1| 2 | 3 ] [
SJU | TCB | CareerCtr | StuInfo | CareerLinks ] [ SJU Closing ] [H1N1SelfAssessment]


  Ch 6 - Foundations of Business Intelligence: Databases and Information Management

  1. Opening Case - Hewlett-Packard      (p207)
    1. Problem: Mark Hurd (CEO of HP) had difficulty collecting "consistent, timely data spanning different parts of HP".  HP had 14,000 databases.
    2. Solution: a data warehouse with a single global enterprise-wide database.  HP can now track marketing across all business units, by media sector, customer segment, and country.
    3. Illustrates benefit of a data warehouse - HP management now makes better spending decisions.
       
  2. File Organization Concepts     (p209)
    1. Def - Bit (binary digit) - is either 0 or 1. Binary digits are a basic unit of information storage in digital computing.
      See http://en.wikipedia.org/wiki/Bit.  Bits can be represented by:
      1. A transistor conducting or not conducting,
      2. A capacitor charged or discharged,
      3. A spot on a CD-ROM that is either a pit (a small groove on the CD, which reflects laser light away) or a ground (a flat reflective surface, which reflects laser light back to a sensor), or
      4. A spot on a magnetic tape that is polarized in one of two states.
    2. Def - Byte - is a collection of bits, originally variable in size but now almost always eight bits.  Most characters are encoded as 1 byte.
      0100 1001 is the ASCII encoding of the letter I
      (p174)
    3. Def - Field - a grouping of characters.  For example, a student's ID or age
    4. Def - Record - a group of related fields.  For example, StudentID, Course, SemesterTaken, Grade
    5. Def - File - a group of records of the same type.  For example, a group of the above course records form the Course file.  Fig 6-1, p209.
    6. Def - Database - a group of related files.  For example, Course file, Financial file, and Personal file.  Fig 6-1.
    7. Def - Key - a field(s) that uniquely identifies a record.  For example, Supplier_Number in the Supplier table of Fig6-4, p214.
       
  3. Organizing Data in a Traditional File Environment     (p209)
    1. In the past, for many organizations, data files and systems tended to grow independently without a company-wide plan.  Fig 6-2, p210.
    2. Problems with Traditional File Environment
      1. Data Redundancy and Inconsistency - HR and finance departments may independently maintain files with employee info, such as number of dependents, and be inconsistent.
      2. Lack of Data Sharing - HR may need info for personnel planning available in finance department's files, but HR cannot access it.
         
  4. The Database Approach to Data Management     (p212)
    1. Def - Database Management System (DBMS) - software that permits an organization to centralize data, manage them efficiently, and provide access to the stored data by application programs.
      1. DBMS separates the logical view (how the data are perceived by the end users) from the physical view (how data are actually organized on physical storage media).
    2. Def - Relational Database - a database that represents data as tables (also called relations).
      1. For all practical purposes, a table is the same as a file, as defined above in II.E.
      2. It is not necessary for our purposes, but if you wish to read the details, see http://en.wikipedia.org/wiki/Relational_database and http://en.wikipedia.org/wiki/Relation_(mathematics)
    3. Def - Relational DBMS - a DBMS that creates relational databases.
      1. The most popular type of DBMSs today are relational DBMSs.
      2. Microsoft Access is a relational DBMS for PCs.
      3. IBM DB2, Oracle Database, and Microsoft SQL Server are relational DBMS for larger computers.
         
  5. Designing Databases     (p219)
    1. Normalization - a database design technique for relational database tables which eliminates data inconsistency.
    2. Definitions
      1. A record is unnormalized if it contains repeated groups of nonkey items.
      2. A record is in First Normal Form (1NF) if it has no repeated groups of nonkey items.
      3. A record is in Second Normal Form (2NF) if it is in 1NF and nonkey items are fully dependent on the entire key.
      4. A record is in Third Normal Form (3NF) if it is in 2NF and does not have transitive dependence.  Transitive dependence occurs when a nonkey item depends on the record key only via another nonkey item.
    3. Database designer's promise - I do solemnly promise that, when designing a database, I shall make every nonkey field depend on the key, the whole key, and nothing but the key, so help me Codd.  (The normalization process was first proposed by Edgar F. Codd, see http://en.wikipedia.org/wiki/E.F._Codd.)

    4. Normalization Process

      1. Consider the Order Record, with fields
        *Order #, Date of Order, Customer Name, Customer Address, Product #, Quantity Ordered, Unit Price,
             which may contain repeated groups of Product #, Quantity Ordered, and Unit Price.
             Note by PJL(2010Mar4): Unit Price depends only on Product #, not Order #.
             * denotes key.     (Order Record is unnormalized)

        *Order # Date of
        Order
        Customer
        Name
        Customer
        Address
        Product # Quantity
        Ordered
        Unit
        Price
        Product # Quantity
        Ordered
        Unit
        Price
        101 2/4/2007 Davis 22 Elm St. 44 2 1.00      
        102 2/6/2007 Garcia 14 Oak St. 44 3 1.00 64 4 3.00
        103 2/6/2007 Payne 31 Maple St. 37 1 2.00 64 1 3.00
        104 2/8/2007 Garcia 14 Oak St. 37 4 2.00      
      2. To convert an unnormalized record to 1NF, separate the original record into two or more new records, which do not have repeating groups.
        Order Record - *Order #, Date of Order, Customer Name, Customer Address

        *Order # Date of Order Customer Name Customer Address
        101 2/4/2007 Davis 22 Elm St.
        102 2/6/2007 Garcia 14 Oak St.
        103 2/6/2007 Payne 31 Maple St.
        104 2/8/2007 Garcia 14 Oak St.

        Sales Record - *Order #, *Product #, Quantity Ordered, Unit Price

        *Order # *Product # Quantity Ordered Unit Price
        101 44 2 1.00
        102 44 3 1.00
        102 64 4 3.00
        103 37 1 2.00
        103 64 1 3.00
        104 37 4 2.00
      3. To convert a 1NF record to 2NF, separate the 1NF record into new records, in which nonkey items are fully dependent on the entire key.
        Order Record - is in 2NF
        Sales Record - *Order #, *Product #, Quantity Ordered

        *Order # *Product # Quantity Ordered
        101 44 2
        102 44 3
        102 64 4
        103 37 1
        103 64 1
        104 37 4

        Product Record - *Product #, Unit Price

        *Product # Unit Price
        44 1.00
        37 2.00
        64 3.00
      4. To convert a 2NF record to 3NF, separate the 2NF record into new records, which do not have transitive dependence.
        (In Order Record, Customer Address is transitively dependent on the key only via another nonkey item, Customer Name.)
        Sales Record - is in 3NF
        Product Record - is in 3NF

        Order Record - *Order #, Date of Order, Customer Name

        *Order # Date of Order Customer Name
        101 2/4/2007 Davis
        102 2/6/2007 Garcia
        103 2/6/2007 Payne
        104 2/8/2007 Garcia

        Customer Record - *Customer Name, Customer Address

        *Customer Name Customer Address
        Davis 22 Elm St.
        Garcia 14 Oak St.
        Payne 31 Maple St.
      5. Resulting Database
        Order Record - *Order #, Date of Order, Customer Name
        Sales Record - *Order #, *Product #, Quantity Ordered
        Customer Record - *Customer Name, Customer Address
        Product Record - *Product #, Unit Price

    5. References for normalization - http://en.wikipedia.org/wiki/Database_normalization, en.wikipedia.org/wiki/Relation_(mathematics), support.microsoft.com/kb/283878

    6. Homework Problem: determine the 1NF, 2NF, and 3NF of the following Student Record, with fields
      *StudentID, StudentName, StudentHighSchool, HighSchoolEnrollment, CourseID, CourseTitle, Grade,
           which may contain repeated groups of CourseID, CourseTitle, and Grade.
           * denotes key.     (Student Record is unnormalized)

      *StudentID Student
      Name
      Student
      High School
      High School
      Enrollment
      CourseID Course
      Title
      Grade CourseID Course
      Title
      Grade
      101 Chan Molloy 2000 3325 OpsMgt A      
      102 Hill StFrancis 3000 2301 IntroMgt B 3325 OpsMgt B+
      103 Diaz HolyCross 500 2301 IntroMgt A- 4322 MIS B+
      104 Lewis StFrancis 3000 4322 MIS A      
    7. Solution to Homework Problem - Right click this link and use the SaveTargetAs option to download my solution, NormalStuRec.doc (74KB).
       
  6. Using Databases to Improve Performance     (p222)
    1. Def - Data Warehouse - a database of current and historical data of potential interest to decision makers throughout the company, Fig 6-13, p223.
      1. The corresponding data warehouse system provides tools for improved Business Intelligence, explained below.
    2. Tools for Business Intelligence include:
      1. Standard database query and reporting tools.
      2. Online Analytical Processing (OLAP) - provides answers to analytical queries that are dimensional in nature.  Fig 6-15, p227.  How many washers were sold in the East region in June, and how does it compare with the sales forecast for August?
        See http://en.wikipedia.org/wiki/Olap.
      3. Data mining - is the process of finding patterns and relationships in large databases using computational techniques from pattern recognition and machine learning.  See http://en.wikipedia.org/wiki/Data_mining.
        Types of information found include:
        1. Associations - occurrences linked to a single event - also called Market Basket Analysis - try to detect consumer buying habits and formulate marketing plans accordingly.
          Example - men in their 20s who purchase beer on Fridays after work are also likely to buy a pack of disposable diapers (for their young children). Thus, a merchant may put beer on sale on Fridays and sell the diapers at full price.
          See http://web.onetel.net.uk/~hibou/Beer and Nappies.html.
        2. Sequences - events linked over time.
          Example - if a new house is purchased, a new refrigerator is purchased within two weeks 65% of the time.
        3. Classifications - placing an individual item in one of several predefined groups based on a training set of previously classified items.  Called supervised learning.
          Example - classifying an existing credit card as being in a state of fraudulent use.
        4. Clusterings - similar to classification, but there are no predefined groups.  Call unsupervised learning.
          Examples - bank finding characteristics of affinity groups for credit cards, or
          Intrawest finding 7 ski customer categories for marketing (see Ch 2, III.C.3).
        5. Forecasting - forecasting a future value of a continuous variable based on a training set of previous cases.  Often uses a neural network.  See p433.
          Example - forecasting production yield, based on two variables, such as temperature and pressure.  See http://www.patlyons.com/research/NeuralNets.htm.
           
  7. Ensuring Data Quality     (p233)

    1. Gartner Group reports:
      1. More than 25% of critical data in Fortune 1000 companies' databases is inaccurate or incomplete.
      2. Customer data degrades 2% per month.
    2. Def - Data Cleansing - activities that detect and correct incorrect, incomplete, improperly formatted, or redundant data.
    3. Def - Check Digit - an extra digit appended to a numerical record field to spot errors, such as the transposition of two adjacent digits.
      1. Example (for your info) - the final digit of a Universal Product Code is a check digit computed as follows:
        Add the digits in the odd-numbered positions (first, third, fifth, etc.) together and multiply by three.
        Add the digits in the even-numbered positions (second, fourth, sixth, etc.) to the result.
        Subtract the result from the next-higher multiple of ten. The answer is the check digit.
        For instance, a UPC barcode for a box of tissues is "03600029145X" where X is the check digit.
        X can be calculated by adding the odd-numbered digits (0+6+0+2+1+5 = 14), multiplying by three (14 × 3 = 42), adding the even-numbered digits (42+3+0+0+9+4 = 58) and subtracting from the next-highest multiple of ten (60 - 58 = 2).
        The check digit is thus 2 and the full UPC is 036000291452.
        This scheme will spot the transposition error of any two adjacent digits.
        See http://en.wikipedia.org/wiki/Check_digit.

    4. Def - Data Administration - an organizational function for managing an organization's information policy, data planning, and data quality standards (p233).

      1. Several firms offer data administration services, such as DataFlux - www.dataflux.com,
        PitneyBowesBusinessInsight - www.pbinsight.com, and Trillium Software - www.trilliumsoftware.com.
                           (This page was last edited on March 04, 2010 .)