Data Mining And Business Intelligence (2170715)

BE | Semester-7   Winter-2018 | 03/12/2018

Q3) (b)

Explain the following attribute selection measures: (i) Information Gain (ii) Gain Ratio

Information gain:

  • ID3 uses information gain as its attribute selection measure.
  • This measure is based on pioneering work by Claude Shannon on information theory, which studied the value or “information content” of messages.
  • Let node N represent or hold the tuples of partition D.
  • The expected information needed to classify a tuple in D is Info(D) = −Σ pᵢ log₂(pᵢ), where pᵢ is the probability that a tuple in D belongs to class Cᵢ; the information gain of an attribute A is Gain(A) = Info(D) − Info_A(D), where Info_A(D) is the expected information still required after partitioning D on A.
  • The attribute with the highest information gain is chosen as the splitting attribute for node N.
  • This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions.
  • Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
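As a rough illustration (not part of the textbook answer), the computation described above can be sketched in Python. The function names and the toy data set are assumptions made for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple, -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Gain(A) = Info(D) - Info_A(D): reduction in impurity from splitting on A."""
    total = len(labels)
    partitions = {}
    # Group class labels by the value of attribute A in each tuple.
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # Info_A(D): weighted average entropy of the resulting partitions.
    info_a = sum(len(part) / total * entropy(part)
                 for part in partitions.values())
    return entropy(labels) - info_a

# Toy data: one attribute ("outlook") that perfectly separates the classes.
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(info_gain(rows, labels, 0))  # 1.0: the split removes all impurity
```

ID3 would evaluate `info_gain` for every candidate attribute at node N and split on the one with the highest value.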

Gain ratio:

  • The information gain measure is biased toward tests with many outcomes.
  • That is, it prefers to select attributes having a large number of values.
  • For example, consider an attribute that acts as a unique identifier, such as product_ID.
  • A split on product_ID would result in a large number of partitions (as many as there are values), each one containing just one tuple.
  • Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_product_ID(D) = 0.
  • Therefore, the information gained by partitioning on this attribute is maximal.
  • Clearly, such a partitioning is useless for classification.
  • C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias.
  • It normalizes the gain by the split information, SplitInfo_A(D) = −Σ (|Dⱼ|/|D|) log₂(|Dⱼ|/|D|), defining GainRatio(A) = Gain(A) / SplitInfo_A(D); the attribute with the maximum gain ratio is selected as the splitting attribute.
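A self-contained sketch of this idea in Python follows; the helper names and toy data are assumptions for illustration. It shows how gain ratio penalizes the unique-identifier split that plain information gain would favor:

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum(p_i * log2(p_i)) over the class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D), as used by C4.5."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # Gain(A) = Info(D) - Info_A(D)
    gain = entropy(labels) - sum(len(p) / total * entropy(p)
                                 for p in partitions.values())
    # SplitInfo_A(D): entropy of the split itself; large for many-valued splits.
    split_info = -sum((len(p) / total) * math.log2(len(p) / total)
                      for p in partitions.values())
    return gain / split_info if split_info else 0.0

# Attribute 0 is a unique product_ID; attribute 1 genuinely predicts the class.
rows = [(1, "sunny"), (2, "sunny"), (3, "rain"), (4, "rain")]
labels = ["no", "no", "yes", "yes"]
print(gain_ratio(rows, labels, 0))  # 0.5: gain of 1.0 divided by SplitInfo of 2.0
print(gain_ratio(rows, labels, 1))  # 1.0: the meaningful attribute wins
```

Both attributes achieve the maximal information gain of 1.0, but the split information of the product_ID split (log₂ 4 = 2 bits) halves its gain ratio, so C4.5 prefers the meaningful attribute.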