paper link https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/ICSE-2016-1-iDice-Problem-Identification-for-Emerging-Issues.pdf
Background:
When identifying problems on cloud machines, many events may happen before an issue emerges. If we want to find the combinations of events that contribute to the emerging issue, there may be a huge number of candidate combinations, because there can be many attributes and each attribute can take many values. Finding effective combinations efficiently is therefore difficult.
Aim
Identify the effective combinations that isolate the emerging issues.
Challenge
Two problems arise when choosing combinations of attribute values:
- Inefficient: there are many attributes, and each attribute may have many values, so the number of combinations explodes.
- Ineffective: effective combinations may be missed, and the resulting combinations may contain redundancy and overlap.
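To get a feel for the explosion, the size of the search space can be counted directly. The sketch below assumes that each attribute in a combination is either fixed to one of its values or left unconstrained. The attribute names and value counts are made up for illustration and do not come from the paper.

```python
from math import prod

def count_combinations(cardinalities):
    """Count the non-empty attribute combinations.

    Each attribute is either left unconstrained (1 choice) or fixed to one
    of its n values (n choices), giving prod(n_i + 1) candidates. The single
    combination that constrains nothing is subtracted.
    """
    return prod(n + 1 for n in cardinalities) - 1

# Hypothetical attributes and value counts, for illustration only.
cardinalities = {
    "Country": 200,
    "TenantType": 5,
    "DataCenter": 20,
    "Browser": 10,
    "OSVersion": 30,
}
total = count_combinations(cardinalities.values())
print(total)
```

Even with only five attributes of modest size, the candidate space already runs into the millions, which is why exhaustive enumeration is impractical.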
What combinations we need
An effective combination should have the following two features:
- Impactful: the combination should be associated with a large number of emerging issues.
- Reflecting changes: the combination should exhibit a significant burst after the change point.
Solution
This paper proposes a new method, named iDice, to identify the effective combinations among all possible combinations.
If a superset is an effective combination, then its subsets are also effective combinations. The converse does not hold (Figure 2).
After building a tree of attribute combinations, we prune it in the following steps (Figure 3):
- Impact based pruning (combinations should impact a large number of customers): only consider the attribute sets associated with a large volume of issue reports, and prune those without a high enough volume.
- Change detection based pruning (combinations should exhibit a change in their time series): use the GLR (Generalized Likelihood Ratio) algorithm to detect whether the time series of a combination contains a change; if not, prune it.
- Isolation power based pruning (remove redundancy).
The paper proposes the isolation power IP(X), defined in terms of the following quantities:
S_X: the time series data corresponding to the attribute combination X
X_a: the volume of time series data in S_X during the change region of X
X_b: the volume of time series data in S_X before the change point of X
It also uses the entire volume during the change region of X, the entire volume before the change point of X, and the mean values of the corresponding time series.
IP(X) is a redefinition of information entropy that incorporates the concept of a change point. If a combination X is effective, IP(X) should be less than the isolation power of both X's supersets and subsets.
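The first two pruning steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `min_volume` threshold, and the toy data are hypothetical, and the change detector is the textbook GLR statistic for a mean shift in Gaussian data with a known baseline mean and standard deviation, which may differ from the variant used in iDice. Isolation power based pruning is not covered here.

```python
from itertools import combinations

def volume(combo, reports):
    """Number of reports matching every (attribute, value) pair in combo."""
    return sum(all(r.get(a) == v for a, v in combo) for r in reports)

def impactful_combinations(reports, min_volume):
    """Breadth-first search over the combination tree with impact pruning.

    A combination is a frozenset of (attribute, value) pairs. Larger
    candidates are built only from combinations that passed the volume
    threshold, because adding a constraint can never increase the volume.
    """
    level = {frozenset([(a, v)]) for r in reports for a, v in r.items()}
    kept = {}
    while level:
        survivors = []
        for combo in level:
            vol = volume(combo, reports)
            if vol >= min_volume:
                kept[combo] = vol
                survivors.append(combo)
        level = set()
        for x, y in combinations(survivors, 2):
            cand = x | y
            # Grow by exactly one pair and never repeat an attribute.
            if len(cand) == len(x) + 1 and len({a for a, _ in cand}) == len(cand):
                level.add(cand)
    return kept

def glr_statistic(series, baseline_mean, sigma):
    """GLR statistic for a mean shift (Gaussian noise, known sigma).

    For each candidate change point k, the unknown post-change mean is
    replaced by its maximum-likelihood estimate. The statistic is the
    maximum over k of (sum_{i>=k} (x_i - mu0))^2 / (2 * sigma^2 * (n - k)).
    Returns (statistic, best_k); best_k is None if no shift is visible.
    """
    n = len(series)
    best, best_k = 0.0, None
    suffix = 0.0
    for k in range(n - 1, -1, -1):
        suffix += series[k] - baseline_mean
        stat = suffix ** 2 / (2 * sigma ** 2 * (n - k))
        if stat > best:
            best, best_k = stat, k
    return best, best_k

# Toy data, for illustration only.
reports = [
    {"Country": "India", "Browser": "Edge"},
    {"Country": "India", "Browser": "Edge"},
    {"Country": "India", "Browser": "Chrome"},
    {"Country": "US", "Browser": "Edge"},
]
kept = impactful_combinations(reports, min_volume=2)
stat, k = glr_statistic([10, 10, 10, 10, 30, 30, 30], baseline_mean=10, sigma=2)
print(len(kept), stat, k)
```

A combination would pass the change detection step when its statistic exceeds a chosen threshold. The threshold trades off false alarms against missed changes and has to be tuned for the data at hand.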