Data Mining by Mehmed Kantardzic
- Author: Mehmed Kantardzic
Link analysis is an important field that has received a lot of attention recently, as advances in information technology have enabled the mining of extremely large networks. The basic data structure is still a graph, only the emphasis in analysis is on links and their characteristics: labeled or unlabeled, directed or undirected. There is an inherent ambiguity with respect to the term "link" that occurs in many circumstances, but especially in discussions with people whose background and research interests are in the database community. In the database community, especially the subcommunity that uses the well-known entity-relationship (ER) model, a "link" is a connection between two records in two different tables. This usage of the term "link" differs from that in the intelligence community and in the artificial intelligence (AI) research community, where a "link" typically refers to some real-world connection between two entities. Probably the most famous example of exploiting link structure in a graph is the use of links to improve information retrieval results. Both the well-known PageRank measure and the hub and authority scores are based on the link structure of the Web. Link analysis techniques are used in law enforcement, intelligence analysis, fraud detection, and related domains. Link analysis is sometimes described using the metaphor of "connecting the dots" because link diagrams show the connections between people, places, events, and things, and represent invaluable tools in these domains.
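To make the idea of scoring nodes purely from link structure concrete, here is a minimal sketch of a PageRank-style power iteration. It is an illustration, not the formulation used in the book: the function name `pagerank`, the damping value 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the tiny four-page graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score nodes of a directed graph from its link structure alone.

    links: dict mapping each node to the list of nodes it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    # Collect every node that appears as a source or as a target.
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node splits its damped rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a node with no out-links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web graph: an edge A -> B means "A links to B".
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph, page C is linked to by three other pages and ends up with the highest score, while page D, which nobody links to, receives only the baseline share.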
12.2 TEMPORAL DATA MINING
Time is one of the essential aspects of data. Many real-life data describe the property or status of some object at a particular time instance. Today time-series data are being generated at an unprecedented speed from almost every application domain, for example, daily fluctuations of the stock market, traces of dynamic processes and scientific experiments, medical and biological experimental observations, various readings obtained from sensor networks, Web logs, computer-network traffic, and position updates of moving objects in location-based services. Time series or, more generally, temporal sequences, appear naturally in a variety of different domains, from engineering to scientific research, finance, and medicine. In engineering, they usually arise with either sensor-based monitoring, such as telecommunication control, or log-based systems monitoring. In scientific research they appear, for example, in space missions or in the genetics domain. In health care, temporal sequences have been a reality for decades, with data originating from complex data-acquisition systems like electrocardiograms (ECGs), or even simple ones like measuring a patient's temperature or treatment effectiveness. Business data are temporal as well. A supermarket transaction database records the items purchased by customers at some time points, and every transaction carries a time stamp recording when it was conducted. In a telecommunication database, every signal is also associated with a time. The price of a stock in a stock market database is not constant, but changes with time as well.
Temporal databases capture attributes whose values change with time. Temporal data mining is concerned with data mining of these large data sets. Samples associated with the temporal information present in this type of database need to be treated differently from static samples. The accommodation of time into mining techniques provides a window into the temporal arrangement of events and, thus, an ability to suggest cause and effect that are overlooked when the temporal component is ignored or treated as a simple numeric attribute. Moreover, temporal data mining has the ability to mine the behavioral aspects of objects as opposed to simply mining rules that describe their states at a point in time. Temporal data mining is an important extension as it has the capability of mining activities rather than just states and, thus, inferring relationships of contextual and temporal proximity, some of which may also indicate a cause-effect association.
Temporal data mining is concerned with data mining of large sequential data sets. By sequential data, we mean data that are ordered with respect to some index. For example, a time series constitutes a popular class of sequential data where records are indexed by time. Other examples of sequential data could be text, gene sequences, protein sequences, Web logs, and lists of moves in a chess game. Here, although there is no notion of time as such, the ordering among the records is very important and is central to the data description/modeling. Sequential data include:
1. Temporal Sequences. They represent ordered series of nominal symbols from a particular alphabet (e.g., a huge number of relatively short sequences in Web-log files or a relatively small number of extremely long gene expression sequences). This category includes collections of samples that are ordered but not time-stamped. The sequence relationships include before, after, meet, and overlap.
2. Time Series. It represents a time-stamped series of continuous, real-valued elements (e.g., a relatively small number of long sequences of multiple sensor data or monitoring recordings from digital medical devices). Typically, most of the existing work on time series assumes that time is discrete. Formally, time-series data are defined as a sequence of pairs T = ([p1, t1], [p2, t2], … , [pn, tn]), where t1 < t2 < … < tn. Each pi is a data point in a d-dimensional data space, and each ti is the time stamp at which pi occurs. If the sampling rate of a time series is constant, one can omit the time stamps and consider the series as a sequence of d-dimensional data points. Such a sequence is called the raw representation of the time series.
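The formal definition of a time series can be illustrated with a short sketch. It stores T as a list of (pi, ti) pairs, checks that the time stamps are strictly increasing, and drops the time stamps to obtain the raw representation when the sampling rate is constant. The function names, the tolerance parameter `tol`, and the two-dimensional sensor readings are hypothetical and chosen only for this example.

```python
def has_increasing_timestamps(series):
    """Check the condition t1 < t2 < ... < tn for a list of (point, time) pairs."""
    times = [t for _, t in series]
    return all(a < b for a, b in zip(times, times[1:]))


def has_constant_sampling(series, tol=1e-9):
    """Check that consecutive time stamps are equally spaced (within tol)."""
    times = [t for _, t in series]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return all(abs(gap - gaps[0]) <= tol for gap in gaps)


def raw_representation(series):
    """Drop the time stamps and return only the d-dimensional data points.

    This is meaningful only when the time stamps are ordered and the
    sampling rate is constant, so both conditions are checked first.
    """
    if not has_increasing_timestamps(series):
        raise ValueError("time stamps must be strictly increasing")
    if not has_constant_sampling(series):
        raise ValueError("sampling rate is not constant; keep the time stamps")
    return [point for point, _ in series]


# Hypothetical 2-dimensional readings (temperature, humidity) every 10 seconds.
T = [((20.1, 0.50), 0), ((20.4, 0.60), 10), ((20.9, 0.40), 20)]
print(raw_representation(T))
```

If the readings were taken at irregular intervals, for example at times 0, 10, and 25, the time stamps carry information that the raw representation would lose, which is why the sketch raises an error in that case instead of silently discarding them.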
Traditional analyses of temporal data require a statistical approach because of noise in raw data, missing values, or incorrect recordings. They include the estimation of (1) long-term trends, (2) cyclic variations, for example, business cycles, (3) seasonal patterns, and (4) irregular movements representing outliers. Examples are given in Figure 12.15. The discovery of relations in temporal data requires more emphasis in a data-mining process on the following three steps: (1) the representation and modeling of the data sequence in a suitable form; (2) the definition of similarity measures between sequences; and (3) the application of a variety of new models and representations to the