A.3 Visualization, Visual Mining, and Reporting




Table A.6 Visualization tools: R-graphics, HighCharts, Tableau Public

R-graphics
  Link, url: [5]
  Documentation: on the website; [9]
  Licensing: GPL
  Operating system: Linux, Mac OS, Unix, Windows
  Supported data formats: csv, Excel
  User interface: command line
  Algorithms, techniques, visualizations: statistical graphics, dynamic graphics for data exploration
  Data export/import, interfaces: interface to all DB; export to JPEG, PNG, pdf, SVG

HighCharts
  Link, url: [34]
  Documentation: tutorial and publications on the website
  Licensing: Creative Commons NonCommercial
  Operating system: JavaScript, jQuery, HTTP server
  Supported data formats: csv, Excel, JSON, XML
  User interface: JavaScript
  Algorithms, techniques, visualizations: interactive graphics, dashboards
  Data export/import, interfaces: Web

Tableau Public
  Link, url: [35]
  Documentation: tutorial on the website
  Licensing: free
  Operating system: Windows, Mac OS X
  Supported data formats: csv, Excel
  User interface: GUI
  Algorithms, techniques, visualizations: interactive graphics, dashboards
  Data export/import, interfaces: Web

All three tools
  Extensibility: yes
  Data preprocessing: yes
  Interactivity: yes
  Community (e.g., forum, blog): all supported by a variety of documentation and fora



For personal use and nonprofit organizations, HighCharts is freely available. A free tool for reporting and infographics is Tableau Public. Tableau Public has an easy-to-use interface and proposes a data visualization after parsing the uploaded data. Afterwards, the user can customize this basic layout in drag-and-drop style. The produced infographic can be published on the Web.

Among the commercial products for visualization of cross-sectional data, we want to mention the SAS data mining software and the IBM SPSS Modeler, which integrate visualization into the data mining activities. For an overview of R-graphics, HighCharts, and Tableau Public, see Table A.6.
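The R-graphics facilities summarized in Table A.6 are driven from the command line. As a minimal sketch (using the built-in mtcars data set, chosen here only for illustration), the following code produces a histogram and a scatter plot with a smoother and exports them to two of the formats listed in the table:

```r
# Base R graphics from the command line: a histogram and a scatter
# plot with a lowess smoother, exported to PNG and PDF.
data(mtcars)

png("mpg_hist.png")
hist(mtcars$mpg, main = "Miles per Gallon", xlab = "mpg")
dev.off()

pdf("mpg_vs_wt.pdf")
plot(mtcars$wt, mtcars$mpg, xlab = "weight (1000 lbs)", ylab = "mpg")
lines(lowess(mtcars$wt, mtcars$mpg))
dev.off()
```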

There are numerous tools for Web-based graphics and infographics. Table A.7 lists ManyEyes, Gapminder, and Piktochart.
ManyEyes is an advanced visualization tool from IBM. The main emphasis is on sharing graphics within the ManyEyes community. Users can create their own graphics in a few easy steps or modify the graphics of other community members.
Gapminder is based on the Trendanalyzer software developed for the animated presentation of statistics as so-called motion charts. These charts impressively show the development of demographic, economic, or environmental facts. Many time






A Survey on Business Intelligence Tools



Table A.7 Web-based visualizations: ManyEyes, Gapminder, Piktochart

ManyEyes
  Link, url: [36]
  Documentation: not much available; for an introduction see [8]
  Licensing: free; data and visualizations are directly shared, copyright should be cleared
  Supported data formats: csv, spreadsheet

Gapminder
  Link, url: [37]
  Documentation: on the website
  Licensing: free
  Supported data formats: Google spreadsheet

Piktochart
  Link, url: [38]
  Documentation: on the website
  Licensing: free
  Supported data formats: csv, spreadsheet

All three tools
  Operating system: Web browser
  User interface: interactive graphical user interface
  Algorithms, techniques, visualizations: various basic layouts for graphics
  Data export/import, interfaces: Web publishing or download
  Data preprocessing: limited
  Interactivity: interactive editing of the visualization
  Community (e.g., forum, blog): all supported by a community



series at the national level, as well as from international organizations, are available on the site. The Trendanalyzer software is now available as an interactive chart in Google Spreadsheets. This allows users to create motion charts with their own data.
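Motion charts of the Gapminder type can also be produced from R. The sketch below assumes the googleVis package (which wraps the Google Charts API) is installed; Fruits is a demo data set shipped with that package:

```r
# Sketch: a Gapminder-style motion chart via the googleVis package.
# googleVis is an assumed dependency; Fruits is its bundled demo data.
library(googleVis)

M <- gvisMotionChart(Fruits, idvar = "Fruit", timevar = "Year")
plot(M)   # opens the animated chart in a Web browser
```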

Piktochart is an easy-to-use tool for creating infographics. Numerous templates for infographics are available and can be adapted by the user. The created infographics can include interactive elements and are readable by search engines.

In addition, there are several other tools that enable the creation of infographics,

e.g.:

• http://www.hongkiat.com/blog/infographic-tools/

• http://www.coolinfographics.com/tools/

• http://www.fastcodesign.com/3029239/infographic-of-the-day/30-simple-tools-for-data-visualization






A.4 Data Mining

In this book, we used R for data mining applications. Strictly speaking, R is a programming language for statistical computing and statistical graphics. It is a popular data mining tool for scientists, researchers, and students. Consequently, there exists a large community with fora and blogs which helps newcomers learn how to use the numerous packages necessary for data mining. Besides data mining, a rich set of statistical methods for data preparation and graphics is available. R has strong object-oriented programming facilities which allow extending the software as soon as one has mastered the R language.

For usage of R as a BI production tool, the package DBI offers an interface to relational database systems. For big data, a number of solutions are provided. The package data.table is a fast tabulation tool as long as the data fit into memory, e.g., 100 GB in RAM. For using Hadoop and the MapReduce approach, a number of packages have to be installed; for details, we refer to [4]. For a number of algorithms, parallel implementations are also available.
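As a sketch of this setup, the following code pulls a table from a relational database through DBI and aggregates it with data.table. The driver, database file, and table are hypothetical placeholders; any DBI-compliant driver could stand in for RSQLite:

```r
# Hypothetical example: DBI as the database interface, data.table
# for fast in-memory tabulation.
library(DBI)
library(RSQLite)
library(data.table)

con   <- dbConnect(RSQLite::SQLite(), "sales.db")   # placeholder database
sales <- as.data.table(dbGetQuery(con, "SELECT region, amount FROM sales"))
dbDisconnect(con)

sales[, .(total = sum(amount)), by = region]        # fast aggregation
```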

From a more practical point of view, big data problems can be handled by sampling data from a database, developing a decision rule for the sample, and afterwards deploying the learned rule in the database. Thus, R can be used as an analysis tool in connection with an analytical sandbox. Alternatively, it is often useful to aggregate the data and analyze the aggregated data.
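The sample-learn-deploy pattern can be sketched as follows; the data set is simulated, and the rpart decision tree merely stands in for whatever decision rule is learned in the sandbox:

```r
# Sketch of the sample-learn-deploy pattern with simulated data.
library(rpart)

full   <- data.frame(x1 = rnorm(1e5), x2 = rnorm(1e5))
full$y <- factor(ifelse(full$x1 + full$x2 > 0, "yes", "no"))

train     <- full[sample(nrow(full), 5000), ]      # sample into the sandbox
rule      <- rpart(y ~ x1 + x2, data = train)      # learn a decision rule
full$pred <- predict(rule, full, type = "class")   # deploy on the full data
```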

Basically, R is command-line oriented, but a number of GUIs exist. For development, RStudio offers an IDE; for data mining, the Rattle GUI can be used; and Revolution Analytics provides a Visual Studio-based IDE. Further, the RWeka interface facilitates the application of Weka data mining algorithms within R.
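A minimal sketch of the RWeka interface, assuming the RWeka package and a Java runtime are installed:

```r
# Calling Weka's J48 (C4.5) decision tree from R via RWeka.
library(RWeka)

data(iris)
model <- J48(Species ~ ., data = iris)
summary(model)   # Weka-style evaluation output on the training data
```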

Weka is a Java-based data mining software which offers analysis tools similar to R. It also provides numerous data preprocessing techniques. With respect to data visualization, the facilities are less comprehensive. The main user interface of Weka is the Explorer, which provides access to the different data mining tasks in several panels. There are panels for preprocessing, for variable selection, for visualization, and for different data mining techniques like classification, clustering, or association analysis. Weka supports two other BI tools: the Pentaho Business Analytics Platform uses Weka for data mining and predictive analytics, and inside ProM, Weka can be used for data mining, for example, in decision point analysis.

As a third open-source data mining software, we want to mention RapidMiner. Thanks to its easy-to-use interface, it is one of the most popular data mining tools in BI. It captures the entire life cycle of a BI application, allows model management, and is well designed for collaboration between the business analyst and the data scientist. With respect to analysis capabilities, it offers algorithms for data preparation and for analysis. Algorithms from external sources like R or Weka can be included in the analysis. Further, it supports the analysis of data in memory, in databases, and in the cloud, and it supports Hadoop. For an overview of R, RapidMiner, and Weka, see Table A.8.






Table A.8 Data mining tools: R, RapidMiner, Weka

R
  Link, url: [5]
  Documentation: on the website: manuals, the R Journal, FAQs
  Licensing: GPL
  Operating system: Linux, Mac OS X, Windows
  User interfaces: command line, various GUIs
  Community (e.g., forum, blog): [32]

RapidMiner
  Link, url: [40]
  Documentation: on the website
  Licensing: AGPL
  Operating system: all platforms (Java based)
  User interface: GUI
  Community (e.g., forum, blog): [41]

Weka
  Link, url: [42]
  Documentation: [43]
  Licensing: GPL
  Operating system: all platforms (Java based)
  User interfaces: command line, various GUIs
  Community (e.g., forum, blog): [44]

All three tools
  Supported data formats: basically csv, but various other data formats are supported
  Extensibility: yes
  Algorithms, techniques, visualizations: algorithms for all data mining tasks, various visualization techniques
  Data export/import, interfaces: interfaces to all DB systems
  Data preprocessing: supported by various algorithms
  Interactivity: depending on the application



Among the commercial products, the SAS data mining software and the IBM SPSS Modeler are two powerful data mining tools. Both products offer a visual interface and allow applications without programming.



A.5 Process Mining

Table A.9 summarizes details on the process mining tool ProM, which is applied in Chap. 7. No comparable open-source tool is available; hence, only ProM is introduced here. Nonetheless, one can mention Disco [45], a commercial process mining tool that evolved from ProM.






Table A.9 Process mining tool: ProM

Availability
  Link, url: [39]
  Documentation: [7]
  Licensing: ProM 5.2: CPL; ProM 6.2: LGPL; ProM 6.3: LGPL; ProM 6.4: GPL

Technical criteria
  Operating system: all platforms
  Supported data formats: log formats MXML, XES, and csv; process model formats PNML, YAWL specification, BPEL, CPN
  Extensibility: development of Java-based plug-ins
  User interfaces: n.a.

Evaluation
  Algorithms, techniques, visualizations: several algorithms for process discovery, conformance checking, filtering, organizational mining, etc.; visualizations as, e.g., graph-based process models or dotted charts
  Data export/import, interfaces: import of MXML, XES, csv, PNML, etc.; export of process models as graphics (e.g., eps, svg), Petri nets as PNML, logs as MXML or XES, and reports as HTML
  Data preprocessing: filtering
  Interactivity: partly, e.g., mouseover and dragging of social networks
  Community (e.g., forum, blog): supported by fora, developer support, and the ProM task force



A.6 Text Mining

All the data mining software products reviewed in Sect. A.4 offer text mining facilities for classification and cluster analysis of text data represented as document term matrices. The applicability of these tools essentially depends on the data sources which can be read by the software, the available transformations for preprocessing, the availability of linguistic knowledge for the language under consideration, and the analysis algorithms. For example, the package tm can process a number of formats by using plugins. Regarding linguistic knowledge, part-of-speech tagging and stemming can be done, and WordNet can be accessed as an English lexical database. For analysis, a number of advanced statistical models like topic models or specific classification and cluster algorithms can be used.
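A minimal sketch with the tm package, assuming tm and SnowballC (for stemming) are installed; the two example documents are made up:

```r
# Building a document-term matrix with tm: lower-casing, punctuation
# removal, and stemming as preprocessing steps.
library(tm)

docs <- VCorpus(VectorSource(c("Text mining with R.",
                               "Mining text needs some preprocessing.")))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stemDocument)

dtm <- DocumentTermMatrix(docs)
inspect(dtm)   # documents as rows, stemmed terms as columns
```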

An open-source tool which puts more emphasis on natural language processing is GATE. GATE stands for General Architecture for Text Engineering and was






developed at the University of Sheffield. On the homepage [46], one can find extensive documentation.

GATE consists of a number of components. A core component of GATE is an information extraction tool which offers modules for tokenization, part-of-speech tagging, sentence splitting, entity identification, semantic tagging, and referencing between entities. A number of plugins offer applications for data mining algorithms or the management of ontologies. Another important component allows indexing and searching of the linguistic and semantic information generated by the applications.

GATE supports the analysis of text documents in different languages and in various data formats. The GATE Developer is the main user interface; it supports the loading of documents, the definition of a corpus, the annotation of the documents, and the definition of applications.



References

1. Arnold P, Rahm E (2014) Enriching ontology mappings with semantic relations. Data Knowl Eng 93:1–18
2. COMA 3.0 CE, program description (2012) Database chair, University of Leipzig, Leipzig
3. Fridman NN, Tudorache T (2008) Collaborative ontology development on the (semantic) web. In: Symbiotic relationships between semantic web and knowledge engineering. Papers from the 2008 AAAI spring symposium, Technical Report SS-08-07, AAAI
4. Prajapati V (2013) Big data analytics with R and Hadoop. http://it-ebooks.info/book/3157/. Accessed 11 Nov 2014
5. R Core Team (2014) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org. Accessed 12 Dec 2014
6. Tuncer O, van den Berg J (2012) Implementing BI concepts with Pentaho, an evaluation. Delft University of Technology, Delft
7. van der Aalst WMP (2011) Process mining: discovery, conformance and enhancement of business processes. Springer, Heidelberg
8. Viegas FB, Wattenberg M, van Ham F, Kriss J, McKeon M (2007) ManyEyes: a site for visualization at internet scale. IEEE Trans Vis Comput Graph 13(6):1121–1128
9. Wickham H (2009) ggplot2: Elegant graphics for data analysis. Springer, New York
10. http://www.databaseanswers.org/modelling_tools.htm. Accessed 4 Dec 2014
11. http://www.etltools.net/free-etl-tools.html. Accessed 4 Dec 2014
12. http://community.pentaho.com/. Accessed 4 Dec 2014
13. http://www.talend.com/products/big-data. Accessed 4 Dec 2014
14. http://wiki.pentaho.com/display/EAI/Latest+Pentaho+Data+Integration+%28aka+Kettle%29+Documentation. Accessed 4 Dec 2014
15. http://www.talendforge.org/tutorials/menu.php. Accessed 4 Dec 2014
16. https://de.talend.com/resources/whitepapers?field_resource_type_tid[]=79. Accessed 4 Dec 2014
17. http://www.cloveretl.com/. Accessed 4 Dec 2014
18. http://www.jitterbit.com/. Accessed 4 Dec 2014
19. http://protege.stanford.edu/. Accessed 4 Dec 2014
20. http://dbs.uni-leipzig.de/Research/coma.html. Accessed 4 Dec 2014
21. http://www.altova.com/mapforce.html. Accessed 4 Dec 2014
22. http://protegewiki.stanford.edu/wiki/ProtegeDesktopUserDocs. Accessed 4 Dec 2014



23. http://hadoop.apache.org/. Accessed 5 Dec 2014
24. http://docs.0xdata.com/. Accessed 5 Dec 2014
25. https://storm.apache.org/. Accessed 5 Dec 2014
26. https://storm.apache.org/documentation/Home.html. Accessed 5 Dec 2014
27. http://docs.0xdata.com/. Accessed 5 Dec 2014
28. http://docs.0xdata.com/benchmarks/benchmarks.html. Accessed 5 Dec 2014
29. http://www.orientechnologies.com/orientdb/. Accessed 5 Dec 2014
30. http://basex.org/. Accessed 5 Dec 2014
31. http://docs.basex.org/wiki/Main_Page. Accessed 5 Dec 2014
32. http://www.inside-r.org/. Accessed 12 Dec 2014
33. http://www.ggobi.org/. Accessed 5 Dec 2014
34. http://www.highcharts.com/. Accessed 12 Dec 2014
35. http://www.tableausoftware.com/public/. Accessed 12 Dec 2014
36. http://www-969.ibm.com/software/analytics/manyeyes/. Accessed 9 Dec 2014
37. http://www.gapminder.org/. Accessed 12 Dec 2014
38. http://piktochart.com/. Accessed 12 Dec 2014
39. http://www.processmining.org/. Accessed 11 Dec 2014
40. https://rapidminer.com/. Accessed 12 Dec 2014
41. http://forum.rapid-i.com/. Accessed 12 Dec 2014
42. http://www.cs.waikato.ac.nz/ml/weka/. Accessed 12 Dec 2014
43. http://www.cs.waikato.ac.nz/ml/weka/book.html. Accessed 12 Dec 2014
44. http://www.cs.waikato.ac.nz/ml/weka/help.html. Accessed 12 Dec 2014
45. http://www.fluxicon.com/disco/. Accessed 19 Dec 2014
46. https://gate.ac.uk/. Accessed 12 Dec 2014






Index



Absorbing state, 226

Accuracy (data quality), 79

Activity (business process), 54, 248, 257

Actor, 54, 256, 277

Adjacency matrix, 52, 225

Adjusted R-squared, 164

Agglomerative method, 196

Aggregation (data schema), 100

α-Algorithm, 256, 257, 262

Alternative hypothesis, 69, 162

Analysis technique, 20, 119

Analytical business model, 17

Analytical format, 98

Analytical goal, 12, 38, 42, 211, 214, 217, 235

Analytical sandbox, 23, 337

Analytical technique, 21, 41, 159

Animation, 250

Auditing, 266

Authority, 229

Average linkage, 195



Backpropagation, 168

Bagging, 184

Bag of concepts, 312

Bag of words, 299

Balanced score card, 5, 150

Bandwidth, 169

Bar chart, 134

Bayes Theorem, 65, 178

Bias-variance trade-off, 158

Big data, 93

Bigrams, 298

Binomial distribution, 66

Bins, 130, 137



BI perspectives, 6, 12, 15, 17, 19, 41, 120, 123

Biplot, 143

Boosting, 190

Bootstrap, 184

Boxplot, 137, 138, 144

BPMN, See Business Process Modeling and Notation (BPMN)

Business

analytics, 2, 3, 21

cockpit, 149

model, 4, 23

understanding, 16

Business process, 6, 11, 12, 36, 39, 119

compliance, 246

views, 8

Business Process Modeling and Notation (BPMN), 54, 121, 247



CART, 183

Causal matrix, 261

Censored data, 220

Centrality, 280

Chapman–Kolmogorov equations, 225

Circle, 52

Circular layout, 281

Clarity of a model, 44

Classification, 156

Closeness, 281

Cluster, 193

analysis, 193

tree, 196

Co-clustering, 305

Coherence (data quality), 79

Comparability of a model, 44



© Springer-Verlag Berlin Heidelberg 2015

W. Grossmann, S. Rinderle-Ma, Fundamentals of Business Intelligence,

Data-Centric Systems and Applications, DOI 10.1007/978-3-662-46531-8




Comparison cloud, 301

Complete linkage, 195

Completeness (data quality), 79

Concept drift, 258

Conceptual modeling, 41

Conditional distribution, 63

Confidence, 235

bands, 69, 141, 170

interval, 69

regions, 69

Conformance checking, 246, 255

Confusion matrix, 174

Consistency (data quality), 79

Contour plot, 138

Control flow, 9, 55, 122

Coordinates (visualization), 130

Corpus, 295

Correctness of a model, 44

Correlation, 65, 140, 146, 163

Correlation matrix, 143

Covariance, 65

Cox regression, 223

Critical layer, 96

CRM use case

classification, 191

clustering, 200

data quality, 148

description, 30

prediction, 165, 166

principal components, 143

variable description, 138

Cross entropy, 174

Crossover, 262

Cross-sectional view, 9, 10, 12, 16, 120, 129, 155

Cross-validation, 170

Cross-validation, k-fold, 176

Curse of dimensionality, 160, 163, 178

Customer perspective, 6



Daisy, 194

Dashboard, 149

Data

cleaning, 81, 113

flow, 55

fusion, 112

integration, 108

mashup, 114

modeling technique, 15

provenance, 115

quality, 113, 120, 147




understanding, 119

understanding technique, 16

variety, 93, 96

velocity, 93, 95

veracity, 93, 96

volume, 93, 94

Degree, 52

centrality (sociogram), 280

of a node, 280

Delta analysis, 314

Dendrogram, 196

Density, 280, 281

Density estimate, 137

Dependent variable, 59

Deviance, 174

Dice (OLAP), 103

Dimension, 100

Directed graph, 52, 278

Dirichlet distribution, 228

Distance-based method, 193

Distance, in graphs, 281

Distribution

continuous, 63, 146

discrete, 63

empirical, 68

Distribution function, 62

Document term matrix (DTM), 299

Domain semantics, 41

Drill across, 103

Drill down, 100, 103

Dublin Core (DCMI), 294

Dummy variable, 73, 193, 202

Dyad (sociogram), 277

Dynamic process analysis, 248

Dynamic time warping, 215

EBMC2 use case

data considerations, 88

data extraction, 92

description, 25

Markov chain clustering, 230

process warehousing, 254

time to event analysis, 222

Economic efficiency of a model, 44

Edges, 51

Ego (sociogram), 281

Ego-centric measures (sociogram), 281

Elementary functions, 59

EM-algorithm, 202

Ergodic Markov chain, 226

Event-driven Process Chains (EPCs), 57, 121




Event

log, 99, 104, 105, 246

sequence, 208

set, 208

view, 8, 12, 16, 39, 56, 78, 120, 129, 208, 210

Explanatory variable, 59, 156, 162, 163, 173, 180, 195

Exponential loss, 190

Ex post analysis, 247

eXtensible Event Stream (XES), 99, 257

Extract-load-transform (ELT), 97

Extract-transform-load (ETL), 90



Facet (visualization), 130, 132

Fact (OLAP), 100

Feature extraction, 208

Fitness (of process model), 270

Fitness function, 261

Flat structure, 99

Frames, 49

Frequency distribution, 68, 137

Fruchterman Reingold layout, 282



Generalization (of process model), 270

Generalization error, 157

Generalized linear models, 72

Generic questions, 39

Genetic miner, 256, 260

Granularity level, 100

Graph

bipartite, 53, 56

database, 94

series-parallel, 53, 54



Hamming distance, 193

Hazard function, 221

Heat map, 140

HEP use case

clustering, 198

data anonymization, 89

description, 28

dynamic visualization, 132, 146

process mining, 258

variable description, 134

Heuristic miner, 256, 258, 262

Hidden Markov chain, 231

Hierarchical method, 194

Hierarchical structure, 99

Histogram, 137

HITS, 229




Hubness, 229

Hybrid structure, 99

iMine, 21, 38, 119

Impurity measure, 183

In-degree, 280

Independent random variables, 65

Independent variables, 59

Influential factors, 11

Infographics, 151

Integration

format, 98

strategy, 109

Irreducible Markov chain, 226

Item set, 233

Jittering, 130, 137

Joint distribution, 63, 138

Kernel(s), 60

function, 169

trick, 60, 188

Key performance indicator (KPI), 11, 41, 71, 78, 159, 255

Key value store, 94

Key word extraction algorithm (KEA), 309

KKLayout, 282

K-means, 199

K-nearest neighbor, 220

K-nearest neighbor classification, 185

KPI, See Key performance indicator (KPI)

Lasso, 164

Latent Dirichlet allocation (LDA), 306

Likelihood, 63

Linear function, 60

Linear regression, 159

Linear temporal logic (LTL), 76

Linkage, 195

Linked data, 113

Loading, 92

Load shedding, 95

Log format, 104

Logistic regression, 72, 168, 180, 191

Logistics use case

change mining, 264

description, 29

time warping, 216

Logit, 180

Log structure, 99, 104

Loop, 52




Machine learning, 19, 204

Mapping (data schema), 100

Mapping (visualization), 128

MapReduce, 94

Margin, 186

Marginal distribution, 63

Market basket analysis, 211

Markov chain, 70, 225

aperiodic state, 226

connected state, 226

periodic state, 226

reachable states, 226

recurrent state, 226

Markov property, 70

Maximum likelihood estimation, 68, 219, 227

Mean, 129, 138

Mean square error (MSE), 161

Median, 62, 137, 138

Medoid, 200

Meta-analysis, 313

Metadata, 81, 147, 294

Meta model, 43, 121, 122, 124

MHLAP, 101

Missing value, 80, 81, 138, 147, 184

Mixed models, 219

Model-based method, 193

Modeling

method, 39, 41, 42

task, 156

technique, 18, 43, 70

Models

analogical models, 37

complexity, 157

of data, 158

elements, 39

idealized models, 37

language, 39

language semantics, 39

phenomenological models, 37

quality criteria, 44

structure, 40, 156

MOLAP, 101

Monitoring, 113

Mosaic plot, 134

Motion chart, 133, 335

Multidimensional structure, 99

Multidimensional tables, 129

Multiple R-squared, 162

Multi-purpose automatic topic indexing (MAUI), 309

Mutation, 262

MXML, 257




Naive Bayes, 178

Neural nets, 159

n-grams, 298

Nodes, 51

Nonparametric models, 159

Nonparametric regression, 159

Normal distribution, 66, 137

Null hypothesis, 69, 162



Objectivity of a model, 44

Observable variable, 67

Observational studies, 75

Odds, 62, 180

Offline analysis, 247

Online analysis, 246

Online Analytical Processing (OLAP), 101

Ontology, 109

Operational measurement, 76

Operational model, 35

Opinion mining, 276

Organizational perspective, 6, 38, 55

Out-degree, 280

Overfitting, 158



Page rank algorithm, 229

PAIS, 283

Parallel coordinates, 145

Partition around medoids (PAM), 200, 305

Partitioning method, 194

Part-of-speech tagging (POS), 308

Path, 52

Path, vertex-disjoint, 52

Pattern (local behavior of business process), 45

Petri nets, 56, 121

Phenomenon, 36

Pie chart, 134

Pivot table, 129

dimensions, 129

summary attributes, 129

Polarity, 311

Population, 67

Posterior distribution, 228

Posterior probability, 65, 178

Post-processing, 259

Precision (of process model), 270

Predicate logic, 47

Predictive modeling, 156

Pre-eclampsia use case, description, 27

Pre-eclampsia use case, prediction, 170

Pre-eclampsia use case, response feature analysis, 218


