Home » Posts tagged 'knowledge discovery'

Tag Archives: knowledge discovery

How to make inforgraphics easily- Use Infogr.am

What is an Infographic?

http://en.wikipedia.org/wiki/Infographic

Information graphics or infographics are graphic visual representations of information, data or knowledge intended to present complex information quickly and clearly.[1][2] They can improve cognition by utilizing graphics to enhance the human visual system’s ability to see patterns and trends.[3][4] The process of creating infographics can be referred to as data visualization, information design, or information architecture.[2]

What is Infogr.am?

It Create infographics and interactive online charts. It’s free and super-easy!

How?

Step 1

Login using Twitter or Facebook

Step 2

i1Create

Step 3

i2Choose New Infographic or New Chart?

Step 4

i3Create using options-you can edit the table, figures, text, colors etc

That’s it

Use Infogr.am to make inforgraphics easily!

 

Visual Guides to CRISP-DM ,KDD and SEMMA

UPDATED- Here are three great examples of a visualization making a process easy to understand. Please click on the images to read them clearly.

1) It visualizes CRISP-DM and is made by Nicole Leaper (http://exde.wordpress.com/2009/03/13/a-visual-guide-to-crisp-dm-methodology/)

12345

2) KDD -Knowledge Discovery in Databases -visualization by Fayyad whom I have interviewed here at http://www.decisionstats.com/interview-dr-usama-fayyad-founder-open-insights-llc/

and work By Gregory Piatetsky Shapiro interviewed by this website here

http://decisionstats.com/2009/08/13/interview-gregory-piatetsky-kdnuggets-com/

kdd

3) I am also attaching a visual representation of SEMMA from http://www.dataprix.net/en/blogs/respinosamilla/theory-data-mining

metodo-semma

 

Analytics for Cyber Conflict -Part Deux

Part 1 in this series is avaiable at http://www.decisionstats.com/analytics-for-cyber-conflict/

The next articles in this series will cover-

  1. the kind of algorithms that are currently or being proposed for cyber conflict, as well as or detection

Cyber Conflict requires some basic elements of the following broad disciplines within Computer and Information Science (besides the obvious disciplines of heterogeneous database types for different kinds of data) -

1) Cryptography – particularly a cryptographic  hash function that maximizes cost and time of the enemy trying to break it.

From http://en.wikipedia.org/wiki/Cryptographic_hash_function

The ideal cryptographic hash function has four main or significant properties:

  • it is easy (but not necessarily quick) to compute the hash value for any given message
  • it is infeasible to generate a message that has a given hash
  • it is infeasible to modify a message without changing the hash
  • it is infeasible to find two different messages with the same hash

A commercial spin off is to use this to anonymized all customer data stored in any database, such that no database (or data table) that is breached contains personally identifiable information. For example anonymizing the IP Addresses and DNS records with a mashup  (embedded by default within all browsers) of Tor and MafiaaFire extensions can help create better information privacy on the internet.

This can also help in creating better encryption between Instant Messengers in Communication

2) Data Disaster Planning for Data Storage (but also simulations for breaches)- including using cloud computing, time sharing, or RAID for backing up data. Planning and creating an annual (?) exercise for a simulated cyber breach of confidential just like a cyber audit- similar to an annual accounting audit

3) Basic Data Reduction Algorithms for visualizing large amounts of information. This can include

  1. K Means Clustering, http://www.jstor.org/pss/2346830 , http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf , and http://stackoverflow.com/questions/6372397/k-means-with-really-large-matrix
  2. Topic Models (LDA) http://www.decisionstats.com/topic-models/,
  3. Social Network Analysis http://en.wikipedia.org/wiki/Social_network_analysis,
  4. Graph Analysis http://micans.org/mcl/ and http://www.ncbi.nlm.nih.gov/pubmed/19407357
  5. MapReduce and Parallelization algorithms for computational boosting http://www.slideshare.net/marin_dimitrov/large-scale-data-analysis-with-mapreduce-part-i

In the next article we will examine

  1. the role of non state agents as well as state agents competing and cooperating,
  2. and what precautions can knowledge discovery in databases practitioners employ to avoid breaches of security, ethics, and regulation.

Analytics for Cyber Conflict

 

The emerging use of Analytics and Knowledge Discovery in Databases for Cyber Conflict and Trade Negotiations

 

The blog post is the first in series or articles on cyber conflict and the use of analytics for targeting in both offense and defense in conflict situations.

 

It covers knowledge discovery in four kinds of databases (so chosen because of perceived importance , sensitivity, criticality and functioning of the geopolitical economic system)-

  1. Databases on Unique Identity Identifiers- including next generation biometric databases connected to Government Initiatives and Banking, and current generation databases of identifiers like government issued documents made online
  2. Databases on financial details -This includes not only traditional financial service providers but also online databases with payment details collected by retail product selling corporates like Sony’s Playstation Network, Microsoft ‘s XBox and
  3. Databases on contact details – including those by offline businesses collecting marketing databases and contact details
  4. Databases on social behavior- primarily collected by online businesses like Facebook , and other social media platforms.

It examines the role of

  1. voluntary privacy safeguards and government regulations ,

  2. weak cryptographic security of databases,

  3. weakness in balancing marketing ( maximized data ) with privacy (minimized data)

  4. and lastly the role of ownership patterns in database owning corporates

A small distinction between cyber crime and cyber conflict is that while cyber crime focusses on stealing data, intellectual property and information  to primarily maximize economic gains

cyber conflict focuses on stealing information and also disrupt effective working of database backed systems in order to gain notional competitive advantages in economics as well as geo-politics. Cyber terrorism is basically cyber conflict by non-state agents or by designated terrorist states as defined by the regulations of the “target” entity. A cyber attack is an offensive action related to cyber-infrastructure (like the Stuxnet worm that disabled uranium enrichment centrifuges of Iran). Cyber attacks and cyber terrorism are out of scope of this paper, we will concentrate on cyber conflicts involving databases.

Some examples are given here-

Types of Knowledge Discovery in -

1) Databases on Unique Identifiers- including biometric databases.

Unique Identifiers or primary keys for identifying people are critical for any intensive knowledge discovery program. The unique identifier generated must be extremely secure , and not liable to reverse engineering of the cryptographic hash function.

For biometric databases, an interesting possibility could be determining the ethnic identity from biometric information, and also mapping relatives. Current biometric information that is collected is- fingerprint data, eyes iris data, facial data. A further feature could be adding in voice data as a part of biometric databases.

This is subject to obvious privacy safeguards.

For example, Google recently unveiled facial recognition to unlock Android 4.0 mobiles, only to find out that the security feature could easily be bypassed by using a photo of the owner.

 

 

Example of Biometric Databases

In Afghanistan more than 2 million Afghans have contributed iris, fingerprint, facial data to a biometric database. In India, 121 million people have already been enrolled in the largest biometric database in the world. More than half a million customers of the Tokyo Mitsubishi Bank are are already using biometric verification at ATMs.

Examples of Breached Online Databases

In 2011, Playstation Network by Sony (PSN) lost data of 77 million customers including personal information and credit card information. Additionally data of 24 million customers were lost by Sony’s Sony Online Entertainment. The websites of open source platforms like SourceForge, WineHQ and Kernel.org were also broken into 2011. Even retailers like McDonald and Walgreen reported database breaches.

 

The role of cyber conflict arises in the following cases-

  1. Databases are online for accessing and authentication by proper users. Databases can be breached remotely by non-owners ( or “perpetrators”) non with much lesser chance of intruder identification, detection and penalization by regulators, or law enforcers (or “protectors”) than offline modes of intellectual property theft.

  2. Databases are valuable to external agents (or “sponsors”) subsidizing ( with finance, technology, information, motivation) the perpetrators for intellectual property theft. Databases contain information that can be used to disrupt the functioning of a particular economy, corporation (or “ primary targets”) or for further chain or domino effects in accessing other data (or “secondary targets”)

  3. Loss of data is more expensive than enhanced cost of security to database owners

  4. Loss of data is more disruptive to people whose data is contained within the database (or “customers”)

So the role play for different people for these kind of databases consists of-

1) Customers- who are in the database

2) Owners -who own the database. They together form the primary and secondary targets.

3) Protectors- who help customers and owners secure the databases.

and

1) Sponsors- who benefit from the theft or disruption of the database

2) Perpetrators- who execute the actual theft and disruption in the database

The use of topic models and LDA is known for making data reduction on text, and the use of data visualization including tied to GPS based location data is well known for investigative purposes, but the increasing complexity of both data generation and the sophistication of machine learning driven data processing makes this an interesting area to watch.

 

 

The next article in this series will cover-

the kind of algorithms that are currently or being proposed for cyber conflict, the role of non state agents , and what precautions can knowledge discovery in databases practitioners employ to avoid breaches of security, ethics, and regulation.

Citations-

  1. Michael A. Vatis , CYBER ATTACKS DURING THE WAR ON TERRORISM: A PREDICTIVE ANALYSIS Dartmouth College (Institute for Security Technology Studies).
  2. From Data Mining to Knowledge Discovery in Databases Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyt

Knowledge Discovery in Databases -KDD using PostgreSQL and #Rstats

Here is a small brief primer for beginners on configuring an open source database and using an open source analytics package.

All you need to know – is to read!

 

1. download PostgreSQL from
http://www.postgresql.org/download/windowsInstall PostgreSQL

Remember to store /memorize the password for the user postgres!

Create a connection using pgAdmin feature in Start Menu

2. download ODBC driver from
http://www.postgresql.org/ftp/odbc/versions/msi/
and the Win 64 edition from
http://wwwmaster.postgresql.org/download/mirrors-ftp/odbc/versions/msi/psqlodbc_09_00_0310-x64.zip

install ODBC driver

3. Go to

Start Menu\Control Panel\All Control Panel Items\Administrative Tools\Data Sources (ODBC)

4. Configure the following details in System DSN and  User DSN using the ADD tabs .Test connection to check if connection is working

5. Start R and install and load library RODBC

6. Use following initial code for R- if you know SQL you can  do the rest
> library(RODBC)

> odbcDataSources(type = c(“all”, “user”, “system”))
SQLServer              PostgreSQL30             PostgreSQL35W
“SQL Server”    “PostgreSQL ANSI(x64)” “PostgreSQL Unicode(x64)”

> ajay=odbcConnect(“PostgreSQL30″, uid = “postgres”, pwd = “XX”)

> sqlTables(ajay)
TABLE_QUALIFIER TABLE_OWNER TABLE_NAME TABLE_TYPE REMARKS
1        postgres      public      names      TABLE

> crimedat <- sqlFetch(ajay, “names”)

New book on BigData Analytics and Data mining using #Rstats with a GUI

Joseph Marie Jacquard

Image via Wikipedia

I am hoping to put this on my pre-ordered or Amazon Wish list. The book the common people who wanted to do data mining with , but were unable to ask aloud they didnt know much.  It is written by the seminal Australian authority on data mining Dr Graham Williams whom I interviewed here at http://decisionstats.com/2009/01/13/interview-dr-graham-williams/

Data Mining for the masses using an ergonomically designed Graphical User Interface.

Thank you Springer. Thank you Dr Graham Williams

http://www.springer.com/statistics/physical+%26+information+science/book/978-1-4419-9889-7

Data Mining with Rattle and R

Data Mining with Rattle and R

The Art of Excavating Data for Knowledge Discovery

Series: Use R

Williams, Graham

1st Edition., 2011, XX, 409 p. 150 illus. in color.

  • Softcover, ISBN 978-1-4419-9889-7

    Due: August 29, 2011

    54,95 €
  • Encourages the concept of programming with data – more than just pushing data through tools, but learning to live and breathe the data
  • Accessible to many readers and not necessarily just those with strong backgrounds in computer science or statistics
  • Details some of the more popular algorithms for data mining, as well as covering model evaluation and model deployment

Data mining is the art and science of intelligent data analysis. By building knowledge from information, data mining adds considerable value to the ever increasing stores of electronic data that abound today. In performing data mining many decisions need to be made regarding the choice of methodology, the choice of data, the choice of tools, and the choice of algorithms.

Throughout this book the reader is introduced to the basic concepts and some of the more popular algorithms of data mining. With a focus on the hands-on end-to-end process for data mining, Williams guides the reader through various capabilities of the easy to use, free, and open source Rattle Data Mining Software built on the sophisticated R Statistical Software. The focus on doing data mining rather than just reading about data mining is refreshing.

The book covers data understanding, data preparation, data refinement, model building, model evaluation,  and practical deployment. The reader will learn to rapidly deliver a data mining project using software easily installed for free from the Internet. Coupling Rattle with R delivers a very sophisticated data mining environment with all the power, and more, of the many commercial offerings.

Content Level » Research

Keywords » Data mining

Related subjects » Physical & Information Science

Related- http://decisionstats.com/2009/01/13/interview-dr-graham-williams/

Interview Ajay Ohri Decisionstats.com with DMR

From-

http://www.dataminingblog.com/data-mining-research-interview-ajay-ohri/

Here is the winner of the Data Mining Research People Award 2010: Ajay Ohri! Thanks to Ajay for giving some time to answer Data Mining Research questions. And all the best to his blog, Decision Stat!

Data Mining Research (DMR): Could you please introduce yourself to the readers of Data Mining Research?

Ajay Ohri (AO): I am a business consultant and writer based out of Delhi- India. I have been working in and around the field of business analytics since 2004, and have worked with some very good and big companies primarily in financial analytics and outsourced analytics. Since 2007, I have been writing my blog at http://decisionstats.com which now has almost 10,000 views monthly.

All in all, I wrote about data, and my hobby is also writing (poetry). Both my hobby and my profession stem from my education ( a masters in business, and a bachelors in mechanical engineering).

My research interests in data mining are interfaces (simpler interfaces to enable better data mining), education (making data mining less complex and accessible to more people and students), and time series and regression (specifically ARIMAX)
In business my research interests software marketing strategies (open source, Software as a service, advertising supported versus traditional licensing) and creation of technology and entrepreneurial hubs (like Palo Alto and Research Triangle, or Bangalore India).

DMR: I know you have worked with both SAS and R. Could you give your opinion about these two data mining tools?

AO: As per my understanding, SAS stands for SAS language, SAS Institute and SAS software platform. The terms are interchangeably used by people in industry and academia- but there have been some branding issues on this.
I have not worked much with SAS Enterprise Miner , probably because I could not afford it as business consultant, and organizations I worked with did not have a budget for Enterprise Miner.
I have worked alone and in teams with Base SAS, SAS Stat, SAS Access, and SAS ETS- and JMP. Also I worked with SAS BI but as a user to extract information.
You could say my use of SAS platform was mostly in predictive analytics and reporting, but I have a couple of projects under my belt for knowledge discovery and data mining, and pattern analysis. Again some of my SAS experience is a bit dated for almost 1 year ago.

I really like specific parts of SAS platform – as in the interface design of JMP (which is better than Enterprise Guide or Base SAS ) -and Proc Sort in Base SAS- I guess sequential processing of data makes SAS way faster- though with computing evolving from Desktops/Servers to even cheaper time shared cloud computers- I am not sure how long Base SAS and SAS Stat can hold this unique selling proposition.

I dislike the clutter in SAS Stat output, it confuses me with too much information, and I dislike shoddy graphics in the rendering output of graphical engine of SAS. Its shoddy coding work in SAS/Graph and if JMP can give better graphics why is legacy source code preventing SAS platform from doing a better job of it.

I sometimes think the best part of SAS is actually code written by Goodnight and Sall in 1970’s , the latest procs don’t impress me much.

SAS as a company is something I admire especially for its way of treating employees globally- but it is strange to see the rest of tech industry not following it. Also I don’t like over aggression and the SAS versus Rest of the Analytics /Data Mining World mentality that I sometimes pick up when I deal with industry thought leaders.

I think making SAS Enterprise Miner, JMP, and Base SAS in a completely new web interface priced at per hour rates is my wishlist but I guess I am a bit sentimental here- most data miners I know from early 2000’s did start with SAS as their first bread earning software. Also I think SAS needs to be better priced in Business Intelligence- it seems quite cheap in BI compared to Cognos/IBM but expensive in analytical licensing.

If you are a new stats or business student, chances are – you may know much more R than SAS today. The shift in education at least has been very rapid, and I guess R is also more of a platform than a analytics or data mining software.

I like a lot of things in R- from graphics, to better data mining packages, modular design of software, but above all I like the can do kick ass spirit of R community. Lots of young people collaborating with lots of young to old professors, and the energy is infectious. Everybody is a CEO in R ’s world. Latest data mining algols will probably start in R, published in journals.

Which is better for data mining SAS or R? It depends on your data and your deadline. The golden rule of management and business is -it depends.

Also I have worked with a lot of KXEN, SQL, SPSS.

DMR: Can you tell us more about Decision Stats? You have a traffic of 120′000 for 2010. How did you reach such a success?

AO: I don’t think 120,000 is a success. Its not a failure. It just happened- the more I wrote, the more people read.In 2007-2008 I used to obsess over traffic. I tried SEO, comments, back linking, and I did some black hat experimental stuff. Some of it worked- some didn’t.

In the end, I started asking questions and interviewing people. To my surprise, senior management is almost always more candid , frank and honest about their views while middle managers, public relations, marketing folks can be defensive.

Social Media helped a bit- Twitter, Linkedin, Facebook really helped my network of friends who I suppose acted as informal ambassadors to spread the word.
Again I was constrained by necessity than choices- my middle class finances ( I also had a baby son in 2007-my current laptop still has some broken keys :) – by my inability to afford traveling to conferences, and my location Delhi isn’t really a tech hub.

The more questions I asked around the internet, the more people responded, and I wrote it all down.

I guess I just was lucky to meet a lot of nice people on the internet who took time to mentor and educate me.

I tried building other websites but didn’t succeed so i guess I really don’t know. I am not a smart coder, not very clever at writing but I do try to be honest.

Basic economics says pricing is proportional to demand and inversely proportional to supply. Honest and candid opinions have infinite demand and an uncertain supply.

DMR: There is a rumor about a R book you plan to publish in 2011 :-) Can you confirm the rumor and tell us more?

AO: I just signed a contract with Springer for ” R for Business Analytics”. R is a great software, and lots of books for statistically trained people, but I felt like writing a book for the MBAs and existing analytics users- on how to easily transition to R for Analytics.

Like any language there are tricks and tweaks in R, and with a focus on code editors, IDE, GUI, web interfaces, R’s famous learning curve can be bent a bit.

Making analytics beautiful, and simpler to use is always a passion for me. With 3000 packages, R can be used for a lot more things and a lot more simply than is commonly understood.
The target audience however is business analysts- or people working in corporate environments.

Brief Bio-
Ajay Ohri has been working in the field of analytics since 2004 , when it was a still nascent emerging Industries in India. He has worked with the top two Indian outsourcers listed on NYSE,and with Citigroup on cross sell analytics where he helped sell an extra 50000 credit cards by cross sell analytics .He was one of the very first independent data mining consultants in India working on analytics products and domestic Indian market analytics .He regularly writes on analytics topics on his web site www.decisionstats.com and is currently working on open source analytical tools like R besides analytical software like SPSS and SAS.

Follow

Get every new post delivered to your Inbox.

Join 844 other followers