Home » Posts tagged 'useR' (Page 3)
Tag Archives: useR
The noted Diamonds dataset in the ggplot2 package of R is actually culled from the website http://www.diamondse.info/diamond-prices.asp
However it has ~55000 diamonds, while the whole Diamonds search engine has almost ten times that number. Using iMacros – a Google Chrome Plugin, we can scrape that data (or almost any data). The iMacros chrome plugin is available at https://chrome.google.com/webstore/detail/cplklnmnlbnpmjogncfgfijoopmnlemp while notes on coding are at http://wiki.imacros.net
Imacros makes coding as easy as recording macro and the code is automatcially generated for whatever actions you do. You can set parameters to extract only specific parts of the website, and code can be run into a loop (of 9999 times!)
Here is the iMacros code-Note you need to navigate to the web site http://www.diamondse.info/diamond-prices.asp before running it
VERSION BUILD=5100505 RECORDER=CR
SET !EXTRACT_TEST_POPUP NO
SET !ERRORIGNORE YES
TAG POS=6 TYPE=TABLE ATTR=TXT:* EXTRACT=TXT
TAG POS=1 TYPE=DIV ATTR=CLASS:paginate_enabled_next
SAVEAS TYPE=EXTRACT FOLDER=* FILE=test+3
and voila- all the diamonds you need to analyze!
The returning data can be read using the standard delimiter data munging in the language of SAS or R.
More on IMacros from
Automate your web browser. Record and replay repetitious work
If you encounter any problems with iMacros for Chrome, please let us know in our Chrome user forum at http://forum.iopus.com/viewforum.php?f=21 Our forum is also the best place for new feature suggestions ---- iMacros was designed to automate the most repetitious tasks on the web. If there’s an activity you have to do repeatedly, just record it in iMacros. The next time you need to do it, the entire macro will run at the click of a button! With iMacros, you can quickly and easily fill out web forms, remember passwords, create a webmail notifier, and more. You can keep the macros on your computer for your own use, use them within bookmark sync / Xmarks or share them with others by embedding them on your homepage, blog, company Intranet or any social bookmarking service as bookmarklet. The uses are limited only by your imagination! Popular uses are as web macro recorder, form filler on steroids and highly-secure password manager (256-bit AES encryption).
I really liked the software Qbittorent available from http://www.qbittorrent.org/ I think bit torrents should be the default way of sharing huge content especially software downloads. For protecting intellectual property there should be much better codes and software keys than presently available.
The qBittorrent project aims to provide a Free Software alternative to µtorrent. Additionally, qBittorrent runs and provides the same features on all major platforms (Linux, Mac OS X, Windows, OS/2, FreeBSD).
qBittorrent is based on Qt4 toolkit and libtorrent-rasterbar.
qBittorrent v2 Features
- Polished µTorrent-like User Interface
- Well-integrated and extensible Search Engine
- Simultaneous search in most famous BitTorrent search sites
- Per-category-specific search requests (e.g. Books, Music, Movies)
- All Bittorrent extensions
- DHT, Peer Exchange, Full encryption, Magnet/BitComet URIs, …
- Remote control through a Web user interface
- Nearly identical to the regular UI, all in Ajax
- Advanced control over trackers, peers and torrents
- Torrents queueing and prioritizing
- Torrent content selection and prioritizing
- UPnP / NAT-PMP port forwarding support
- Available in ~25 languages (Unicode support)
- Torrent creation tool
- Advanced RSS support with download filters (inc. regex)
- Bandwidth scheduler
- IP Filtering (eMule and PeerGuardian compatible)
- IPv6 compliant
- Sequential downloading (aka “Download in order”)
- Available on most platforms: Linux, Mac OS X, Windows, OS/2, FreeBSDSo if you are new to Bit Torrents- here is a brief tutorialSome terminology from
- A tracker is a server that keeps track of which seeds and peers are in the swarm.
- A Seed is used to refer to a peer who has 100% of the data. When a leech obtains 100% of the data, that peer automatically becomes a Seed.
- A peer is one instance of a BitTorrent client running on a computer on the Internet to which other clients connect and transfer data.
- A leech is a term with two meanings. Primarily leech (or leeches) refer to a peer (or peers) who has a negative effect on the swarm by having a very poor share ratio (downloading much more than they upload, creating a ratio less than 1.0)1) Download and install the software from http://www.qbittorrent.org/2) If you want to search for new files, you can use the nice search features in here3) If you want to CREATE new bit torrents- go to Tools -Torrent Creator4) For sharing content- just seed the torrent you just created. What is seeding – hey did you read the terminology in the beginning?5) Additionally -From
Trackers: Below are some popular public trackers. They are servers which help peers to communicate.
Here are some good trackers you can use:
- When a file is new, much time can be wasted because the seeding client might send the same file piece to many different peers, while other pieces have not yet been downloaded at all. Some clients, like ABC, Vuze, BitTornado, TorrentStorm, and µTorrent have a “super-seed” mode, where they try to only send out pieces that have never been sent out before, theoretically making the initial propagation of the file much faster. However the super-seeding becomes less effective and may even reduce performance compared to the normal “rarest first” model in cases where some peers have poor or limited connectivity. This mode is generally used only for a new torrent, or one which must be re-seeded because no other seeds are available.
- Note- you use this tutorial and any or all steps at your own risk. I am not legally responsible for any mishaps you get into. Please be responsible while being an efficient bit tor renter. That means respecting individual property rights.
Early Registration Deadline Approaches for UseR 2012
- Early Registration: Jan 23 24 – Feb 29
- Regular Registration: Mar 1 – May 12
- Late Registration: May 13 – June 4
- On-site Registration: June 12 – June 15
Vanderbilt University; Nashville, Tennessee, USA
12th-15th June 2012
Assistant to the Chair
Vanderbilt University School of Medicine
Department of Biostatistics
S-2323 Medical Center North
Nashville, TN 37232-2158
I almost missed this because of my vacation and traveling
Rapid Miner has a tonne of new stuff (Statuary Ethics Declaration- Rapid Miner has been an advertising partner for Decisionstats – see the right margin)
Great New Graphical Plotters
and some flashy work
and a great series of educational lectures
A Simple Explanation of Decision Tree Modeling based on Entropies
Description of some of the basics of decision trees. Simple and hardly any math, I like the plots explaining the basic idea of the entropy as splitting criterion (although we actually calculate gain ratio differently than explained…)
Logistic Regression for Business Analytics using RapidMiner
Same as above, but this time for modeling with logistic regression.
Easy to read and covering all basic ideas together with some examples. If you are not familiar with the topic yet, part 1 (see below) might help.
and lastly a new research project for collaborative data mining
e-LICO Architecture and Components
The goal of the e-LICO project is to build a virtual laboratory for interdisciplinary collaborative research in data mining and data-intensive sciences. The proposed e-lab will comprise three layers: the e-science and data mining layers will form a generic research environment that can be adapted to different scientific domains by customizing the application layer.
- Drag a data set into one of the slots. It will be automatically detected as training data, test data or apply data, depending on whether it has a label or not.
- Select a goal. The most frequent one is probably “Predictive Modelling”. All goals have comments, so you see what they can be used for.
- Select “Fetch plans” and wait a bit to get a list of processes that solve your problem. Once the planning completes, select one of the processes (you can see a preview at the right) and run it. Alternatively, select multiple (selecting none means selecting all) and evaluate them on your data in a batch.
The assistant strives to generate processes that are compatible with your data. To do so, it performs a lot of clever operations, e.g., it automatically replaces missing values if missing values exist and this is required by the learning algorithm or performs a normalization when using a distance-based learner.
You can install the extension directly by using the Rapid-I Marketplace instead of the old update server. Just go to the preferences and enter http://rapidupdate.de:8180/UpdateServer as the update URL
Of course Rapid Miner has been of the most professional open source analytics company and they have been doing it for a long time now. I am particularly impressed by the product map (see below) and the graphical user interface.
Just click on the products in the overview below in order to get more information about Rapid-I products.
What is Radoop? Quite possibly an exciting mix of analytics and big data computing
What is Radoop?
Hadoop is an excellent tool for analyzing large data sets, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but its data size is limited by the memory available, and a single machine is often not enough to run the analyses on time. In this project, we combine the strengths of both projects and provide a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop.
We have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.
and what’s new
Radoop 0.3 released – fully graphical big data analytics
Today, Radoop had a major step forward with its 0.3 release. The new version of the visual big data analytics package adds full support for all major Hadoop distributions used these days: Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution including Apache Hadoop 3 (CDH3). It also adds support for large clusters by allowing the namenode, the jobtracker and the Hive server to reside on different nodes.
As Radoop’s promise is to make big data analytics easier, the 0.3 release is also focused on improving the user interface. It has an enhanced breakpointing system which allows to investigate intermediate results, and it adds dozens of quick fixes, so common process design mistakes get much easier to solve.
There are many further improvements and fixes, so please consult the release notes for more details. Radoop is in private beta mode, but heading towards a public release in Q2 2012. If you would like to get early access, then please apply at the signup page or describe your use case in email (beta at radoop.eu).
Radoop 0.3 (15 February 2012)
- Support for Apache Hadoop 0.20.2, 0.20.203, 1.0 and Cloudera’s Distribution Including Apache Hadoop 3 (CDH3) in a single release
- Support for clusters with separate master nodes (namenode, jobtracker, Hive server)
- Enhanced breakpointing to evaluate intermediate results
- Dozens of quick fixes for the most common process design errors
- Improved process design and error reporting
- New welcome perspective to help in the first steps
- Many bugfixes and performance improvements
Radoop 0.2.2 (6 December 2011)
- More Aggregate functions and distinct option
- Generate ID operator for convenience
- Numerous bug fixes and improvements
- Improved user interface
Radoop 0.2.1 (16 September 2011)
- Set Role and Data Multiplier operators
- Management panel for testing Hadoop connections
- Stability improvements for Hive access
- Further small bugfixes and improvements
Radoop 0.2 (26 July 2011)
- Three new algoritms: Fuzzy K-Means, Canopy, and Dirichlet clustering
- Three new data preprocessing operators: Normalize, Replace, and Replace Missing Values
- Significant speed improvements in data transmission and interactive analytics
- Increased stability and speedup for K-Means
- More flexible settings for Join operations
- More meaningful error messages
- Other small bugfixes and improvements
Radoop 0.1 (14 June 2011)
Initial release with 26 operators for data transmission, data preprocessing, and one clustering algorithm.
Note that Rapid Miner also has a great R extension so you can use R, a graphical interface and big data analytics is now easier and more powerful than ever.