Random Sample of RDD in Spark

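1) To get a random sample of your RDD (named data), say one with 100000 rows from which you want roughly 20% of the values, use data.sample with fraction set to 0.2. data.sample takes the parameters

Signature: data.sample(withReplacement, fraction, seed=None)

and .collect() brings the sampled rows back to the driver as a list. Note that fraction is the probability of keeping each row, not an exact quota, so you get about 20000 rows rather than exactly 20000.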

2) Use takeSample when you want to specify the sample by its size (say 100 rows) instead of by a fraction

data.takeSample(withReplacement, num, seed=None)
Return a fixed-size sampled subset of this RDD.



Simple, isn't it?

Author: Ajay Ohri

