2) To get a random sample of your RDD (named data), say one with 100000 rows, keeping about 2% of the values (fraction = 0.02):
data.sample(False,0.02,None).collect()
where data.sample takes the parameters
?data.sample
Signature: data.sample(withReplacement, fraction, seed=None)
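and .collect() returns the sampled rows to the driver as a list. Note that fraction is the expected share of rows, so the size of the sample is approximate, not exact.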
3) Use takeSample when you want to specify the size of the sample (say 100) instead of a fraction. It returns a list directly, so no .collect() is needed:
data.takeSample(False,100)
Signature: data.takeSample(withReplacement, num, seed=None)
Docstring: Return a fixed-size sampled subset of this RDD.
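To make the difference between the two calls concrete, here is a rough pure-Python analogy of what you observe from each one. It is only a sketch: the helper names bernoulli_sample and fixed_size_sample are made up for illustration, it uses the standard random module rather than Spark, and it does not reproduce how Spark actually implements either method.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    # Hypothetical helper, analogous to data.sample(False, fraction, seed):
    # each row is kept independently with probability `fraction`,
    # so the result size is only approximately fraction * len(rows).
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def fixed_size_sample(rows, num, seed=None):
    # Hypothetical helper, analogous to data.takeSample(False, num, seed):
    # exactly `num` distinct rows are returned.
    rng = random.Random(seed)
    return rng.sample(rows, num)

rows = list(range(100000))            # stand-in for an RDD with 100000 rows
approx = bernoulli_sample(rows, 0.02, seed=42)
exact = fixed_size_sample(rows, 100, seed=42)

print(len(approx))  # close to 2000, but varies with the seed
print(len(exact))   # always 100
```

So if you need an exact number of rows, takeSample is the one to reach for; if an approximate share of the data is fine, sample is enough.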
Simple, isn't it?