Amazon S3 - PySpark: select a subset of files from S3 using regex/glob

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a number of files, each segregated by date (date=yyyymmdd), on Amazon S3. The files go back 6 months, but I would like to restrict my script to use only the last 3 months of data. I am unsure whether I can use regular expressions to do something like sc.textFile("s3://path_to_dir/yyyy[m1,m2,m3]*")

where m1, m2, m3 represent the 3 months, counting back from the current date, that I would like to use.

One discussion also suggested using something like sc.textFile("s3://path_to_dir/yyyym1*","s3://path_to_dir/yyyym2*","s3://path_to_dir/yyyym3*") but that doesn't seem to work for me.

Does sc.textFile() take regular expressions? I know you can use glob expressions, but I am unsure how to represent the above case as one.

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:23:27+0000

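For your first option, use curly braces instead of square brackets. In a glob, [...] matches a single character from a set, while {a,b,c} matches any one of the comma-separated alternatives: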

sc.textFile("s3://path_to_dir/yyyy{m1,m2,m3}*")
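If you want to build that brace pattern from the current date rather than hard-code it, something like the following might work. It is only a sketch: the helper names last_months and month_glob are made up for illustration, it assumes "last 3 months" includes the current month, and it puts the full yyyymm inside the braces so the pattern still works across a year boundary. Adjust the base path to match your real key layout (for example, add a date= prefix if your keys have one).

```python
from datetime import date

def last_months(today, n=3):
    """Return the last n calendar months (including today's) as 'yyyymm', oldest first."""
    months = []
    year, month = today.year, today.month
    for _ in range(n):
        months.append(f"{year:04d}{month:02d}")
        month -= 1
        if month == 0:  # step back across a year boundary
            year, month = year - 1, 12
    return months[::-1]

def month_glob(base, today, n=3):
    """Build a brace glob such as base + '{202108,202109,202110}*'."""
    joined = ",".join(last_months(today, n))
    return base + "{" + joined + "}*"

# Fixed date used here so the result is reproducible; use date.today() in practice.
pattern = month_glob("s3://path_to_dir/", date(2022, 1, 15))
print(pattern)  # s3://path_to_dir/{202111,202112,202201}*
```

The resulting string can then be passed straight to sc.textFile(pattern).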

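For your second option, sc.textFile takes a single path string (in PySpark its second positional argument is minPartitions, not another path), which is why passing three paths as separate arguments fails. Instead, read each glob into its own RDD and then union those RDDs into one: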

m1 = sc.textFile("s3://path_to_dir/yyyym1*")
m2 = sc.textFile("s3://path_to_dir/yyyym2*")
m3 = sc.textFile("s3://path_to_dir/yyyym3*")
# named "combined" rather than "all" so Python's built-in all() is not shadowed
combined = m1.union(m2).union(m3)
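With more than a few months, chaining .union() by hand gets tedious. A rough sketch of two alternatives: SparkContext.union takes a list of RDDs, and textFile also accepts several paths in one string if they are separated by commas. The helper names month_paths and read_months and the month values are placeholders, not anything from your setup.

```python
def month_paths(base, months):
    """One glob per month, e.g. 's3://path_to_dir/202108*'."""
    return [f"{base}{m}*" for m in months]

def read_months(sc, base, months):
    """Read each monthly glob into its own RDD, then union them in one call."""
    rdds = [sc.textFile(p) for p in month_paths(base, months)]
    return sc.union(rdds)

months = ["202108", "202109", "202110"]  # placeholder values
paths = month_paths("s3://path_to_dir/", months)

# Alternative: a single comma-separated string for one textFile call.
print(",".join(paths))

# With a live SparkContext named sc (not created here):
# combined = read_months(sc, "s3://path_to_dir/", months)
# combined = sc.textFile(",".join(paths))
```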

You can use globs with sc.textFile but not full regular expressions.
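To sanity-check a pattern before pointing Spark at S3, you could preview it locally against a few sample file names. The sketch below only approximates the glob matching Spark relies on: it expands simple {a,b,c} groups by hand and then uses Python's fnmatch, whose * also matches "/" (a Hadoop-style glob does not cross path separators), so treat it as a rough check on bare file names. The helper names and sample names are invented.

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand simple {a,b,c} groups into a list of plain glob patterns."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    expanded = []
    for alt in match.group(1).split(","):
        expanded.extend(expand_braces(head + alt + tail))
    return expanded

def preview(pattern, names):
    """Return the names that match any expansion of the pattern."""
    globs = expand_braces(pattern)
    return [n for n in names if any(fnmatch.fnmatchcase(n, g) for g in globs)]

names = ["20210715", "20210801", "20210915", "20211023"]  # made-up sample names
print(preview("2021{08,09,10}*", names))  # ['20210801', '20210915', '20211023']
```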
