Out of the box, Spark does not come with S3 support. So running something like this in the spark-shell:

scala> spark.read.parquet("s3a://my-bucket/my-data.parquet").printSchema

will yield something like this:
2018-09-05 09:47:59 WARN FileStreamSink:66 - Error while looking for metadata directory.
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:705)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:389)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:388)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  ... 69 more

S3 support is easy to add, though. We need two jars: support for the S3 filesystem (hadoop-aws) and support for the S3 client (aws-java-sdk).
To download the correct versions, first check which Hadoop version your Spark distribution uses by looking into jars/: several of the hadoop-* jars carry the version in their file names. For Spark 2.3.1 that is Hadoop 2.7.3.
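If you prefer to double-check from a running spark-shell, Hadoop's VersionInfo class (part of hadoop-common, so already on the classpath) reports the bundled version. A small aside, not part of the original steps; the output shown is an example:

scala> org.apache.hadoop.util.VersionInfo.getVersion
res0: String = 2.7.3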
Next, find the matching version of hadoop-aws on Maven Central (https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) and download it.
Maven Central also lists the version of aws-java-sdk that hadoop-aws depends on. Download it as well and put both jars into jars/ in your Spark folder.
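As an alternative to copying jars manually (an aside, not part of the original walkthrough), Spark can resolve both artifacts for you at startup via the --packages flag; aws-java-sdk is pulled in as a transitive dependency of hadoop-aws:

spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3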
Configure AWS credentials for Spark in conf/spark-defaults.conf:

spark.hadoop.fs.s3a.access.key YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key YOUR_SECRET_KEY

Trying to access the data on S3 again should work now:
scala> spark.read.parquet("s3a://my-bucket/my-data.parquet").printSchema
root
 |-- date_time: long (nullable = true)
 |-- width: integer (nullable = true)
 |-- height: integer (nullable = true)
...
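If editing conf/spark-defaults.conf is not convenient (for example in an ad-hoc session), the same credentials can also be set from within the shell on the session's Hadoop configuration. A minimal sketch, using the same placeholder keys as above:

scala> // Equivalent of the spark.hadoop.* entries in spark-defaults.conf
scala> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
scala> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

Set these before the first read against the bucket, since Hadoop caches FileSystem instances once they have been created.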