Spark Scala - Read & Write files from MongoDB

Github Project : example-spark-scala-read-and-write-from-mongo

Common part

sbt Dependencies

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.1.0"

assembly Dependency

// In build.sbt
import sbt.Keys._
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

// In project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

Mongo URI

Mongo URI are like that : mongodb://username:password@host:27017/database

Default port is 27017.

Init SparkContext and SQLContext

You need to define an input and and output collection depending to your needs.

These collection can be different.

To read from a collection, you need to choose a Partitioner that defines how documents are partitioned in your created dataframe

See the documentation for a list of all available Partitioners : https://docs.mongodb.com/spark-connector/configuration/

// Creation of SparkSession
val sparkSession = SparkSession.builder()
  .appName("example-spark-scala-read-and-write-from-mongo")
  // Configuration for writing in a Mongo collection
  .config("spark.mongodb.output.uri", params.mongoUri)
  .config("spark.mongodb.output.collection", "restaurants")
  // Configuration for reading a Mongo collection
  .config("spark.mongodb.input.uri", params.mongoUri)
  .config("spark.mongodb.input.collection", "restaurants")
  // Type of Partitionner to use to transform Documents to dataframe
  .config("spark.mongodb.input.partitioner", "MongoPaginateByCountPartitioner")
  // Number of partitions in the resulting dataframe
  .config("spark.mongodb.input.partitionerOptions.MongoPaginateByCountPartitioner.numberOfPartitions", "1")
  .getOrCreate()

How to write a file to a Mongo collection with Spark Scala?

Code example

// Creating classes representing Documents
case class Address(building: String,coord :Array[Double],street:String, zipcode: String)
case class Restaurant(address: Address,borough :String,cuisine:String, name: String)
 
// Creating a dataframe containing documents
val dfRestaurants = Seq(Restaurant(Address("1480",Array(-73.9557413,40.7720266),"2 Avenue","10075"),"Manhattan","Italian","Vella"),Restaurant(Address("1007",Array(-73.856077,40.848447),"Morris Park Ave","10462"),"Bronx","Bakery","Morris Park Bake Shop")).toDF().coalesce(1)

// Writing dataframe in Mongo collection
MongoSpark.save(dfRestaurants.write.mode("overwrite"))
logger.info("Writing documents in Mongo : OK")

How to read documents from a Mongo collection with Spark Scala ?

Code example

// Reading Mongodb collection into a dataframe
val df = MongoSpark.load(sparkSession)
logger.info(df.show())
logger.info("Reading documents from Mongo : OK")