Open Csv Read File From Hadoop Filesystem
Recipe Objective: How to create a DataFrame from a JSON file, read data from DBFS, and write into DBFS?
In today's world, the data generated is extensive and complex, and involves structures such as lists, maps, and struct types. Data is also generated, processed, and transformed into various formats such as CSV, Parquet, Avro, and JSON. Of these, JSON is a very popular format for maintaining data. So, here we will learn about creating a DataFrame from a JSON file, reading data from DBFS, and writing into DBFS (Databricks File System).
Implementation Info:
- Databricks Community Edition
- Spark-Scala
- sample data file 1
- sample data file 2
- sample data file 3
- storage - Databricks File System (DBFS)
Step 1: Uploading data to DBFS
Follow the steps below to upload data files from local to DBFS
- Click Create in the Databricks menu
- Click Table in the drop-down menu; it will open a create new table UI
- In the UI, specify the folder name in which you want to save your files.
- Click browse to upload and upload files from local.
- The path is like /FileStore/tables/your folder name/your file
Refer to the image below for an example
Step 2: Read JSON File into DataFrame
Using spark.read.json("path") or spark.read.format("json").load("path"), you can read a JSON file into a Spark DataFrame. These methods accept a file path as an argument. In our use case, the file path is "/FileStore/tables/zipcode.json". Here we use the Databricks built-in function display() to view the data in the DataFrame. Unlike reading a CSV, the JSON data source infers the schema from the input file by default, so there is no need to set "inferSchema" to true.
//read single json file into dataframe
val df = spark.read.json("/FileStore/tables/zipcode.json")
df.printSchema()
display(df) //This method works only in a Databricks notebook.
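The two read forms mentioned in this step can be compared side by side. The following is a minimal sketch, not the recipe's own code: the scratch folder /tmp/json_read_sketch and the sample records are hypothetical, and the master("local[*]") setting is only there so the sketch can also run outside a notebook (in a Databricks notebook, getOrCreate() should simply return the existing session).

```scala
import org.apache.spark.sql.SparkSession

// In a Databricks notebook getOrCreate() returns the existing session;
// the master setting only matters for a local run.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("json-read-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch folder and records, written as one JSON object per line
val demoPath = "/tmp/json_read_sketch"
Seq(
  """{"City":"MESA","State":"AZ","Zipcode":85209}""",
  """{"City":"PARC PARQUE","State":"PR","Zipcode":704}"""
).toDS().write.mode("overwrite").text(demoPath)

// Short form and long form of the same read
val dfShort = spark.read.json(demoPath)
val dfLong = spark.read.format("json").load(demoPath)

// Both infer their schema from the data
dfShort.printSchema()
dfLong.printSchema()
```

Which form to use is mostly a matter of taste; the format(...).load(...) form is convenient when the format name comes from a variable or configuration.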
Step 3: Reading multiline JSON file.
Sometimes you may want to read records from JSON files in which a single record is spread across multiple lines. To read such files, set the multiline option to true. By default, the multiline option is set to false, so we need to specify option("multiline",true) explicitly.
//read multiline json file into dataframe
println("Read Data From Multiline Json")
val df2 = spark.read.option("multiline",true).json("/FileStore/tables/zip_multiline.json")
df2.printSchema()
df2.show()
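To see what the multiline option changes, the sketch below writes one pretty-printed record to a hypothetical scratch folder (/tmp/json_multiline_sketch) and reads it both ways. The record contents and all names here are assumptions made for illustration, and the local master setting is only needed outside a notebook.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multiline-sketch")
  .getOrCreate()
import spark.implicits._

// One JSON record spread over several lines (hypothetical sample)
val record =
  """{
    |  "City": "MESA",
    |  "State": "AZ",
    |  "Zipcode": 85209
    |}""".stripMargin

// Hypothetical scratch folder; a single row yields a single multi-line file
val demoPath = "/tmp/json_multiline_sketch"
Seq(record).toDS().coalesce(1).write.mode("overwrite").text(demoPath)

// Default (multiline = false): each physical line is parsed on its own,
// so the record is not recovered (typically only _corrupt_record appears)
val lineByLine = spark.read.json(demoPath)
lineByLine.printSchema()

// multiline = true: the whole file is parsed as one JSON document
val wholeFile = spark.read.option("multiline", true).json(demoPath)
wholeFile.printSchema()
wholeFile.show(false)
```

Because a multiline file has to be parsed as a whole, it is usually worth keeping data in the one-record-per-line layout when you control how it is produced.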
Step 4: Reading Multiple JSON Files
With the .json() method, you can also read multiple JSON files from different paths. Just pass all file names with their respective paths, separated by commas, as shown below.
We can also read all JSON files from a directory into a DataFrame just by passing the directory as a path to the json() method.
//reading multiple files
println("Reading from Multiple files")
val df3 = spark.read.json("/FileStore/tables/zipcode.json", "/FileStore/tables/zipcode2.json")
println("Record count "+df3.count())
df3.show(2)
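Reading several paths and reading a whole directory can be sketched together. The example below is an illustration under assumptions: the folders under /tmp/json_multi_sketch and the sample records are hypothetical, and the local master setting is only needed outside a notebook.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multi-file-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical scratch folders, each holding JSON-lines data
val baseDir = "/tmp/json_multi_sketch"
Seq(
  """{"City":"MESA","Zipcode":85209}""",
  """{"City":"CHANDLER","Zipcode":85224}"""
).toDS().write.mode("overwrite").text(s"$baseDir/part_a")
Seq(
  """{"City":"PARC PARQUE","Zipcode":704}"""
).toDS().write.mode("overwrite").text(s"$baseDir/part_b")

// Several paths in one call: the records of all paths end up in one DataFrame
val dfPaths = spark.read.json(s"$baseDir/part_a", s"$baseDir/part_b")
println("Record count from two paths " + dfPaths.count())

// A single directory: every data file directly under it is read
val dfDir = spark.read.json(s"$baseDir/part_a")
println("Record count from one directory " + dfDir.count())
```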
Step 5: Reading files with a custom schema
Spark Schema defines the structure of the data. In other words, it is the structure of the DataFrame. Spark SQL provides the StructType and StructField classes to specify the structure of the DataFrame programmatically. If you know the file schema ahead of time and do not want to rely on the default schema inference for column names and types, supply user-defined column names and types using the schema option.
A StructType object can be constructed by StructType(fields: Seq[StructField])
A StructField object can be constructed by StructField(java.lang.String name, DataType dataType, boolean nullable, Metadata metadata)
While creating a DataFrame, we can specify its structure by using StructType and StructField. StructType is a collection of StructFields, each of which defines a column name, a data type, and a flag for whether the column is nullable. Using StructField, we can also add a nested struct schema, ArrayType for arrays, and MapType for key-value pairs.
import org.apache.spark.sql.types._

println("Read Json file By defining custom schema")
val schema = new StructType()
  .add("City",StringType,true)
  .add("RecordNumber",IntegerType,true)
  .add("State",StringType,true)
  .add("ZipCodeType",StringType,true)
  .add("Zipcode",LongType,true)
//pass the custom schema to the reader so it is used instead of schema inference
val df_with_schema = spark.read.schema(schema).option("multiline",true).json("/FileStore/tables/zip_multiline.json")
df_with_schema.printSchema()
df_with_schema.show(false)
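The nested struct, ArrayType, and MapType cases mentioned above can be sketched as follows. The field names Location, Zipcodes, and Tags and the sample record are hypothetical and are not part of the recipe's sample files; the record is read from an in-memory Dataset[String] so that the sketch needs no input file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-schema-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical schema: a nested struct, an array column and a map column
val nestedSchema = new StructType()
  .add("City", StringType, true)
  .add("Location", new StructType()
    .add("Lat", DoubleType, true)
    .add("Long", DoubleType, true), true)
  .add("Zipcodes", ArrayType(LongType), true)
  .add("Tags", MapType(StringType, StringType), true)

// Made-up record matching the schema above, one JSON object per element
val records = Seq(
  """{"City":"MESA","Location":{"Lat":1.5,"Long":2.5},"Zipcodes":[85209,85210],"Tags":{"type":"STANDARD"}}"""
).toDS()

// Apply the schema instead of letting Spark infer one
val dfNested = spark.read.schema(nestedSchema).json(records)
dfNested.printSchema()
dfNested.show(false)
```

Without the explicit schema, Spark would infer Tags as a struct with a single field rather than a map, which is one reason to declare map columns by hand.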
Step 6: Writing DataFrame into DBFS (Databricks File System)
Here, we are writing the DataFrame into DBFS, into the spark_training folder created by me. Using Databricks removes the need for a custom VM for Spark and HDFS. I am using Databricks filesystem commands to view the contents of the folder after writing into DBFS.
Spark DataFrameWriter also has a method mode() to specify the SaveMode; the argument to this method is either one of the strings below or a constant from the SaveMode class.
overwrite – This mode is used to overwrite the existing file; alternatively, you can use SaveMode.Overwrite.
append – To add the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – Ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists or error – This is the default option. When the file already exists, it returns an error; alternatively, you can use SaveMode.ErrorIfExists.
import org.apache.spark.sql.SaveMode

//writing the file into DBFS
df.write.mode(SaveMode.Overwrite).format("json").save("/FileStore/tables/spark_training/")

//DBFS after writing the DataFrame
display(dbutils.fs.ls("/FileStore/tables/spark_training/"))
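The four save modes can be tried one after another using their string names. This is a minimal sketch under assumptions: the output folder /tmp/save_mode_sketch and the two sample rows are hypothetical, and the local master setting is only needed outside a notebook.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("save-mode-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical output folder and a small two-row DataFrame
val outPath = "/tmp/save_mode_sketch"
val dfOut = Seq(("MESA", 85209L), ("PARC PARQUE", 704L)).toDF("City", "Zipcode")

dfOut.write.mode("overwrite").json(outPath) // replaces anything already there
dfOut.write.mode("append").json(outPath)    // adds new files next to the existing ones
dfOut.write.mode("ignore").json(outPath)    // target exists, so nothing is written

// errorifexists (the default) refuses to write into an existing target
val defaultAttempt = Try(dfOut.write.mode("errorifexists").json(outPath))

val rowCount = spark.read.json(outPath).count()
println(s"rows after overwrite + append + ignore: $rowCount")
println(s"errorifexists failed: ${defaultAttempt.isFailure}")
```

The string names and the SaveMode constants are interchangeable; the constants have the advantage that a typo is caught at compile time.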
Conclusion
We have learned to read JSON files with single-line and multiline records into a Spark DataFrame. We have also learned to read single and multiple files at once and to write a DataFrame back to JSON files using different save options.
Source: https://www.projectpro.io/recipes/create-dataframe-from-json-file-read-data-from-dbfs-and-write-into-dbfs