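Source S3 location with tens of JSON files
part-0000...
files. Is there a better option than the one below?
I have the following questions about the design above
Posted on 2020-05-05 14:43:06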
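import java.io.File
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Merge the source JSON into a single file under a temporary target.
spark.read
  .json(sourcePath)
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .json(tempTarget1)

// Delete the source folder, then move the merged output into its place.
val fs = FileSystem.get(new URI(s"s3a://$bucketName"), sc.hadoopConfiguration)
val deleted = fs.delete(new Path(sourcePath + File.separator), true)
logger.info(s"S3 folder path deleted=${deleted} sparkUuid=$sparkUuid path=${sourcePath}")
val renamed = fs.rename(new Path(tempTarget1), new Path(sourcePath))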
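Tried this, but it failed:
Whenever cachedDf.write runs, Spark goes back to check the S3 files, which I had cleaned up manually before writing.
Posted on 2020-05-01 17:26:17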
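Yes, skipping #2 is possible. With SaveMode.Overwrite you can write to the same location you read from.
When you first read the JSON (i.e. #1) into a DataFrame, caching it keeps the data in memory. After that you can do the cleanup, combine all the JSON data with a union, and store the result as Parquet in one step, as in this example.
Case 1: All the JSON files are in different folders, and you want the final data stored as Parquet in the same location as the JSON files.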
val dfpath1 = spark.read.json("path1")
val dfpath2 = spark.read.json("path2")
val dfpath3 = spark.read.json("path3")
// cleanup1, cleanup2 and cleanup3 stand for your own functions of type DataFrame => DataFrame
val df1 = cleanup1(dfpath1)
val df2 = cleanup2(dfpath2)
val df3 = cleanup3(dfpath3)
val dfs = Seq(df1, df2, df3)
val finaldf = dfs.reduce(_ union _) // all DataFrames must have the same schema for union
finaldf.write.mode(SaveMode.Overwrite).parquet("final_file with samelocations json.parquet")
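The schema comment on the union above matters because union matches columns by position, not by name. If the cleaned DataFrames have the same columns in a different order, unionByName can be used instead. A minimal sketch, assuming Spark 2.3 or later and a local SparkSession; the object name and the unionAll helper are hypothetical, not part of Spark:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameSketch {
  // Hypothetical helper: combines DataFrames by column name rather than by position.
  def unionAll(dfs: Seq[DataFrame]): DataFrame = {
    require(dfs.nonEmpty, "need at least one DataFrame")
    dfs.reduce(_ unionByName _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("union-sketch").getOrCreate()
    import spark.implicits._

    // Same columns, different order: a plain union would pair id with name.
    val a = Seq((1, "x")).toDF("id", "name")
    val b = Seq(("y", 2)).toDF("name", "id")

    unionAll(Seq(a, b)).show(false)
    spark.stop()
  }
}
```

The result keeps the column order of the first DataFrame in the sequence.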
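Case 2: All the JSON files are in the same folder, and you want the final data stored as multiple Parquet files under the same root location as the JSON files.
In this case there is no need to read the data into several DataFrames; you can pass the root path, as long as the JSON files under it share the same schema.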
val dfpath1 = spark.read.json("rootpathofyourjsons with same schema")
// or pass several paths: spark.read.json("path1", "path2", "path3")
// the DataFrame reader supports this through def json(paths: String*)
val finaldf = cleanup1(dfpath1) // cleanup1 is your own function that returns a DataFrame
finaldf.write.mode(SaveMode.Overwrite).parquet("final_file with sameroot locations json.parquet")
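When the list of paths is only known at runtime, a Seq can be expanded into the varargs overload mentioned in the comment with `: _*`. A minimal sketch; readJson and the object name are hypothetical helpers, not Spark APIs:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadManyJsonSketch {
  // Hypothetical helper: reads every path in `paths` into one DataFrame
  // by expanding the Seq into DataFrameReader.json(paths: String*).
  def readJson(spark: SparkSession, paths: Seq[String]): DataFrame = {
    require(paths.nonEmpty, "need at least one path")
    spark.read.json(paths: _*)
  }
}
```

For example, readJson(spark, Seq("path1", "path2", "path3")) would be equivalent to the three-argument call in the comment, under the same assumption that all files share one schema.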
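AFAIK, in either case no separate AWS calls are needed.
Update: Regarding the FileNotFoundException you are facing, see the code example below for how to avoid it. I reused the example you showed me here.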
import org.apache.spark.sql.functions._
val df = Seq((1, 10), (2, 20), (3, 30)).toDS.toDF("sex", "date")
df.show(false)
df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read it back
val df2 = df1.withColumn("cleanup", lit("Quick silver want to cleanup")) // stands in for the cleanup you described
// The next two steps are important: `cache`, then force a light action such as `show`.
// Without them the overwrite below fails with FileNotFoundException.
df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // action
df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("quick silver saved in same directory where he read it from final records he saved after clean up are ")
df2.show(false)
Result:
+---+----+
|sex|date|
+---+----+
|1  |10  |
|2  |20  |
|3  |30  |
+---+----+
+---+----+----------------------------+
|sex|date|cleanup                     |
+---+----+----------------------------+
|1  |10  |Quick silver want to cleanup|
|2  |20  |Quick silver want to cleanup|
+---+----+----------------------------+
only showing top 2 rows
quick silver saved in same directory where he read it from final records he saved after clean up are
+---+----+----------------------------+
|sex|date|cleanup                     |
+---+----+----------------------------+
|1  |10  |Quick silver want to cleanup|
|2  |20  |Quick silver want to cleanup|
|3  |30  |Quick silver want to cleanup|
+---+----+----------------------------+
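The cache-then-action pattern above can be wrapped in a small helper. This is only a sketch under stated assumptions: rewriteInPlace and the object name are hypothetical, the data is Parquet, and it uses count rather than show so that every partition is materialized in the cache before the overwrite deletes the source files. Caching is not a durability guarantee (cached partitions lost with an executor would have to be recomputed from files that no longer exist), and behavior may differ across Spark versions and storage back ends.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object InPlaceOverwriteSketch {
  // Hypothetical helper, not a Spark API: reads Parquet from `path`, applies
  // `cleanup`, and overwrites the same path with the cleaned data.
  // Returns the number of rows that were cached before the write.
  def rewriteInPlace(spark: SparkSession, path: String)(cleanup: DataFrame => DataFrame): Long = {
    val cleaned = cleanup(spark.read.parquet(path)).cache()
    try {
      // Full action: forces every partition into the cache before the
      // overwrite removes the files the DataFrame was read from.
      val rows = cleaned.count()
      cleaned.write.mode(SaveMode.Overwrite).parquet(path)
      rows
    } finally {
      cleaned.unpersist() // release the cached data once the write has finished or failed
    }
  }
}
```

A call such as InPlaceOverwriteSketch.rewriteInPlace(spark, somePath)(df => df.withColumn("cleanup", lit("..."))) would mirror the steps of the example above, with somePath standing for the directory being rewritten.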
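[Screenshot: the file saved, read back, cleaned up, and saved again]
Note: You need to implement Case 1 or Case 2 following the approach suggested in the update above.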
https://stackoverflow.com/questions/61546787