When I run an AWS Glue crawler, it does not recognize my timestamp columns.
My CSV files contain correctly formatted ISO8601 timestamps. I first hoped Glue would classify them as timestamps automatically, but it does not.
I also tried a custom timestamp classifier, following https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html.
Here is what my classifier looks like: a grok classifier with classification "timestamp" and grok pattern %{TIMESTAMP_ISO8601:timestamp}.
That also fails to classify my timestamps correctly.
For example, I put my data into the grok debugger (https://grokdebug.herokuapp.com/):
id,iso_8601_now,iso_8601_yesterday
0,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
1,2019-05-16T22:47:33.409056,2019-05-15T22:47:33.409056
Both of these patterns match:
%{TIMESTAMP_ISO8601:timestamp}
%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
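To double-check outside the grok debugger, here is a minimal Python sketch using an ISO8601 regex that roughly approximates %{TIMESTAMP_ISO8601} (the regex is my own approximation for illustration, not the exact pattern Glue uses internally); both sample values match:

import re

# Rough stand-in for the TIMESTAMP_ISO8601 grok pattern (an approximation,
# not Glue's internal pattern).
ISO8601 = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:?\d{2}(?::?\d{2}(?:\.\d+)?)?"  # date 'T' time
    r"(?:Z|[+-]\d{2}:?\d{2})?"                               # optional timezone
)

for value in ("2019-05-16T22:47:33.409056", "2019-05-15T22:47:33.409056"):
    print(value, bool(ISO8601.fullmatch(value)))  # prints True for both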
The test file was generated with this script:

import csv
from datetime import datetime, timedelta

# Write 1000 rows of ISO8601 timestamps: the current time and 24 hours earlier.
with open("timestamp_test.csv", "w", newline="") as f:
    w = csv.writer(f, delimiter=",")
    w.writerow(["id", "iso_8601_now", "iso_8601_yesterday"])
    for i in range(1000):
        w.writerow([i, datetime.utcnow().isoformat(),
                    (datetime.utcnow() - timedelta(days=1)).isoformat()])
I expected AWS Glue to classify the iso_8601 columns as timestamps automatically. Even with the custom grok classifier added, it still does not classify either column as a timestamp.
Both columns are classified as string.
The classifier is active on the crawler.
Crawler output for the timestamp_test table:
{
  "StorageDescriptor": {
    "cols": {
      "FieldSchema": [
        {
          "name": "id",
          "type": "bigint",
          "comment": ""
        },
        {
          "name": "iso_8601_now",
          "type": "string",
          "comment": ""
        },
        {
          "name": "iso_8601_yesterday",
          "type": "string",
          "comment": ""
        }
      ]
    },
    "location": "s3://REDACTED/_csv_timestamp_test/",
    "inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
    "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "compressed": "false",
    "numBuckets": "-1",
    "SerDeInfo": {
      "name": "",
      "serializationLib": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "parameters": {
        "field.delim": ","
      }
    },
    "bucketCols": [],
    "sortCols": [],
    "parameters": {
      "skip.header.line.count": "1",
      "sizeKey": "58926",
      "objectCount": "1",
      "UPDATED_BY_CRAWLER": "REDACTED",
      "CrawlerSchemaSerializerVersion": "1.0",
      "recordCount": "1227",
      "averageRecordSize": "48",
      "CrawlerSchemaDeserializerVersion": "1.0",
      "compressionType": "none",
      "classification": "csv",
      "columnsOrdered": "true",
      "areColumnsQuoted": "false",
      "delimiter": ",",
      "typeOfData": "file"
    },
    "SkewedInfo": {},
    "storedAsSubDirectories": "false"
  },
  "parameters": {
    "skip.header.line.count": "1",
    "sizeKey": "58926",
    "objectCount": "1",
    "UPDATED_BY_CRAWLER": "REDACTED",
    "CrawlerSchemaSerializerVersion": "1.0",
    "recordCount": "1227",
    "averageRecordSize": "48",
    "CrawlerSchemaDeserializerVersion": "1.0",
    "compressionType": "none",
    "classification": "csv",
    "columnsOrdered": "true",
    "areColumnsQuoted": "false",
    "delimiter": ",",
    "typeOfData": "file"
  }
}
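(This table definition can also be pulled straight from the Data Catalog; a minimal boto3 sketch, assuming a database named "default" and the table name "timestamp_test", both placeholders:)

import json
import boto3

glue = boto3.client("glue")
# "default" and "timestamp_test" are placeholder names; substitute your own.
table = glue.get_table(DatabaseName="default", Name="timestamp_test")["Table"]
print(json.dumps(table["StorageDescriptor"], indent=2, default=str))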
Posted on 2019-05-19 14:41:14
If you must use the ISO8601 format, add this SerDe parameter: 'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS'.
You can alter the table through Glue (1) (see the boto3 sketch after the DDL) or recreate it from Athena (2):
CREATE EXTERNAL TABLE `table1`(
  `id` bigint,
  `iso_8601_now` timestamp,
  `iso_8601_yesterday` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSSSS')
LOCATION
  's3://REDACTED/_csv_timestamp_test/'
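For option (1), the SerDe parameter can be added in place with boto3; a minimal sketch, assuming a database named "default" and the table name "timestamp_test" (placeholders), and stripping the read-only fields that get_table returns but update_table rejects:

import boto3

glue = boto3.client("glue")
db, name = "default", "timestamp_test"  # placeholder names; substitute your own

table = glue.get_table(DatabaseName=db, Name=name)["Table"]

# Tell LazySimpleSerDe which timestamp format to try when parsing
# (the \' escaping is only needed inside the SQL string, not here).
table["StorageDescriptor"]["SerdeInfo"]["Parameters"]["timestamp.formats"] = (
    "yyyy-MM-dd'T'HH:mm:ss.SSSSSS")

# Switch the two string columns over to timestamp.
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"].startswith("iso_8601"):
        col["Type"] = "timestamp"

# Drop read-only fields before sending the definition back as TableInput.
for key in ("DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
            "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"):
    table.pop(key, None)

glue.update_table(DatabaseName=db, TableInput=table)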
Posted on 2019-09-09 17:59:31