KLOOK Travel (KLOOK Travel, Based on Apache)

2023-12-08 12:00
Longquan editor (龙泉小编)

1. Business Background

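KLOOK is an online travel platform focused on aggregating travel resources at overseas destinations. It offers booking for attraction tickets, day tours, distinctive experiences, local transport, and dining. It covers 100 countries and regions, supports 12 languages and a payment system handling 41 currencies, works closely with more than 10,000 merchant partners, and offers travelers worldwide more than 100,000 bookable travel experiences.

Synchronizing RDS data into the KLOOK data warehouse is a typical ingestion-layer requirement for an internet e-commerce company. More than 60% of the warehouse's data comes directly from business databases. A large share of these are managed AWS RDS MySQL databases, with over 100 databases/instances. Data arriving from RDS becomes the warehouse's ODS layer after standardized cleansing.

The company previously used a third-party commercial tool for synchronization. It was limited to one sync every 8 hours, which could not meet the business's data-freshness requirements. After research and a series of PoC validations, the data team chose a Debezium + Kafka + Flink + Hudi pipeline for the ODS layer. Data now lands in the lake within seconds, and the warehouse can serve more business scenarios on top of the near-real-time ODS layer.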

2. Architecture Improvements

2.1 Architecture Before the Migration


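The whole pipeline depended on third-party services. Google Alooma performed full and incremental RDS synchronization and consolidated the raw tables every 8 hours. A data flow job then loaded the result into the warehouse ODS layer every 24 hours.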

2.2 New Architecture


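1. Use the AWS DMS data migration tool to sync the full RDS MySQL data to S3 storage;
2. Run a Flink SQL batch job to bulk-write the S3 data into Hudi tables;
3. Set up Debezium MySQL binlog subscription tasks to stream binlog data to Kafka in real time;
4. Start two streaming jobs with Flink SQL: one writes the data to Hudi in real time, the other appends it to S3. The binlog files on S3 are kept for 30 days for data backfill;
5. Use the hive-hudi metadata sync tools to sync the Hudi catalog data to Hive, and serve OLAP queries through Hive/Trino.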
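To make steps 2, 4, and 5 more concrete, the following is a minimal Python sketch that generates a Flink SQL `CREATE TABLE` statement for a Hudi table with Hive sync turned on. It is an illustration under stated assumptions, not the configuration KLOOK actually uses: the function name `hudi_table_ddl`, the column list, the S3 path, and the metastore URI are all hypothetical, and the Hudi option names (`precombine.field`, `hive_sync.*`) differ between Hudi releases and should be checked against the version in use.

```python
def hudi_table_ddl(database, table, columns, record_key, precombine_field,
                   base_path, metastore_uri):
    """Build a Flink SQL DDL string for a Hudi table (hypothetical helper).

    columns is a list of (column_name, flink_sql_type) pairs.
    """
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    # Option names are assumptions based on the Hudi Flink connector;
    # verify them against the Hudi release you deploy.
    options = {
        "connector": "hudi",
        "path": f"{base_path.rstrip('/')}/{database}/{table}",
        "table.type": "MERGE_ON_READ",
        "precombine.field": precombine_field,
        "hive_sync.enable": "true",
        "hive_sync.mode": "hms",
        "hive_sync.metastore.uris": metastore_uri,
        "hive_sync.db": database,
        "hive_sync.table": table,
    }
    option_sql = ",\n  ".join(f"'{key}' = '{value}'" for key, value in options.items())
    return (
        f"CREATE TABLE `{database}`.`{table}` (\n"
        f"  {column_sql},\n"
        # The primary key acts as the Hudi record key for upserts.
        f"  PRIMARY KEY (`{record_key}`) NOT ENFORCED\n"
        f") WITH (\n  {option_sql}\n)"
    )


if __name__ == "__main__":
    print(hudi_table_ddl(
        database="test_db_ods",            # hypothetical database name
        table="orders",                    # hypothetical table name
        columns=[("id", "BIGINT"), ("status", "STRING"), ("updated_at", "TIMESTAMP(3)")],
        record_key="id",
        precombine_field="updated_at",
        base_path="s3://example-bucket/hudi",         # hypothetical path
        metastore_uri="thrift://localhost:9083",      # hypothetical URI
    ))
```

Generating the DDL from a template keeps the per-table differences (columns, key, path) in data rather than in hand-written SQL, which matters when more than 100 databases/instances have to be onboarded.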

2.3 Benefits of the New Architecture

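• Greater flexibility in data use and development. The third-party sync service imposed obvious constraints, while the new architecture is easy to extend and can provide real-time synchronized data to other business uses;
• Data latency is resolved. With real-time writes through Flink on Hudi, RDS ingestion into the warehouse drops to minutes or even seconds, so inventory, risk-control, and order data can be queried and analyzed sooner. Overall, the former consolidation delay of nearly 8 hours is reduced to 5 minutes;
• Costs are more controllable. With the storage-compute separation of Flink on Hudi, costs can be managed through the compute resource quota for synchronization, the flush interval at which synced tables are persisted, and hot/cold archiving of stored data. Compared with the third-party service, the overall cost is expected to drop by 40%.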

3. Key Practices

3.1 Debezium Incremental Binlog Sync Configuration

Key Kafka Connect configuration

bootstrap.servers=localhost:9092

# unique name for the cluster, used in forming the Connect cluster group. Note that this must not conflict with consumer group IDs
group.id=connect-cluster

# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
key.converter.schemas.enable=true
value.converter.schemas.enable=true

# Topic to use for storing offsets. This topic should have many partitions and be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
offset.storage.topic=connect-offsets

# Topic to use for storing connector and task configurations; note that this should be a single partition, highly replicated,
# and compacted topic. Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
config.storage.topic=connect-configs

# Topic to use for storing statuses. This topic can have multiple partitions and should be replicated and compacted.
# Kafka Connect will attempt to create the topic automatically when needed, but you can always manually create
# the topic before starting Kafka Connect if a specific topic configuration is needed.
# Most users will want to use the built-in default replication factor of 3 or in some cases even specify a larger value.
# Since this means there must be at least as many brokers as the maximum replication factor used, we'd like to be able
# to run this example on a single-broker cluster and so here we instead set the replication factor to 1.
status.storage.topic=connect-status
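Before starting a distributed Connect worker, it can help to confirm that the settings the configuration above relies on are actually present. The sketch below is a small, hypothetical checker (the names `parse_properties`, `missing_worker_keys`, and `REQUIRED_KEYS` are not part of Kafka); it assumes the properties file has one setting per line and `#` comments.

```python
# Settings the worker configuration above depends on (illustrative list).
REQUIRED_KEYS = (
    "bootstrap.servers",
    "group.id",
    "key.converter",
    "value.converter",
    "offset.storage.topic",
    "config.storage.topic",
    "status.storage.topic",
)


def parse_properties(text):
    """Parse simple key=value properties text, skipping blanks and # comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


def missing_worker_keys(props):
    """Return the required keys that the parsed properties do not define."""
    return [key for key in REQUIRED_KEYS if key not in props]


if __name__ == "__main__":
    sample = """
    bootstrap.servers=localhost:9092
    # comment lines are ignored
    group.id=connect-cluster
    offset.storage.topic=connect-offsets
    """
    print(missing_worker_keys(parse_properties(sample)))
```

The `offset.storage.topic` value matters later in this section, because the offset record is written to that topic by hand.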

Query the most recent MySQL binlog file information

MySQL [(none)]> show binary logs;
| mysql-bin-changelog.094531 | 176317 |
| mysql-bin-changelog.094532 | 191443 |
| mysql-bin-changelog.094533 | 1102466 |
| mysql-bin-changelog.094534 | 273347 |
| mysql-bin-changelog.094535 | 141555 |
| mysql-bin-changelog.094536 | 4808 |
| mysql-bin-changelog.094537 | 146217 |
| mysql-bin-changelog.094538 | 29607 |
| mysql-bin-changelog.094539 | 141260 |
+----------------------------+-----------+
MySQL [(none)]> show binlog events in 'mysql-bin-changelog.094539';
MySQL [(none)]> show binlog events in 'mysql-bin-changelog.094539' limit 10;
+----------------------------+-----+----------------+------------+-------------+---------------------------------------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+----------------------------+-----+----------------+------------+-------------+---------------------------------------------------------------------------+
| mysql-bin-changelog.094539 | 4 | Format_desc | 1399745413 | 123 | Server ver: 5.7.31-log, Binlog ver: 4 |
| mysql-bin-changelog.094539 | 123 | Previous_gtids | 1399745413 | 194 | 90710e1c-f699-11ea-85c0-0ec6a6bed381:1-108842347 |
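If the binlog file has to be picked programmatically, the table printed by `show binary logs` can be parsed from the client output. The sketch below is a hypothetical helper (`latest_binlog` is not a MySQL or Debezium function). It assumes one table row per line and that the last row is the most recent file, which is how the listing above is ordered.

```python
def latest_binlog(show_binary_logs_output):
    """Return (file_name, file_size) for the last row of `show binary logs` output."""
    rows = []
    for line in show_binary_logs_output.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Data rows have exactly two cells and a numeric size;
        # border lines and the header row are skipped.
        if len(cells) == 2 and cells[1].isdigit():
            rows.append((cells[0], int(cells[1])))
    if not rows:
        raise ValueError("no binlog rows found in the given output")
    return rows[-1]


if __name__ == "__main__":
    output = """
    | mysql-bin-changelog.094538 |     29607 |
    | mysql-bin-changelog.094539 |    141260 |
    +----------------------------+-----------+
    """
    print(latest_binlog(output))
```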

Send an offset record, keyed by the server name, to offset.storage.topic

$ ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic connect-offsets --property "parse.key=true" --property "key.separator=>"
>["test_servername",{"server":"test_servername"}]>{"ts_sec":1647845014,"file":"mysql-bin-changelog.007051","pos":74121553,"row":1,"server_id":1404217221,"event":2}
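The line typed into the console producer is a key and a value joined by the configured `key.separator`. As a minimal sketch, the helper below assembles that line from its parts so the binlog file and position are not edited by hand. The function name `build_offset_record` is hypothetical, and the field layout simply mirrors the record shown above; it should be compared against an offset record written by your own Debezium version before use.

```python
import json


def build_offset_record(connector_name, server_name, binlog_file, pos,
                        ts_sec, server_id, row=1, event=2, separator=">"):
    """Build a 'key<separator>value' line for kafka-console-producer (hypothetical helper)."""
    # Key: connector name plus the source partition, as in the record above.
    key = json.dumps([connector_name, {"server": server_name}], separators=(",", ":"))
    # Value: the binlog position the connector should resume from.
    value = json.dumps(
        {
            "ts_sec": ts_sec,
            "file": binlog_file,
            "pos": pos,
            "row": row,
            "server_id": server_id,
            "event": event,
        },
        separators=(",", ":"),
    )
    return f"{key}{separator}{value}"


if __name__ == "__main__":
    print(build_offset_record(
        connector_name="test_servername",
        server_name="test_servername",
        binlog_file="mysql-bin-changelog.007051",
        pos=74121553,
        ts_sec=1647845014,
        server_id=1404217221,
    ))
```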

Compose the task API request and start the Debezium task

{
  "name": "test_servername",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "snapshot.locking.mode": "none",
    "database.user": "db_user",
    "transforms.Reroute.type": "io.debezium.transforms.ByLogicalTableRouter",
    "database.server.id": "1820615119",
    "database.history.kafka.bootstrap.servers": "localhost:9092",
    "database.history.kafka.topic": "history-topic",
    "inconsistent.schema.handling.mode": "skip",
    "transforms": "Reroute", // route all binlog data to a single topic; the default is one topic per table
    "database.server.name": "test_servername",
    "transforms.Reroute.topic.regex": "test_servername(.*)",
    "database.port": "3306",
    "include.schema.changes": "true",
    "transforms.Reroute.topic.replacement": "binlog_data_topic",
    "table.exclude.list": "table_test",
    "database.hostname": "host",
    "database.password": "******",
    "name": "test_servername",
    "database.whitelist": "test_db",
    "database.include.list": "test_db",
    "snapshot.mode": "schema_only_recovery" // use recovery mode to sync from the offset in the specified binlog file
  }
}

The // comments are annotations only. JSON does not allow comments, so remove them before sending the request.
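The request body above is submitted to the Kafka Connect REST API, which creates a connector on `POST /connectors`. The following sketch builds that request with the Python standard library. The worker address `http://localhost:8083` and the function names are assumptions for illustration, and the shortened `config` dictionary stands in for the full configuration shown above.

```python
import json
import urllib.request


def build_create_connector_request(connect_url, name, config):
    """Build (but do not send) a POST /connectors request for Kafka Connect."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{connect_url.rstrip('/')}/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(request, timeout=10):
    """Send the request and return (HTTP status, decoded JSON response)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    request = build_create_connector_request(
        "http://localhost:8083",  # assumed Connect worker address
        "test_servername",
        {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "snapshot.mode": "schema_only_recovery",
        },
    )
    # Call submit(request) against a running Connect worker to create the connector.
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before the connector is started.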

3.2 Writing Incremental Data into Hudi on Top of the Full Load

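When the Hudi table already holds the full data set, the binlog data subsequently consumed from Kafka has to be upserted into it incrementally. Debezium's binlog format carries the change information for every updated row, so each event must first be parsed into a row that can be written directly.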
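As a rough illustration of the parsing step just described, the sketch below turns one Debezium change event into a flat row suitable for an upsert. It assumes the standard Debezium envelope (`payload.op`, `payload.before`, `payload.after`, `payload.source.ts_ms`) produced with the JSON converter. The function name and the `_is_deleted` and `_ts_ms` columns are hypothetical; they are not the article's actual column names.

```python
import json


def debezium_event_to_row(message):
    """Convert a Debezium JSON change event into an upsert-ready row, or None."""
    event = json.loads(message)
    # With schemas.enable=true the change data sits under "payload".
    payload = event.get("payload", event)
    op = payload.get("op")
    if op is None:
        # Not a row-level change (for example a schema change event).
        return None
    if op == "d":
        # Deletes carry the last known row image in "before".
        row, deleted = dict(payload["before"]), True
    elif op in ("c", "u", "r"):
        # Creates, updates, and snapshot reads carry the new image in "after".
        row, deleted = dict(payload["after"]), False
    else:
        raise ValueError(f"unsupported Debezium op: {op!r}")
    row["_is_deleted"] = deleted                  # hypothetical delete marker
    row["_ts_ms"] = payload["source"]["ts_ms"]    # hypothetical ordering field
    return row


if __name__ == "__main__":
    sample = json.dumps({
        "payload": {
            "op": "u",
            "before": {"id": 1, "status": "new"},
            "after": {"id": 1, "status": "paid"},
            "source": {"table": "orders", "ts_ms": 1647845014000},
        }
    })
    print(debezium_event_to_row(sample))
```

A source timestamp such as `_ts_ms` gives the writer a field to decide which of two versions of the same key is newer when events are replayed.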

Example Python code that generates the parsing Flink SQL

# Write data to the ODS raw table
insert_hudi_raw_query = '''
INSERT INTO {0}_ods_raw.{1}
SELECT {2}
FROM {0}_debezium_kafka.kafka_rds_{1}_log
WHERE REGEXP(GET_JSON_OBJECT(payload, '$.source.table'), '^{3}


https://www.lqrc.cn/a/gongsi/86716.html
