Data Warehouse Layered Design and Data Synchronization Issues (220728)
2022-07-29 08:08:00 【啊六六六】

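TODO: draw the technical architecture diagram?
Hadoop containers: the most common problem is a process that fails to start
50070 (HDFS NameNode web UI in Hadoop 2.x)
8088 (YARN ResourceManager web UI)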



Full sync, full overwrite, new-rows sync, new-and-updated-rows sync
Snapshot tables, full tables, incremental tables, zipper tables

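> and >>
Overwrite and append
Redirection: point input or output in a new direction
>: output redirection (overwrites the target file); >>: output redirection (appends to it)
<: input redirection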
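
The difference between the three operators can be seen in a small sketch like the one below. The file name `demo.log` and the temporary directory are assumptions made only for this demo.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so the demo does not touch real files.
demo_dir=$(mktemp -d)
out_file="${demo_dir}/demo.log"   # hypothetical file name

echo "first"  >  "${out_file}"    # > creates the file, or overwrites it if it exists
echo "second" >  "${out_file}"    # overwrites again: only "second" remains
echo "third"  >> "${out_file}"    # >> appends a new line after "second"

# < feeds the file to the command's standard input
line_count=$(wc -l < "${out_file}")
line_count=$((line_count))        # drop the padding some wc implementations add
first_line=$(head -n 1 < "${out_file}")

echo "lines: ${line_count}, first line: ${first_line}"
rm -rf "${demo_dir}"
```

After the three `echo` commands the file holds two lines, because the second `>` discarded the first write.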

ephemeral: short-lived, temporary
The file itself counts as one replica

Combiner: map-side pre-aggregation, as Spark also does
Join: shuffle join on the reduce side

Support

Constraints

Primary key, unique, not null, foreign key, default value

Dimension tables are not that large
Dimensions: small data volume, rarely change
So they are fully overwritten on every sync



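Dimension degeneration: fold a dimension into the fact table
Not every dimension can be degenerated
Purpose of dimension degeneration: fewer dimension tables and fewer joins, which improves performance
Drawback of dimension degeneration: more redundancy
A cascading province/city/county/township hierarchy cannot be degenerated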

Dimensional modeling workflow: business research
Choosing the business process: business research, data research

-m: number of parallel map tasks for a Sqoop import

Ask the administrator for the connection address

--fields-terminated-by "\001" \
"\001" (Ctrl-A) is Hive's default field delimiter
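
To make the `\001` delimiter less abstract, here is a minimal sketch that builds one row with that byte between fields and splits it again with standard tools. The row content (id, name, city) is invented for the demo; only the delimiter itself comes from the notes above.

```shell
#!/usr/bin/env bash
# Build one hypothetical row with three fields separated by the \001 (Ctrl-A) byte.
# printf interprets \001 in its format string as an octal escape.
row_file=$(mktemp)
printf '1001\001alice\001beijing\n' > "${row_file}"

# $'\001' is bash ANSI-C quoting for the same byte; cut uses it as the field delimiter.
second_field=$(cut -d $'\001' -f 2 < "${row_file}")

# awk also accepts the octal escape as its field separator; NF is the number of fields.
field_count=$(awk -F '\001' '{ print NF }' < "${row_file}")

echo "second field: ${second_field}, field count: ${field_count}"
rm -f "${row_file}"
```

Because `\001` is a non-printing byte, it almost never collides with characters inside the data, which is a common reason for choosing it over a comma or tab.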

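Hive: builds a mapping between an HDFS directory and a Hive table
location: specifies the HDFS path that the Hive table maps to
Not specified: defaults to /user/hive/warehouse
Specified: the table's HDFS directory is exactly the given directory
Purpose: the directory that stores the Hive table's data
Query: Hive reads the mapped HDFS directory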

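Automated table creation depends on the Schema files that Sqoop generates
This presumably has a precondition: the Oracle table structure must be usable in Hive
Sqoop converts the types automatically, and Hive supports this format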



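read: reads data and stores it in a variable
Linux: by default both input and output go through the command line
If you do not want output on the command line, use output redirection
On Linux: x??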
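
The `read` plus input redirection pattern that the import scripts rely on can be tried in isolation. This is a sketch under assumptions: the two table names and the temporary list file are placeholders, not the real table list.

```shell
#!/usr/bin/env bash
# Hypothetical table list, one table name per line.
list_file=$(mktemp)
printf '%s\n' ciss_base_areas ciss_base_device > "${list_file}"

tables=()
# read takes one line from standard input and stores it in the variable p.
# The input redirection after "done" makes the loop read the file instead of the terminal.
while read -r p; do
    tables+=("${p}")
done < "${list_file}"

echo "read ${#tables[@]} tables, first: ${tables[0]}"
rm -f "${list_file}"
```

The loop ends when `read` reaches the end of the file and returns a non-zero status.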

^^: converts the table name to upper case, as in ${p^^}
Backslash: escape character
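
A short sketch of both notations follows. It assumes bash 4 or newer, since `${var^^}` is not available in older versions, and the table name is a placeholder.

```shell
#!/usr/bin/env bash
# ${var^^} requires bash 4 or newer.
p="ciss_base_areas"          # hypothetical lower-case table name
upper="${p^^}"               # upper-cases every character of the value

# A trailing backslash escapes the newline, so one command can span several lines.
joined=$(echo "table:" \
              "${upper}")

# A backslash before $ keeps the dollar sign literal instead of expanding the variable.
literal="\${p}"

echo "${joined}"
echo "${literal}"
```

The single-caret form `${p^}` upper-cases only the first character, and `${p,,}` converts to lower case.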
Partitioned tables



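The order of the command options does not matter
--outdir: specifies where the generated Java files and Schema files are stored
What is inside the .java files?
The MapReduce code that executes the import
Running the 101 files with a 30 s sleep after each takes one to two hours
101 Schema files + 1 backup archive
Python is used more for data-processing programs; scheduling scripts are generally written in shell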

cur_time=`date "+%F %T"`
#!/usr/bin/env bash
# /bin/bash
# Full import: load every table listed in full_import_tables.txt from Oracle into HDFS as Avro files
biz_date=20210101
biz_fmt_date=2021-01-01
dw_parent_dir=/data/dw/ods/one_make/full_imp
workhome=/opt/sqoop/one_make
full_imp_tables=${workhome}/full_import_tables.txt
# -p: no error if the log directory already exists
mkdir -p ${workhome}/log
orcl_srv=oracle.bigdata.cn
orcl_port=1521
orcl_sid=helowin
orcl_user=ciss
orcl_pwd=123456
sqoop_import_params="sqoop import -Dmapreduce.job.user.classpath.first=true --outdir ${workhome}/java_code --as-avrodatafile"
sqoop_jdbc_params="--connect jdbc:oracle:thin:@${orcl_srv}:${orcl_port}:${orcl_sid} --username ${orcl_user} --password ${orcl_pwd}"
# load hadoop/sqoop env
source /etc/profile
while read p; do
# parallel execution import: each sqoop job runs in the background
${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p^^} -m 1 &
# record the launched command in the log with a timestamp
cur_time=`date "+%F %T"`
echo "${cur_time}: ${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p} -m 1 &" >> ${workhome}/log/${biz_fmt_date}_full_imp.log
# wait 30 s before starting the next table
sleep 30
done < ${full_imp_tables}

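$? is a variable maintained by the Linux shell that records whether the previous command hit an error; 0 means no error. Reading $? therefore gives the success or failure flag of the previous command, which an if statement then compares with 0.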
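
A minimal sketch of that check is shown here. `run_import` is a hypothetical stand-in for a real command such as a Sqoop import, and the table names are placeholders.

```shell
#!/usr/bin/env bash
# run_import is a hypothetical stand-in for a real import command;
# it succeeds only when called with the name "good_table".
run_import() {
    [ "$1" = "good_table" ]
}

run_import "good_table"
ok_status=$?                  # 0: the previous command succeeded

# The || form captures the status of a failing command and also keeps working
# in scripts that run with "set -e".
fail_status=0
run_import "bad_table" || fail_status=$?

if [ "${fail_status}" -eq 0 ]; then
    result="import succeeded"
else
    result="import failed with exit status ${fail_status}"
fi
echo "${result}"
```

`$?` must be read immediately after the command of interest, because every later command overwrites it.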
mkdir -p, --parents: create parent directories as needed; an existing directory is not treated as an error
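
The effect of `-p` is easy to verify with a throwaway directory. The path `one_make/log` below only mirrors the naming used in these notes and is created under a temporary root.

```shell
#!/usr/bin/env bash
base_dir=$(mktemp -d)                 # throwaway root for the demo
log_dir="${base_dir}/one_make/log"    # nested path whose parent does not exist yet

mkdir -p "${log_dir}"     # creates one_make and log in one step
first_status=$?

mkdir -p "${log_dir}"     # the directory already exists: still exit status 0
second_status=$?

dir_exists=no
[ -d "${log_dir}" ] && dir_exists=yes
echo "first: ${first_status}, second: ${second_status}, exists: ${dir_exists}"
rm -rf "${base_dir}"
```

Without `-p`, the second call would fail with a "File exists" error, which matters for a script that is rerun every day.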
backup: a copy of files kept for recovery; support; reinforcement
preview
TODO: draw the technical architecture diagram?

TODO: go through the review md file or the video?


Restart YARN, then restart the thriftServer in Spark

TODO: when time allows, turn the test database setup into an automated shell script?
#!/usr/bin/env bash
# /bin/bash
biz_date=20210101
biz_fmt_date=2021-01-01
dw_parent_dir=/data/dw/ods/one_make/test_full_imp
workhome=/opt/datas/shell
full_imp_tables=${workhome}/test_full_table.txt
mkdir -p ${workhome}/log
orcl_srv=oracle.bigdata.cn
orcl_port=1521
orcl_sid=helowin
orcl_user=ciss
orcl_pwd=123456
sqoop_import_params="sqoop import -Dmapreduce.job.user.classpath.first=true --outdir ${workhome}/java_code --as-avrodatafile"
sqoop_jdbc_params="--connect jdbc:oracle:thin:@${orcl_srv}:${orcl_port}:${orcl_sid} --username ${orcl_user} --password ${orcl_pwd}"
# load hadoop/sqoop env
source /etc/profile
while read p; do
# parallel execution import
${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p^^} -m 1 &
cur_time=`date "+%F %T"`
echo "${cur_time}: ${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p} -m 1 &" >> ${workhome}/log/${biz_fmt_date}_full_imp.log
sleep 30
done < ${full_imp_tables}

TODO: when time allows, read the Runoob (菜鸟) shell syntax tutorial?