Data Warehouse Layering Design and Data Synchronization Issues (220728)
2022-07-29 08:08:00 【啊六六六】

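TODO: draw the technical architecture diagram.
Hadoop container: the most common problem is that some processes fail to start.
50070: HDFS NameNode web UI (default port in Hadoop 2.x)
8088: YARN ResourceManager web UI (default port)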



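Synchronization modes: full sync, full overwrite, incremental sync (new rows only), incremental sync (new and updated rows)
Table types: snapshot table, full table, incremental table, zipper (history) table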
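
As a rough illustration of how these modes can map onto Sqoop import options, the dry run below only prints three commands and never contacts a database. The host, table, column and path names are placeholders invented for the example. `--delete-target-dir`, `--incremental`, `--check-column`, `--last-value` and `--merge-key` are standard Sqoop 1 import options, but which one fits depends on the source table.

```bash
#!/usr/bin/env bash
# Dry run: the commands are only printed, never executed.
# HOST, SID, DEMO_TABLE, ID, UPDATE_TIME and the /tmp paths are hypothetical.
base="sqoop import --connect jdbc:oracle:thin:@HOST:1521:SID --table DEMO_TABLE"

# Full overwrite: remove the previous copy, then import everything again.
full_cmd="${base} --target-dir /tmp/demo/full --delete-target-dir"

# New rows only: rows whose ID is greater than the last imported value.
append_cmd="${base} --target-dir /tmp/demo/incr --incremental append --check-column ID --last-value 1000"

# New and updated rows: rows whose timestamp is newer than the last run,
# merged into the existing data by key.
modified_cmd="${base} --target-dir /tmp/demo/upd --incremental lastmodified --check-column UPDATE_TIME --last-value '2021-01-01 00:00:00' --merge-key ID"

printf '%s\n' "${full_cmd}" "${append_cmd}" "${modified_cmd}"
```

Dimension tables, which are small and rarely change, would typically use the first form; large fact tables are the usual candidates for the incremental forms.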

> and >>
>: overwrite; >>: append
Redirection: pointing a command's input or output somewhere new
>: output redirection
<: input redirection
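
The difference between overwriting and appending can be seen in a short Bash sketch. The temporary file and the variable names are illustrative assumptions, not part of the original notes.

```bash
#!/usr/bin/env bash
# Illustrative only: demo_file is a throwaway temp file.
demo_file=$(mktemp)

echo "first"  > "${demo_file}"    # > truncates the file, then writes
echo "second" > "${demo_file}"    # overwrites again: only "second" remains
echo "third"  >> "${demo_file}"   # >> appends after the existing content

line_count=$(wc -l < "${demo_file}")     # < feeds the file to wc on stdin
line_count=$((line_count))               # drop padding that some wc versions add
first_line=$(head -n 1 < "${demo_file}")

echo "lines=${line_count} first=${first_line}"
rm -f "${demo_file}"
```

The file ends up with two lines because the second `>` discarded "first" before writing.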

ephemeral: short-lived, temporary
The file itself counts as one of the replicas.

Combiner (MapReduce) and Spark map-side pre-aggregation
Join: a reduce-side join, i.e. a shuffle join

Support

Constraints

Primary key, unique, not null, foreign key, default value

Dimension tables do not hold much data.
Dimensions: small data volume, rarely change.
So they are fully overwritten on every sync.



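Dimension degeneration: folding a dimension into the fact table
Not every dimension can be degenerated.
Purpose: fewer dimension tables and fewer joins, which improves performance
Drawback: more redundancy
A cascading province/city/county/township hierarchy cannot be degenerated.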

Dimensional modeling process: business research
Choosing the business process: business research and data research

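-m: number of map tasks Sqoop runs in parallel (short for --num-mappers)

Ask the administrator for the connection address.

--fields-terminated-by "\001" \
\001 is Hive's default field delimiter.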
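
To make the invisible `\001` (Ctrl-A) delimiter concrete, here is a minimal Bash sketch that builds one delimited row and splits it again. The sample row is made up. Note that inside double quotes Bash passes `"\001"` through literally, so it is Sqoop that interprets the octal escape, whereas `$'\001'` below asks Bash itself to produce the control character.

```bash
#!/usr/bin/env bash
# Illustrative only: the sample row is invented.
delim=$'\001'                                  # Ctrl-A, Hive's default field delimiter
row="1001${delim}Beijing${delim}2021-01-01"    # one row as it would sit in a text file

# cut needs the real control character, so pass the expanded variable.
city=$(printf '%s\n' "${row}" | cut -d "${delim}" -f 2)

# Replace the invisible delimiter with | so the row can be inspected.
visible=$(printf '%s' "${row}" | tr '\001' '|')

echo "city=${city}"
echo "row=${visible}"
```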

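Hive: builds a mapping between an HDFS directory and a Hive table
location: specifies the HDFS path that backs the Hive table
If not specified, it defaults to /user/hive/warehouse
If specified, the table's HDFS directory is exactly that path
Purpose: the directory that stores the table's data
Query: Hive reads the mapped HDFS directory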

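Automated table creation relies on the schema files generated by Sqoop.
This has a precondition: the Oracle table structure must be usable in Hive.
Sqoop converts the types automatically, and Hive supports the resulting format.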
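
One way such automated table creation could look is a shell function that prints the Hive DDL for a table, pointing `LOCATION` at the import directory and `avro.schema.url` at the Sqoop-generated schema. This is a sketch under assumptions: the database name `one_make_ods`, the schema directory, the table name and the `.avsc` file naming are all hypothetical, and the statements are only printed, not executed.

```bash
#!/usr/bin/env bash
# Sketch only: database, schema directory, table name and .avsc naming are hypothetical.
build_ddl() {
  local db=$1 table=$2 data_dir=$3 schema_dir=$4 biz_date=$5
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS ${db}.${table}
PARTITIONED BY (dt string)
STORED AS AVRO
LOCATION '${data_dir}/${table}'
TBLPROPERTIES ('avro.schema.url'='${schema_dir}/${table^^}.avsc');
ALTER TABLE ${db}.${table} ADD IF NOT EXISTS PARTITION (dt='${biz_date}')
LOCATION '${data_dir}/${table}/${biz_date}';
EOF
}

ddl=$(build_ddl one_make_ods ciss_base_areas \
  /data/dw/ods/one_make/full_imp /data/dw/ods/one_make/avsc 20210101)
echo "${ddl}"
```

The printed statements could then be passed to `hive -e` or a similar client. Because the columns come from the Avro schema file, no column list is written by hand.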



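read: reads a line of input and stores it in a variable
Linux: by default both input and output go through the terminal
To keep output off the terminal, use output redirection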
On Linux: x
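
A minimal sketch of `read` combined with input redirection, the same pattern an import script can use to walk a table list. The temp file and the two table names are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Illustrative only: table_list stands in for a file such as a table list;
# the table names are made up.
table_list=$(mktemp)
printf '%s\n' ciss_base_areas ciss_base_device > "${table_list}"

tables_seen=0
last_table=""
# read takes one line per iteration from stdin; < makes the file stdin.
while read -r table_name; do
  tables_seen=$((tables_seen + 1))
  last_table=${table_name}
  echo "table ${tables_seen}: ${table_name}"
done < "${table_list}"

rm -f "${table_list}"
```

Because the loop is fed by redirection rather than a pipe, it runs in the current shell and the counters keep their values after `done`.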

^^: converts the table name to uppercase (the Bash ${p^^} expansion)
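
A short sketch of the case-conversion expansions, which need Bash 4 or later. The table name is a made-up example. Uppercasing is presumably needed because Oracle stores unquoted identifiers in uppercase.

```bash
#!/usr/bin/env bash
# Requires Bash 4 or later; the table name is a made-up example.
p="ciss_base_areas"

upper=${p^^}       # every character to uppercase
first=${p^}        # only the first character to uppercase
lower=${upper,,}   # every character back to lowercase

echo "${upper} ${first} ${lower}"
```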
Backslash: the escape character (a trailing \ continues the command on the next line)
Partitioned table



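The order of options in the command does not matter.
--outdir: where the generated Java files and schema files are stored
What is in the .java files?
The code that the MapReduce job executes.
Running all 101 tables with a 30 s sleep between them takes one to two hours.
101 schema files + 1 backup archive
Python is used more for data-processing programs; scheduling scripts are usually written in shell.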

cur_time=`date "+%F %T"`
#!/usr/bin/env bash
# /bin/bash
biz_date=20210101
biz_fmt_date=2021-01-01
dw_parent_dir=/data/dw/ods/one_make/full_imp
workhome=/opt/sqoop/one_make
full_imp_tables=${workhome}/full_import_tables.txt
mkdir ${workhome}/log
orcl_srv=oracle.bigdata.cn
orcl_port=1521
orcl_sid=helowin
orcl_user=ciss
orcl_pwd=123456
sqoop_import_params="sqoop import -Dmapreduce.job.user.classpath.first=true --outdir ${workhome}/java_code --as-avrodatafile"
sqoop_jdbc_params="--connect jdbc:oracle:thin:@${orcl_srv}:${orcl_port}:${orcl_sid} --username ${orcl_user} --password ${orcl_pwd}"
# load hadoop/sqoop env
source /etc/profile
while read p; do
# parallel execution import
${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p^^} -m 1 &
# log the launched command with a timestamp
cur_time=`date "+%F %T"`
echo "${cur_time}: ${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p} -m 1 &" >> ${workhome}/log/${biz_fmt_date}_full_imp.log
sleep 30
done < ${full_imp_tables}

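$?: a special shell variable holding the exit status of the previous command; 0 means no error. Reading $? therefore gives the flag that tells whether the last command failed, and the if statement compares it with 0.
-p, --parents (mkdir): create parent directories as needed; an already existing directory is not treated as an error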
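
Both notes can be tried together in a small Bash sketch. The temporary directory and the variable names are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Illustrative only: base_dir is a throwaway temp directory.
base_dir=$(mktemp -d)

mkdir -p "${base_dir}/log/2021"   # creates log/ and log/2021 in one call
first_status=$?                   # exit status of the mkdir above

mkdir -p "${base_dir}/log/2021"   # directory already exists: still not an error
second_status=$?

# Without -p an existing directory is an error, so $? is non-zero.
third_status=0
mkdir "${base_dir}/log/2021" 2>/dev/null || third_status=$?

if [ "${third_status}" -ne 0 ]; then
  echo "plain mkdir failed with status ${third_status}"
fi
echo "mkdir -p statuses: ${first_status} ${second_status}"
rm -rf "${base_dir}"
```

`$?` must be read immediately after the command of interest, because every later command overwrites it.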
backup: a reserve copy (of files, etc.); support; reinforcement
preview
TODO: draw the technical architecture diagram.

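TODO: go over the review notes (md) or the video.


Restart YARN, then restart the Spark ThriftServer.

TODO: when time allows, automate building the test database with a shell script.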
#!/usr/bin/env bash
# /bin/bash
biz_date=20210101
biz_fmt_date=2021-01-01
dw_parent_dir=/data/dw/ods/one_make/test_full_imp
workhome=/opt/datas/shell
full_imp_tables=${workhome}/test_full_table.txt
mkdir ${workhome}/log
orcl_srv=oracle.bigdata.cn
orcl_port=1521
orcl_sid=helowin
orcl_user=ciss
orcl_pwd=123456
sqoop_import_params="sqoop import -Dmapreduce.job.user.classpath.first=true --outdir ${workhome}/java_code --as-avrodatafile"
sqoop_jdbc_params="--connect jdbc:oracle:thin:@${orcl_srv}:${orcl_port}:${orcl_sid} --username ${orcl_user} --password ${orcl_pwd}"
# load hadoop/sqoop env
source /etc/profile
while read p; do
# parallel execution import
${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p^^} -m 1 &
cur_time=`date "+%F %T"`
echo "${cur_time}: ${sqoop_import_params} ${sqoop_jdbc_params} --target-dir ${dw_parent_dir}/${p}/${biz_date} --table ${p} -m 1 &" >> ${workhome}/log/${biz_fmt_date}_full_imp.log
sleep 30
done < ${full_imp_tables}
TODO: when time allows, study the shell syntax tutorial on Runoob (菜鸟).
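
The import loop launches each job in the background with `&` and never checks whether it succeeded. One possible refinement, sketched here under assumptions, is to remember each job's PID and collect the exit statuses with `wait`. The `run_import` function is a hypothetical stand-in for the real sqoop command, and the table names are invented, with one of them failing on purpose.

```bash
#!/usr/bin/env bash
# Sketch only: run_import is a hypothetical stand-in for the real sqoop command.
run_import() {
  local table=$1
  # Simulate one failing table so the failure path is visible.
  [ "${table}" != "bad_table" ]
}

pids=()
names=()
for p in table_a bad_table table_b; do
  run_import "${p}" &     # launch in the background
  pids+=("$!")            # $! is the PID of the most recent background job
  names+=("${p}")
done

failed=()
for i in "${!pids[@]}"; do
  # wait PID returns the exit status of that particular job.
  if ! wait "${pids[$i]}"; then
    failed+=("${names[$i]}")
  fi
done

echo "failed imports: ${failed[*]:-none}"
```

With real imports, the list of failed tables could be written to the log so that only those tables are retried.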