Study notes on "SAS Programming and Data Mining Business Cases" (《SAS编程和数据挖掘商业案例》) # 19
2022-07-05 20:48:00 【全栈程序员站长】
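These notes continue the series on "SAS Programming and Data Mining Business Cases" and focus on data-processing practice. They cover hash objects, user-defined formats, and regular expressions.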
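1: Hash objects
A hash object, also called a hash table, is a data structure that is accessed directly by key value.
SAS provides two classes for working with hash tables: hash, which stores the data, and hiter, which iterates over it. The hash class provides methods to look up, add, modify, and delete entries; hiter provides positioning and traversal methods such as first and next.
Advantages: key lookups are performed in memory, which improves performance;
entries can be added, updated, or deleted dynamically while the DATA step is running;
a hash table locates data quickly and reduces the number of lookups.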
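Common methods:
definekey: defines the key variables
definedata: defines the data variables
definedone: completes the definition; data can then be loaded
add: adds a key/data pair; if the key already exists in the hash table, the existing entry is left unchanged
replace: replaces the data if the key exists in the hash table; otherwise adds the key/data pair
remove: removes a key/data pair
find: looks up the key; if it exists, writes the stored data into the corresponding variables
check: looks up the key; returns rc=0 if it exists, without modifying the current variable values
output: writes the hash table to a data set
clear: empties the hash table without deleting the object
equals: determines whether two hash objects are equal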
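The methods listed above can be tried out in a small stand-alone step. This is a minimal sketch, not taken from the book: the variables code and label, their values, and the output data set work.hash_dump are hypothetical names chosen for illustration.

```sas
data _null_;
    length code $ 4 label $ 20;                 /* host variables for key and data */
    declare hash h();
    h.definekey('code');
    h.definedata('code', 'label');
    h.definedone();

    code = 'A001'; label = 'first';  rc = h.add();      /* rc = 0: key added */
    code = 'A001'; label = 'second'; rc = h.add();      /* nonzero rc: key exists, entry unchanged */
    code = 'A001'; label = 'third';  rc = h.replace();  /* key exists: stored data overwritten */
    code = 'B002'; label = 'other';  rc = h.replace();  /* key absent: pair is added */

    code = 'A001'; label = ' ';
    rc = h.check();                             /* only tests for the key; label stays blank */
    put 'after check: ' rc= label=;
    rc = h.find();                              /* copies the stored data into label */
    put 'after find:  ' rc= label=;

    rc = h.remove(key: 'B002');                 /* delete one key/data pair */
    n = h.num_items;                            /* number of entries left in the table */
    put n=;

    rc = h.output(dataset: 'work.hash_dump');   /* write the table to a data set */
run;
```

Assigning the return code (rc = h.add()) rather than calling the method as a bare statement keeps a failed add from being logged as an error, which is convenient when duplicate keys are expected.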
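Example of the find method:
libname chapt12 'f:\data_model\book_data\chapt12';
data results;
    /* never executed; at compile time it adds the participants variables to the PDV */
    if _n_ = 0 then set chapt12.participants;
    if _n_ = 1 then do;
        /* load the lookup table once, keyed by name */
        declare hash h(dataset: 'chapt12.participants');
        h.definekey('name');
        h.definedata('gender', 'treatment');
        h.definedone();
    end;
    set chapt12.weight;
    /* keep only the weight rows whose name is found in participants */
    if h.find() = 0 then
        output;
run;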
Introductory example of the hiter object:
data patients;
    length patient_id $ 16 discharge 8;
    input patient_id discharge:date9.;
    datalines;
smith-4123 15mar2004
hagen-2834 23apr2004
smith-2437 15jan2004
flinn-2940 12feb2004
;
data _null_;
    /* never executed; defines patient_id and discharge at compile time */
    if _n_ = 0 then set patients;
    declare hash ht(dataset: "patients", ordered: "ascending");
    ht.definekey("patient_id");
    ht.definedata("patient_id", "discharge");
    ht.definedone();
    declare hiter iter("ht");
    rc = iter.first();
    do while (rc = 0);
        put patient_id discharge:date9.;
        rc = iter.next();
    end;
    /* no SET statement ever executes, so end the step explicitly */
    stop;
run;
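The statement declare hiter iter("ht"); defines an iterator named iter over the hash table ht. The first method positions the iterator on the first entry of the hash table, and the next method is then called repeatedly to walk through all entries and print them.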
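The hiter class also has last and prev methods, so the same table can be walked in the opposite direction. The following is a minimal sketch that assumes the work.patients data set created in the example above already exists; it is an illustration, not code from the book.

```sas
data _null_;
    if 0 then set patients;          /* compile-time only: defines the host variables */
    declare hash ht(dataset: "patients", ordered: "ascending");
    ht.definekey("patient_id");
    ht.definedata("patient_id", "discharge");
    ht.definedone();
    declare hiter iter("ht");
    rc = iter.last();                /* start at the highest key */
    do while (rc = 0);
        put patient_id discharge date9.;
        rc = iter.prev();            /* step backwards; nonzero rc ends the loop */
    end;
    stop;
run;
```

Because the table is declared with ordered: "ascending", walking it with last and prev prints the patient IDs in descending key order.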
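Business practice: merging two data sets:
data both1(drop=rc);
    declare hash plan();
    rc = plan.definekey('plan_id');
    rc = plan.definedata('plan_desc');
    rc = plan.definedone();
    /* first pass: load every plan into the hash table */
    do until (eof1);
        set chapt12.plans end = eof1;
        rc = plan.add();
    end;
    /* second pass: look up the plan description for each member */
    do until (eof2);
        set chapt12.members end = eof2;
        call missing(plan_desc);   /* blank out the value left over from the previous row */
        rc = plan.find();
        output;
    end;
    stop;
run;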
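The program above can be simplified to:
data both2(drop=rc);
    /* plan_desc is assumed to be a character variable, hence the $;
       adjust both lengths to match chapt12.plans */
    length plan_id 3 plan_desc $ 20;
    if _n_ = 1 then do;
        declare hash h(dataset: 'chapt12.plans');
        h.definekey('plan_id');
        h.definedata('plan_desc');
        h.definedone();
        call missing(plan_desc);
    end;
    set chapt12.members;
    rc = h.find();
run;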
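For comparison, the same lookup can be written as a left join in PROC SQL. This is only a sketch under the assumption that chapt12.members and chapt12.plans share the key plan_id and that members does not already contain a plan_desc column; the output name both3 is hypothetical.

```sas
proc sql;
    /* keep every member; plan_desc is missing when no plan matches */
    create table both3 as
    select m.*, p.plan_desc
    from chapt12.members as m
         left join chapt12.plans as p
         on m.plan_id = p.plan_id;
quit;
```

The hash version reads each data set once and preserves the row order of members, whereas the row order of the SQL result is not guaranteed unless an ORDER BY clause is added.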
2: Formats
User-defined formats:
proc format;
    value $sex_fmt
        'F' = 'Female'
        'M' = 'Male'
        other = 'Unknown';
    value age_dur
        low-10  = '10 and under'
        11-13   = '11-13'
        14-<15  = '14 to under 15'
        15-high = '15 and over';
run;
Usage:
data test;
    set sashelp.class(keep=sex age);
    /* a format name must end with a period when passed to put() */
    x = put(sex, $sex_fmt.);
    y = put(age, age_dur.);
run;
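A format can also be attached with a FORMAT statement instead of put(), in which case the stored values stay unchanged and only their display (and grouping in procedures) is affected. The sketch below defines its own small format, $sex_demo, a hypothetical name used so that the example runs on its own.

```sas
proc format;
    value $sex_demo
        'F' = 'Female'
        'M' = 'Male'
        other = 'Unknown';
run;

proc freq data=sashelp.class;
    tables sex;
    format sex $sex_demo.;   /* counts are reported under the formatted labels */
run;
```

Using put() as in the step above creates new character variables, while the FORMAT statement leaves the data set as it is, so the choice depends on whether the labels are needed as data or only for reporting.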
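3: Regular expressions:
/…/ Delimiters marking the start and end of a regular expression.
| Alternation between terms (logical "or").
() Group; marks the start and end of a subexpression.
. Any character except a newline.
\w Any word character: digits, uppercase and lowercase letters, and the underscore
\W Any non-word character
\s Any whitespace character, including space, tab, newline, carriage return, the full-width Chinese space, and so on
\S Any non-whitespace character
\d Any digit 0-9
\D Any non-digit character
[…] Any one of the characters listed inside the brackets
[^…] Any one character not listed inside the brackets
[a-z] Any character in the range a to z
[^a-z] Any character outside the range a to z
^ Matches the start of the input string
$ Matches the end of the input string
\b A word boundary (the start or end of a word)
\B A non-word boundary
* Matches zero or more times
+ Matches one or more times
? Matches zero or one time
{n} Matches exactly n times
{n,} Matches at least n times
{n,m} Matches between n and m times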
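A few of these metacharacters can be seen together in a short test step. This is a minimal sketch with made-up test strings and hypothetical variable names (re_date, re_word, re_alt); it uses prxparse and prxmatch, which are described in the function list that follows.

```sas
data _null_;
    length text $ 20;
    re_date = prxparse('/^\d{4}-\d{2}-\d{2}$/');  /* anchors, \d and {n} */
    re_word = prxparse('/\bcat\b/');              /* word boundaries */
    re_alt  = prxparse('/^(yes|no)$/i');          /* group with alternation; i ignores case */
    do text = '2022-07-05', '2022-7-5', 'a cat sat', 'concatenate', 'Yes';
        /* trim() drops trailing blanks so that $ can match at the end of the value */
        is_date = prxmatch(re_date, trim(text)) > 0;
        has_cat = prxmatch(re_word, trim(text)) > 0;
        yes_no  = prxmatch(re_alt,  trim(text)) > 0;
        put text= is_date= has_cat= yes_no=;
    end;
run;
```

SAS character variables are padded with blanks to their declared length, which is why the value is trimmed before it is matched against a pattern that ends with $.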
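Commonly used functions:
prxparse: compiles a regular expression and returns its pattern ID
prxmatch: returns the position of the first match of the pattern
call prxsubstr: returns the start position and length of the match in the target string
prxposn: returns the text matched by a subexpression (capture group)
call prxposn: returns the position and length of the match for a subexpression
call prxnext: returns the position and length of successive matches in the target string
prxchange: replaces the text matched by the pattern and returns the result
call prxchange: replaces the text matched by the pattern and writes the result to a variable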
eg1:
data _null_;
    /* compile the Perl-style pattern once; prxparse, not the older rxparse */
    if _n_ = 1 then pattern_num = prxparse("/cat/");
    retain pattern_num;
    input string $30.;
    /* position of the first match, or 0 if there is none */
    position = prxmatch(pattern_num, string);
    file print;
    put pattern_num= string= position=;
    datalines;
there is a cat in this line.
does not match cat
cat in the beginning
at the end, a cat
cat
;
run;
eg2: data validation
data match_phone;
    set chapt12.phone_numbers;
    /* (ddd), an optional space, then ddd-dddd */
    if _n_ = 1 then pattern = prxparse("/\(\d\d\d\) ?\d\d\d-\d{4}/");
    retain pattern;
    if prxmatch(pattern, phone) gt 0 then output;
run;
Find the phone numbers that do not match:
data unmatch_phone;
    set chapt12.phone_numbers;
    where not prxmatch("/\(\d\d\d\) ?\d\d\d-\d{4}/", phone);
run;
eg3: extracting the strings that match a pattern
data extract;
    if _n_ = 1 then do;
        pattern = prxparse("/\(\d\d\d\) ?\d\d\d-\d{4}/");
        /* prxparse returns a missing value if the pattern does not compile */
        if missing(pattern) then do;
            put "error in compiling regular expression";
            stop;
        end;
    end;
    retain pattern;
    length number $ 15;
    input string $char80.;
    /* start and length describe the first match; start is 0 if there is none */
    call prxsubstr(pattern, string, start, length);
    if start gt 0 then do;
        number = substr(string, start, length);
        number = compress(number, " ");
        output;
    end;
    keep number;
    datalines;
this line does not have any phone numbers on it
this line does: (123)345-4567 la di la di la
also valid (123) 999-9999
two numbers here (333)444-5555 and (800)123-4567
;
run;
eg4: extracting names
data ReversedNames;
    input name & $32.;
    datalines;
Jones, Fred
Kavich, Kate
Turley, Ron
Dulix, Yolanda
;
data FirstLastNames;
    length first last $ 16;
    keep first last;
    retain re;
    if _N_ = 1 then
        re = prxparse('/(\w+), (\w+)/');
    set ReversedNames;
    if prxmatch(re, name) then
    do;
        last = prxposn(re, 1, name);
        first = prxposn(re, 2, name);
    end;
run;
Note: 1 and 2 refer to the first and second groups in the regular expression.
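The reordering done in eg4 can also be expressed as a single substitution, using $1 and $2 in the replacement text to refer to the two groups. This is a minimal sketch with two of the names from eg4 typed in directly; the variable swapped is a hypothetical name.

```sas
data _null_;
    length name swapped $ 32;
    do name = 'Jones, Fred', 'Kavich, Kate';
        /* s/pattern/replacement/: swap the two captured words, once per value */
        swapped = prxchange('s/(\w+), (\w+)/$2 $1/', 1, name);
        put name= swapped=;
    end;
run;
```

The prxposn approach in eg4 is the better fit when the parts are needed as separate variables; prxchange is shorter when only the rearranged string is wanted.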
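eg5: extracting names that follow a given layout
data old;
    input name $60.;
    datalines;
Judith S Reaveley
Ralph F. Morgan
Jess Ennis
Carol Echols
Kelly Hansen Huff
Judith
Nick
Jones
;
data new;
    length first middle last test $ 40;
    /* re1: first name, optional middle name, last name */
    re1 = prxparse('/(\S+)\s+([^\s]+\s+)?(\S+)/o');
    /* re2: the same pattern with the separating whitespace captured as group 2,
       giving four groups; the "(?)" in the source is not a valid group and is
       read here as a misplaced "?" */
    re2 = prxparse('/(\S+)(\s+)([^\s]+\s+)?(\S+)/o');
    set old;
    id1 = prxmatch(re1, name);
    id2 = prxmatch(re2, name);
    if id1 then
    do;
        first = prxposn(re1, 1, name);
        middle = prxposn(re1, 2, name);
        last = prxposn(re1, 3, name);
    end;
    /* group 4 exists only in re2, so it is read from re2 rather than re1 */
    if id2 then test = prxposn(re2, 4, name);
    put test=;
run;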
eg6: returning the positions of multiple matches
data _null_;
    expressionid = prxparse('/[crb]at/');
    text = 'the woods have a bat, cat, and a rat!';
    start = 1;
    stop = length(text);
    /* call prxnext advances start past each match it finds */
    call prxnext(expressionid, start, stop, text, position, length);
    do while (position > 0);
        found = substr(text, position, length);
        put found= position= length=;
        call prxnext(expressionid, start, stop, text, position, length);
    end;
run;
Note: the first call prxnext returns a position. Inside the loop, the matching substring is extracted and call prxnext is run again, which returns the position of the next match; the loop ends when position is 0.
eg7: replacing text
data cat_and_mouse;
    input text $char40.;
    length new_text $ 80;
    if _n_ = 1 then match = prxparse("s/[Cc]at/mouse/");
    retain match;
    /* -1 replaces every match; trunc is 1 if the result did not fit in new_text */
    call prxchange(match, -1, text, new_text, len, trunc, num);
    if trunc then put "note: new_text was truncated";
    datalines;
the Cat in the hat
there are two cat cats in this line
here is no replacement
;
run;
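Copyright notice: this is an original blog article and may not be reproduced without permission.
Publisher: 全栈程序员栈长. Please credit the source when reproducing: https://javaforall.cn/117664.html Original link: https://javaforall.cn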