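Iteratively Loading Datasets with XGBoost
2022-06-27 06:35:00 [Datawhale]
Datawhale practical content
Source: Coggle Data Science (Coggle数据科学)
When training on a large dataset, loading the data in batches through an iterator is often a good fit. PyTorch supports this style of iterative loading; the rest of this post shows how to do the same in XGBoost.
Reading In-Memory Data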
import numpy as np
import pandas as pd
import xgboost as xgb

class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.batch_size = batch_size
        # Number of batches, rounding up so the last partial batch is included
        self.batches = int(np.ceil(len(df) / self.batch_size))
        self.it = 0  # set iterator to 0
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0  # Return 0 when there's no more batch.
        # Row range [a, b) of the current batch
        a = self.it * self.batch_size
        b = min((self.it + 1) * self.batch_size, len(self.df))
        dt = pd.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target])  # , weight=dt['weight'])
        self.it += 1
        return 1

How to call it (this approach is better suited to GPU training):
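# train, train_idx and FEATURES come from the surrounding training script
Xy_train = IterLoadForDMatrix(train.loc[train_idx], FEATURES, 'target')
# Note: in XGBoost 1.7 and later, DeviceQuantileDMatrix is deprecated in favor of xgb.QuantileDMatrix
dtrain = xgb.DeviceQuantileDMatrix(Xy_train, max_bin=256)

Reference: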
https://xgboost.readthedocs.io/en/latest/python/examples/quantile_data_iterator.html
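The batch boundaries that next() computes in the class above can be checked on their own, without XGBoost. The following is a minimal sketch using only the standard library; the helper name batch_bounds is hypothetical and is not part of XGBoost or of the original post.

```python
import math


def batch_bounds(n_rows, batch_size):
    """Return the [start, stop) row ranges visited by a batch iterator.

    Mirrors the arithmetic in IterLoadForDMatrix: the number of batches is
    rounded up, and the last batch is clipped to the number of rows.
    """
    n_batches = math.ceil(n_rows / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, n_rows))
        for i in range(n_batches)
    ]


bounds = batch_bounds(n_rows=10, batch_size=4)
print(bounds)  # [(0, 4), (4, 8), (8, 10)]
```

With 10 rows and a batch size of 4, the last batch holds only the remaining 2 rows, which is why the upper bound is clipped with min().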
Iteratively Reading External Data
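import os
from typing import Callable, List

import xgboost
from sklearn.datasets import load_svmlight_file

class Iterator(xgboost.DataIter):
    def __init__(self, svm_file_paths: List[str]):
        self._file_paths = svm_file_paths
        self._it = 0
        # cache_prefix tells XGBoost where to place its on-disk cache files
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data: Callable):
        if self._it == len(self._file_paths):
            # return 0 to let XGBoost know this is the end of iteration
            return 0
        # Load one file per call and hand it to XGBoost as a batch
        X, y = load_svmlight_file(self._file_paths[self._it])
        input_data(X, y)
        self._it += 1
        return 1

    def reset(self):
        """Reset the iterator to its beginning"""
        self._it = 0

How to call it (this approach is better suited to CPU training):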
it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
Xy = xgboost.DMatrix(it)
# Other tree methods, including ``hist`` and ``gpu_hist``, also work but have some caveats,
# as noted in the documentation referenced below.
booster = xgboost.train({"tree_method": "approx"}, Xy)

Reference:
https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html
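Both iterators above follow the same contract: reset() rewinds to the first batch, and next() hands one batch to the input_data callback and returns 1, or returns 0 when the data is exhausted. The sketch below illustrates that contract in plain Python. ListBatchIter and drain are hypothetical stand-ins written for this illustration; XGBoost's real consumer is internal to the library and is not reproduced here.

```python
class ListBatchIter:
    """Hypothetical stand-in that follows the same reset()/next() contract."""

    def __init__(self, batches):
        self._batches = batches
        self._it = 0

    def reset(self):
        # Rewind so the next pass starts from the first batch again
        self._it = 0

    def next(self, input_data):
        if self._it == len(self._batches):
            return 0  # no more batches
        X, y = self._batches[self._it]
        input_data(X, y)  # hand the current batch to the consumer
        self._it += 1
        return 1


def drain(it):
    """Hypothetical consumer: rewind, then pull batches until next() returns 0."""
    received = []
    it.reset()
    while it.next(lambda X, y: received.append((X, y))):
        pass
    return received


batches = [([[1.0], [2.0]], [0, 1]), ([[3.0]], [1])]
it = ListBatchIter(batches)
first_pass = drain(it)
second_pass = drain(it)  # works only because reset() rewinds the iterator
```

A consumer may need to pass over the data more than once, so a reset() that fails to rewind the position would leave every pass after the first empty.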
