Federated learning: partitioning non-IID samples with the Dirichlet distribution
2022-06-30 02:55:00 【Illusory private school】
In "Random sampling and probability distributions in Python (Part 2)", we described how to sample from a probability distribution with existing Python libraries. One of those distributions was the Dirichlet distribution, which should be familiar by now. Its probability density function is
$$
P(\bm{x}; \bm{\alpha}) \propto \prod_{i=1}^{k} x_{i}^{\alpha_{i}-1} \\
\bm{x}=(x_1,x_2,\dots,x_k),\quad x_i > 0,\quad \sum_{i=1}^k x_i = 1 \\
\bm{\alpha} = (\alpha_1,\alpha_2,\dots,\alpha_k),\quad \alpha_i > 0
$$
where $\bm{\alpha}$ is the parameter vector.
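As a quick illustration of the density above, the following minimal sketch draws a few vectors from a symmetric Dirichlet distribution with NumPy. The helper name `sample_proportions`, the seed, and the chosen values of alpha are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the run repeatable

def sample_proportions(alpha, k, rng):
    """Draw one k-dimensional vector from a symmetric Dirichlet(alpha, ..., alpha)."""
    return rng.dirichlet([alpha] * k)

for alpha in (0.1, 1.0, 100.0):
    x = sample_proportions(alpha, k=5, rng=rng)
    # Each draw lies on the probability simplex: its entries sum to 1.
    print(f"alpha={alpha:>5}: {np.round(x, 3)}  sum={x.sum():.3f}")
```

For a symmetric parameter vector, small values of alpha tend to concentrate most of the mass on a few components, while large values give nearly uniform vectors.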
In federated learning, we often assume that the datasets held by different clients are not independent and identically distributed (non-IID). How, then, do we partition an existing dataset in a non-IID way? The generative distribution of labeled samples can be written as $p(\bm{x}, y)$, which factorizes as $p(\bm{x}, y)=p(\bm{x}|y)p(y)$. Estimating $p(\bm{x}|y)$ is computationally expensive, whereas estimating $p(y)$ is cheap. Partitioning the samples according to their label distribution is therefore a simple and efficient approach.
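To make the point about $p(y)$ concrete: estimating it only requires counting labels. The sketch below uses a hypothetical helper `empirical_label_distribution` and a toy label array; neither comes from the original code.

```python
import numpy as np

def empirical_label_distribution(labels, n_classes):
    """Estimate p(y) as the relative frequency of each label."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

labels = np.array([0, 0, 1, 2, 2, 2])  # toy labels, for illustration only
p_y = empirical_label_distribution(labels, n_classes=3)
print(p_y)  # relative frequencies 2/6, 1/6 and 3/6
```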
In short, the algorithm makes the label distribution differ from client to client. Suppose there are $K$ class labels and $N$ clients, and the samples of each class must be divided among the clients in different proportions. Let the matrix $\bm{X}\in \mathbb{R}^{K \times N}$ be the class label distribution matrix. Its row vector $\bm{x}_k\in \mathbb{R}^N$ is the probability distribution of class $k$ over the clients (each component is the proportion of class-$k$ samples assigned to one client), and this random vector is sampled from a Dirichlet distribution.
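The matrix $\bm{X}$ can be sketched in a few lines. The sizes `K` and `N`, the value of `alpha`, and the per-class sample counts in `class_sizes` are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 4                             # hypothetical: 3 classes, 4 clients
alpha = 0.5                             # hypothetical Dirichlet parameter
class_sizes = np.array([100, 200, 50])  # hypothetical number of samples per class

# Row k of X is x_k: the share of class k that each of the N clients receives.
X = rng.dirichlet([alpha] * N, size=K)  # shape (K, N)

# Multiplying each row by the class size gives the expected sample counts.
expected_counts = X * class_sizes[:, None]
print(np.round(X, 3))
print(np.round(expected_counts, 1))
```

Each row of `X` sums to 1, so the expected counts in row `k` add up to the size of class `k`.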
Accordingly, we can write the following partition algorithm:
import numpy as np
np.random.seed(42)

def split_noniid(train_labels, alpha, n_clients):
    '''
    Split the sample indices into n_clients subsets according to a
    Dirichlet distribution with parameter alpha.
    '''
    n_classes = train_labels.max() + 1
    # (K, N) class label distribution matrix X: the share of each class
    # that goes to each client
    label_distribution = np.random.dirichlet([alpha] * n_clients, n_classes)
    # Sample indices belonging to each of the K classes
    class_idcs = [np.argwhere(train_labels == y).flatten()
                  for y in range(n_classes)]
    # Sample indices assigned to each of the N clients
    client_idcs = [[] for _ in range(n_clients)]
    for c, fracs in zip(class_idcs, label_distribution):
        # np.split divides the samples of class k into N subsets in proportion to fracs;
        # i is the client index and idcs the indices assigned to client i
        for i, idcs in enumerate(np.split(c, (np.cumsum(fracs)[:-1] * len(c)).astype(int))):
            client_idcs[i] += [idcs]
    client_idcs = [np.concatenate(idcs) for idcs in client_idcs]
    return client_idcs
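To see how the split positions in the inner loop come about, here is a minimal sketch for a single class. The proportions in `fracs` and the eight sample indices are made-up values, chosen so that the arithmetic is exact in floating point.

```python
import numpy as np

idx = np.arange(8)                    # indices of the 8 samples of one class
fracs = np.array([0.5, 0.25, 0.25])   # hypothetical proportions for 3 clients

# The cumulative proportions, without the final 1.0, scaled by the class size
# give the positions at which np.split cuts the index array.
cut_points = (np.cumsum(fracs)[:-1] * len(idx)).astype(int)
parts = np.split(idx, cut_points)

print(cut_points)                     # [4 6]
print([p.tolist() for p in parts])    # [[0, 1, 2, 3], [4, 5], [6, 7]]
```

Because `astype(int)` truncates, each cut point is rounded down, and the last client receives whatever remains after the final cut.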
Next, we call this function on the EMNIST dataset and visualize the result. We set the number of clients to $N=10$ and the Dirichlet parameter vector $\bm{\alpha}$ to $\alpha_i=1.0,\ i=1,2,\dots,N$:
import torch
from torchvision import datasets
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(42)

if __name__ == "__main__":
    N_CLIENTS = 10
    DIRICHLET_ALPHA = 1.0

    train_data = datasets.EMNIST(root=".", split="byclass", download=True, train=True)
    test_data = datasets.EMNIST(root=".", split="byclass", download=True, train=False)
    n_channels = 1
    input_sz, num_cls = train_data.data[0].shape[0], len(train_data.classes)

    train_labels = np.array(train_data.targets)

    # Give each client a different number of samples per label,
    # which yields a non-IID partition
    client_idcs = split_noniid(train_labels, alpha=DIRICHLET_ALPHA, n_clients=N_CLIENTS)

    # Plot the label distribution on each client
    plt.figure(figsize=(20, 3))
    plt.hist([train_labels[idc] for idc in client_idcs], stacked=True,
             bins=np.arange(min(train_labels) - 0.5, max(train_labels) + 1.5, 1),
             label=["Client {}".format(i) for i in range(N_CLIENTS)], rwidth=0.5)
    plt.xticks(np.arange(num_cls), train_data.classes)
    plt.legend()
    plt.show()
The final visualization is as follows:
As you can see, the 62 class labels are indeed distributed differently across the clients, which indicates that our sample partitioning algorithm works as intended.
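A partition can also be checked numerically by tabulating the label counts per client. The sketch below is one possible way to do that; the helper `label_counts_per_client` is hypothetical, and the toy labels and index sets stand in for `train_labels` and the output of the partition function.

```python
import numpy as np

def label_counts_per_client(labels, client_idcs, n_classes):
    """Return an (n_clients, n_classes) matrix of label counts."""
    return np.stack([np.bincount(labels[idcs], minlength=n_classes)
                     for idcs in client_idcs])

labels = np.array([0, 0, 0, 1, 1, 2])                         # toy labels
client_idcs = [np.array([0, 1, 3]), np.array([2, 4, 5])]      # toy partition
counts = label_counts_per_client(labels, client_idcs, n_classes=3)

print(counts)
# [[2 1 0]
#  [1 1 1]]

# Sanity check: summing over clients recovers the overall label counts,
# so no sample was dropped or duplicated.
assert (counts.sum(axis=0) == np.bincount(labels, minlength=3)).all()
```

Rows that differ strongly from one another correspond to clients with different label distributions.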