Lucene hnsw merge optimization
2022-07-03 07:30:00 【chuanyangwang】
Before optimization
During a merge, hnsw generates a mapping from merged ordinals to segments and segment ordinals. This mapping adds overhead.
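To make the cost concrete, here is a minimal standalone sketch of this kind of lookup, assuming the per-segment bases are prefix sums of the segment sizes. The class name `OrdMapLookupSketch` and the helpers `buildOrdBase` and `resolve` are hypothetical names chosen for illustration; they are not Lucene API. Only `java.util.Arrays.binarySearch` is a real library call.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
public class OrdMapLookupSketch {

  // ordBase[k] is the first concatenated ordinal that belongs to segment k
  // (a prefix sum over the per-segment vector counts).
  static int[] buildOrdBase(int[] segmentSizes) {
    int[] ordBase = new int[segmentSizes.length];
    int lastBase = 0;
    for (int k = 0; k < segmentSizes.length; k++) {
      ordBase[k] = lastBase;
      lastBase += segmentSizes[k];
    }
    return ordBase;
  }

  // Resolves a concatenated ordinal to {segment index, ordinal inside that segment}.
  static int[] resolve(int[] ordBase, int unmappedOrd) {
    int segmentOrd = Arrays.binarySearch(ordBase, unmappedOrd);
    if (segmentOrd < 0) {
      // binarySearch returned -(insertionPoint) - 1; step back to the greatest lower bound
      segmentOrd = -2 - segmentOrd;
    }
    // empty segments share their base with the next segment, so skip forward over them
    while (segmentOrd < ordBase.length - 1 && ordBase[segmentOrd + 1] == ordBase[segmentOrd]) {
      segmentOrd++;
    }
    return new int[] {segmentOrd, unmappedOrd - ordBase[segmentOrd]};
  }

  public static void main(String[] args) {
    // three segments holding 3, 0 and 4 vectors
    int[] ordBase = buildOrdBase(new int[] {3, 0, 4});
    System.out.println(Arrays.toString(ordBase)); // prints [0, 3, 3]
    System.out.println(Arrays.toString(resolve(ordBase, 5))); // prints [2, 2]
  }
}
```

Every vector read during graph building pays for one array lookup plus a binary search over the segment bases, which is the overhead the optimization removes.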
org.apache.lucene.codecs.KnnVectorsWriter.VectorValuesMerger#nextDoc
public int nextDoc() throws IOException {
  current = docIdMerger.next();
  if (current == null) {
    docId = NO_MORE_DOCS;
    /* update the size to reflect the number of *non-deleted* documents seen so we can support
     * random access. */
    size = ord;
  } else {
    docId = current.mappedDocID;
    ordMap[ord++] = ordBase[current.segmentIndex] + current.count - 1;
  }
  return docId;
}
How ordBase is generated
VectorValuesMerger(List<VectorValuesSub> subs, MergeState mergeState) throws IOException {
  this.subs = subs;
  docIdMerger = DocIDMerger.of(subs, mergeState.needsIndexSort);
  int totalCost = 0, totalSize = 0;
  for (VectorValuesSub sub : subs) {
    totalCost += sub.values.cost();
    totalSize += sub.values.size();
  }
  /* This size includes deleted docs, but when we iterate over docs here (nextDoc())
   * we skip deleted docs. So we sneakily update this size once we observe that iteration is complete.
   * That way by the time we are asked to do random access for graph building, we have a correct size.
   */
  cost = totalCost;
  size = totalSize;
  ordMap = new int[size];
  ordBase = new int[subs.size()];
  int lastBase = 0;
  for (int k = 0; k < subs.size(); k++) {
    int size = subs.get(k).values.size();
    ordBase[k] = lastBase;
    lastBase += size;
  }
  docId = -1;
}
How a vector is retrieved from target
public float[] vectorValue(int target) throws IOException {
  int unmappedOrd = ordMap[target];
  int segmentOrd = Arrays.binarySearch(ordBase, unmappedOrd);
  if (segmentOrd < 0) {
    // get the index of the greatest lower bound
    segmentOrd = -2 - segmentOrd;
  }
  while (segmentOrd < ordBase.length - 1 && ordBase[segmentOrd + 1] == ordBase[segmentOrd]) {
    // forward over empty segments which will share the same ordBase
    segmentOrd++;
  }
  return raSubs.get(segmentOrd).vectorValue(unmappedOrd - ordBase[segmentOrd]);
}
After optimization
The merge no longer works by adding an extra layer on top of the individual segments. Instead, the vectors are written one after another to a temporary file.
org.apache.lucene.codecs.lucene90.Lucene90HnswVectorsWriter#writeVectorData
private static int[] writeVectorData(IndexOutput output, VectorValues vectors)
    throws IOException {
  int[] docIds = new int[vectors.size()];
  int count = 0;
  for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc(), count++) {
    // write vector
    BytesRef binaryValue = vectors.binaryValue();
    assert binaryValue.length == vectors.dimension() * Float.BYTES;
    output.writeBytes(binaryValue.bytes, binaryValue.offset, binaryValue.length);
    docIds[count] = docV;
  }
  if (docIds.length > count) {
    return ArrayUtil.copyOfSubArray(docIds, 0, count);
  }
  return docIds;
}
https://github.com/apache/lucene/pull/617
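Once the vectors sit back to back in a single file, a vector can be located from its ordinal with plain offset arithmetic, with no ordMap and no binary search. The following is a rough sketch of that idea in plain Java, using an in-memory `ByteBuffer` in place of the temporary file. The class `FlatVectorStoreSketch`, its methods, and the little-endian byte order are assumptions made for illustration; they are not taken from the Lucene source.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class, not part of Lucene.
public class FlatVectorStoreSketch {

  // Appends all vectors back to back, in merged-ordinal order.
  static ByteBuffer writeAll(float[][] vectors, int dimension) {
    ByteBuffer buf =
        ByteBuffer.allocate(vectors.length * dimension * Float.BYTES)
            .order(ByteOrder.LITTLE_ENDIAN); // byte order is an arbitrary choice for this sketch
    for (float[] vector : vectors) {
      if (vector.length != dimension) {
        throw new IllegalArgumentException("unexpected vector dimension: " + vector.length);
      }
      for (float x : vector) {
        buf.putFloat(x);
      }
    }
    return buf;
  }

  // Random access by ordinal: the vector starts at ord * dimension * Float.BYTES.
  static float[] vectorValue(ByteBuffer buf, int dimension, int ord) {
    int offset = ord * dimension * Float.BYTES;
    float[] out = new float[dimension];
    for (int i = 0; i < dimension; i++) {
      out[i] = buf.getFloat(offset + i * Float.BYTES); // absolute read, does not move the position
    }
    return out;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 2f}, {3f, 4f}, {5f, 6f}};
    ByteBuffer buf = writeAll(vectors, 2);
    float[] v = vectorValue(buf, 2, 1);
    System.out.println(v[0] + ", " + v[1]); // prints 3.0, 4.0
  }
}
```

The trade-off in this layout is an extra sequential write of every live vector during the merge, in exchange for constant-time lookups while the graph is built.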