FeatureHasher
Convert a collection of features to a fixed-dimensional matrix using the hashing trick.
Note: this requires Jina>=2.2.4.
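For intuition, the hashing trick can be sketched in a few lines of plain Python. This is a minimal illustration and not the executor's actual implementation: the function name hash_features, the whitespace tokenization, and the choice of MD5 as a stable hash are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def hash_features(text, n_dim=128):
    """Map a text to a fixed-length vector using the hashing trick.

    Hypothetical helper for illustration; not part of Jina or FeatureHasher.
    """
    vec = [0.0] * n_dim
    for token, count in Counter(text.lower().split()).items():
        # MD5 is used only as a stable, process-independent hash (an assumption).
        h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
        index = h % n_dim  # bucket this token falls into
        # One extra hash bit picks a sign, so colliding tokens tend to cancel
        # rather than always add up.
        sign = 1.0 if (h >> 64) & 1 == 0 else -1.0
        vec[index] += sign * count
    return vec


v1 = hash_features('It is a truth universally acknowledged')
v2 = hash_features('it is a TRUTH universally acknowledged')
print(len(v1), v1 == v2)
```

The output size depends only on n_dim, so no vocabulary has to be built or stored. The trade-off is that unrelated tokens can occasionally land in the same bucket.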
Example
Here I use FeatureHasher to hash each sentence of Pride and Prejudice into a 128-dim vector, and then use .match to find the top-K similar sentences.
from jina import Document, DocumentArray, Flow

# load the full text of the book
d = Document(uri='https://www.gutenberg.org/files/1342/1342-0.txt').convert_uri_to_text()

# split into non-empty lines (one "sentence" each) and store them in a DocumentArray
da = DocumentArray(Document(text=s.strip()) for s in d.text.split('\n') if s.strip())

# use FeatureHasher in a Flow
f = Flow().add(uses='jinahub://FeatureHasher')

embed_da = DocumentArray()
with f:
    # collect the embedded Documents returned by the Flow
    f.post('/', da, on_done=lambda req: embed_da.extend(req.docs), show_progress=True)

print('self-matching...')
# normalization=(1, 0) rescales distances so the closest match scores 1 and the farthest 0
embed_da.match(embed_da, exclude_self=True, limit=5, normalization=(1, 0))
print('total sentences: ', len(embed_da))
for d in embed_da:
    print(d.text)
    for m in d.matches:
        print(m.scores['cosine'], m.text)
    input()  # press Enter to show the next sentence
🎉 Flow is ready to use!
🔗 Protocol: GRPC
🏠 Local access: 0.0.0.0:52628
🔒 Private network: 192.168.178.31:52628
🌐 Public address: 217.70.138.123:52628
⠹ DONE ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:01 100% ETA: 0 seconds 40 steps done in 1 second
total sentences: 12153
The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen
*** END OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
production, promotion and distribution of Project Gutenberg-tm
Pride and Prejudice
By Jane Austen

This eBook is for the use of anyone anywhere in the United States and
This eBook is for the use of anyone anywhere in the United States and
by the awkwardness of the application, and at length wholly
Elizabeth passed the chief of the night in her sister’s room, and
the happiest memories in the world. Nothing of the past was
charities and charitable donations in all 50 states of the United
In practice, you can implement matching and storing via an indexer inside the Flow. This example is for demonstration only, so any operations unrelated to feature hashing are implemented outside the Flow to avoid distraction.
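To picture what the self-matching step in the example computes, here is a brute-force sketch in plain Python. It is not how .match is implemented internally: the helper names cosine_similarity and top_k_matches are hypothetical, and it scores by raw cosine similarity rather than the normalized distance used in the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(vectors, k=5):
    """For each vector, return (score, index) pairs of its k most similar other vectors."""
    results = []
    for i, query in enumerate(vectors):
        scored = [
            (cosine_similarity(query, other), j)
            for j, other in enumerate(vectors)
            if j != i  # skip the query itself, like exclude_self=True
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results.append(scored[:k])
    return results


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
matches = top_k_matches(vectors, k=1)
for i, best in enumerate(matches):
    print(i, best)
```

This compares every vector with every other one, so the cost grows quadratically with the number of sentences. That is acceptable for a demo of this size, but it is one reason a dedicated indexer is preferable in practice.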