Since the `EMOJI_DATA` "dict of dicts" that now contains all the data was added in #189, I would like to propose some further (breaking) changes to the package:
- Remove the old dicts
- Only generate the dict that is needed for the language
- Update the language data
- Specify what happens when an emoji is not translated
If we make any of these changes, I would suggest bumping the version number to 2.0.0 to indicate that the release is not backwards compatible.
I would implement these changes myself.
Please reply with your thoughts and any other suggestions or objections.
## Remove the old dicts
The old dicts like `EMOJI_UNICODE_ENGLISH`, `UNICODE_EMOJI_ENGLISH`, `EMOJI_UNICODE_GERMAN`, etc. are now generated from the dict of dicts `EMOJI_DATA`.
We don't need any of the old `UNICODE_EMOJI_*` dicts anymore, because we can just do `EMOJI_DATA[unicode_point]`.
Currently they are public in the package, so removing them would be a breaking change.
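For illustration, a lookup could then go straight through the dict of dicts (a minimal sketch; the exact keys shown, such as the `'en'` and `'de'` entries, are assumptions about the `EMOJI_DATA` layout):

```python
import emoji

# Look up an emoji directly in EMOJI_DATA instead of the old
# UNICODE_EMOJI_* dicts (the 'en'/'de' keys are assumed here):
entry = emoji.EMOJI_DATA['\U0001F34F']  # 🍏 GREEN APPLE
print(entry['en'])       # ':green_apple:'
print(entry.get('de'))   # the German name, if a translation exists
```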
## Only generate the dict that is needed for the language
We still need the `EMOJI_UNICODE_*` dicts to reverse the dict of dicts: `EMOJI_UNICODE_ENGLISH[":apple:"] = unicode_point_apple`
Currently we generate all of `EMOJI_UNICODE_ENGLISH`, `EMOJI_UNICODE_GERMAN`, etc. when the package is imported.
I would like to generate them only if the language is actually used, because it is unlikely that someone uses different languages at the same time.
I suggest using the following structure to store the dicts, so the data is only generated when needed:
```python
_EMOJI_UNICODE = {
    'en': None,
    'es': None,
    'pt': None,
    'it': None,
    'fr': None,
    'de': None,
}

_ALIASES_UNICODE = {}
```
I would change to these variable names and prefix them with `_` to indicate that they are not public. Because they are only generated when needed and are otherwise empty, they should not be used directly and therefore should not be public.
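A minimal sketch of the lazy generation (the helper name `_get_emoji_unicode_dict` and the exact inversion are my assumptions, not a final implementation):

```python
def _get_emoji_unicode_dict(lang: str) -> dict:
    """Build the EMOJI_UNICODE dict for lang on first use and cache it."""
    if _EMOJI_UNICODE[lang] is None:
        # Invert EMOJI_DATA for this one language only,
        # e.g. {':apple:': '\U0001F34E', ...}
        _EMOJI_UNICODE[lang] = {
            data[lang]: emj
            for emj, data in EMOJI_DATA.items()
            if lang in data
        }
    return _EMOJI_UNICODE[lang]
```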
## ~~Improve demojize performance~~
Implemented in Version 1.6.2
The speed of `demojize()` has become slower with the Python 3 releases. The problem is that we use a huge regular expression consisting of the code points of all emoji:
`len(emoji.get_emoji_regexp().pattern) = 19559`
This was fast in Python 2, but in Python 3 it is really slow (about 15x slower on Python 3.10 than on Python 2.7).
I am already working on this: I replaced the regular expression with a search in a [tree](https://en.wikipedia.org/wiki/Tree_(data_structure)), which improves the speed by about 5-10x. I will test this further to see if it works.
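A minimal sketch of the tree approach (the helper names `_add_to_tree` and `_find_emoji` are hypothetical; the real implementation will differ):

```python
# Store each emoji as a path of characters in a nested dict, then
# scan the input once, following the tree instead of running a
# giant regular expression.
_SEARCH_TREE = {}

def _add_to_tree(emj: str) -> None:
    node = _SEARCH_TREE
    for char in emj:
        node = node.setdefault(char, {})
    node['data'] = emj  # marks the end of a complete emoji

def _find_emoji(text: str, pos: int):
    """Return the longest emoji starting at text[pos], or None."""
    node = _SEARCH_TREE
    best = None
    for i in range(pos, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if 'data' in node:
            best = node['data']
    return best

_add_to_tree('\U0001F52A')              # 🔪
_add_to_tree('\U0001F1E9\U0001F1EA')    # 🇩🇪 (two code points)
print(_find_emoji('a\U0001F52Ab', 1))   # '\U0001F52A'
```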
There was a suggestion in #30 to use ranges, e.g. `[\U0001F1E6-\U0001F1FF]`, for consecutive code points in the regular expression. I initially tested this a little bit by replacing a few emoji with ranges by hand. It would also be faster than the current regular expression, but I think it would be harder to implement, and it is unclear to me how much faster it would be in the end.
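For illustration, the range from #30 collapses the 26 regional indicator symbols into a single character class (a hedged example, not code from the package):

```python
import re

# One character class instead of 26 alternatives in the pattern:
flag_letters = re.compile('[\U0001F1E6-\U0001F1FF]')

print(bool(flag_letters.match('\U0001F1E9')))  # True: regional indicator 'D'
print(bool(flag_letters.match('D')))           # False: plain ASCII letter
```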
## Update the language data
The emojiterra.com website uses the tag `release-38` for the translations, e.g. https://github.com/unicode-org/cldr/raw/release-38/common/annotations/de.xml. It is from October 2020.
There are three newer releases for the translations: `release-38-1` from December 2020, `release-39` from April 2021, and `release-40` from the end of October 2021.
I don't know if we want to use the newer ones, or if we want to wait until emojiterra.com updates.
At the moment our translation data is a mixture of different (older) versions. When we update to either 38, 38.1, 39, or 40, some existing translations will change. That is a problem if someone has stored a text with the old translation in their database, so it is a breaking change.
I would suggest updating at least to `release-38`, so we have the same names as emojiterra.com.
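For reference, this is roughly how the `release-38` names could be pulled from the CLDR annotation files (a sketch based on the public CLDR XML format, not our actual generator script):

```python
import urllib.request
import xml.etree.ElementTree as ET

URL = 'https://github.com/unicode-org/cldr/raw/release-38/common/annotations/de.xml'

with urllib.request.urlopen(URL) as f:
    root = ET.parse(f).getroot()

# CLDR stores the emoji name in <annotation cp="..." type="tts">;
# entries without type="tts" are keyword lists.
names = {
    node.get('cp'): node.text
    for node in root.iter('annotation')
    if node.get('type') == 'tts'
}
print(names.get('\U0001F34F'))  # e.g. 'grüner Apfel'
```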
We should also think about how we deal with future updates: the translations can change with every release. A translation from `release-38` might be different in `release-39`, so with the next update we might have a breaking change again.
We should add a note to the README about how we are going to deal with this in the future.
## How to handle the case when a string contains an emoji that is supported in 'en' but not in the selected language?
The `demojize()` method currently skips the emoji when it is not available in the selected language.
I think it is not obvious that this can happen. One would assume that after `demojize(string, language='xy')` the `string` does not contain any emoji anymore.
Example with `:empty_nest:`, which is in Unicode 14 and has not been translated into Spanish ('es') yet:
```python
>>> text_with_emoji_from_version_14 = emoji.emojize(':knife: :empty_nest:')
>>> emoji.demojize(text_with_emoji_from_version_14, language='en')
':knife: :empty_nest:'
>>> emoji.demojize(text_with_emoji_from_version_14, language='es')
':knife: \U0001fab9'
```
I don't think we need to change this, but we should add a note to the README that explains what can happen.
If we want to change it, it would be a breaking change as well, so now would be the time to do it.
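If we did decide to change it, one option would be to fall back to the English name for untranslated emoji. A purely hypothetical sketch (the function name and the double-pass approach are my own, not something the package provides):

```python
import emoji

def demojize_with_fallback(text: str, language: str = 'en') -> str:
    # First pass: translate the emoji that the selected language covers.
    # Second pass: the English data covers all emoji, so anything left
    # over gets its ':english_name:' instead of staying a raw code point.
    translated = emoji.demojize(text, language=language)
    return emoji.demojize(translated, language='en')

print(demojize_with_fallback(emoji.emojize(':knife: :empty_nest:'), language='es'))
```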