This project was a mix of challenges and learning as I navigated the CPython C API and worked closely with the NumPy community. I want to share a behind-the-scenes look at my work on introducing a new string DType in NumPy 2.0, mostly drawn from a recent talk I gave at SciPy. In this post, I'll walk you through the technical process, key design decisions, and the ups and downs I faced. Plus, you'll find tips on tackling mental blocks and insights into becoming a maintainer.
By the end, I hope you'll have the answers to these questions:
- What was wrong with NumPy string arrays before NumPy 2.0, and why did they need to be fixed?
- How did the community fund the work that fixed it?
- How did I become a NumPy maintainer in the midst of that?
- How did I begin working on the project that helped fix NumPy strings?
- What cool new feature did I add to NumPy?
A Brief History of Strings in NumPy
First, I'll begin with a brief history of strings in NumPy to explain how strings worked before NumPy 2.0 and why they were a little bit broken.
String Arrays in Python 2
Let's go back to Python 2.7 and look at how strings worked in NumPy before the Python 3 Unicode revolution. I actually compiled Python 2 in 2024 to create this post. It doesn't build on my ARM Mac, but it does compile on Ubuntu 22.04. Python 2 "strings" were what we now call byte strings in Python 3 – arrays of arbitrary bytes with no attached encoding. NumPy string arrays had similar behavior.
Python 2.7.18 (default, Jul 1 2024, 10:27:04)
>>> import numpy as np
>>> np.array(["hello", "world"])
array(['hello', 'world'], dtype='|S5')
>>> np.array(['hello', '☃'])
array(['hello', '\xe2\x98\x83'], dtype='|S5')
Let's say you create an array with the contents "hello" and "world": you can see it gets created with the DType 'S5'. So, what does that mean? It means it's a Python 2 string array with five characters, or five bytes, per element (characters and bytes are the same thing in Python 2).
It sort of works with Unicode if you squint at it. For instance, I wrote 'hello', '☃', and if you happen to know the UTF-8 bytes for the Unicode snowman, they are '\xe2\x98\x83'. So, it's just taking the UTF-8 bytes from my terminal and putting them straight into the array.
Here, we have the bytes in the Python 2 string array: the ASCII byte for 'h', the ASCII byte for 'e', and so on, and in the second element of the array are the UTF-8 bytes for the Unicode snowman. It's also important to know that for these fixed-width strings, if an element doesn't fill up the full width, NumPy just pads it to the end with zeros, which are null bytes.
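You can reproduce that null padding under Python 3 by using byte strings, which behave like Python 2 strings for this purpose; a quick sketch:

```python
import numpy as np

# Fixed-width 'S' arrays size every element to the widest entry and
# pad shorter entries with trailing null bytes.
arr = np.array([b"hi", b"hello"])
print(arr.dtype)      # |S5: five bytes per element
print(arr.tobytes())  # b'hi\x00\x00\x00hello'
```

The raw buffer shows b"hi" followed by three null bytes of padding before b"hello" begins.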
>>> arr = np.array([u'hello', u'world'])
>>> arr
array([u'hello', u'world'], dtype='<U5')
Python 2 also had a Unicode type, where you could create an array with the contents 'hello', 'world' as Unicode strings, and that creates an array with the DType 'U5'. This works, and it's exactly what Python 2 did with Unicode strings: each character is a UTF-32 encoded character, so four bytes per character.
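You can check the four-bytes-per-character claim directly from the dtype's itemsize (this runs the same way under Python 3):

```python
import numpy as np

arr = np.array([u"hello", u"world"])
# 'U5' means five UTF-32 code points per element, so each element
# occupies 5 characters * 4 bytes = 20 bytes.
print(arr.dtype)           # <U5 on a little-endian machine
print(arr.dtype.itemsize)  # 20
```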
String Arrays in Python 3
>>> arr = np.array(['hello', 'world'])
>>> arr
array(['hello', 'world'], dtype='<U5')
In Python 3, they made this the default, since Python 3 strings are Unicode strings. That was the pragmatic, simple decision, but I argue it was a bad decision, and here's why:
>>> arr.tobytes()
b'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
If we look at the bytes actually in the array, these are all ASCII characters, so they really only need one byte each, which means there's a bunch of wasted zeros in the array. You're using four times as much memory as is actually needed to store the data.
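A quick way to quantify that overhead for ASCII data is to compare the bytes NumPy stores against what a compact encoding like UTF-8 would need:

```python
import numpy as np

arr = np.array(["hello", "world"])
stored = arr.nbytes                                # 2 elements * 5 chars * 4 bytes = 40
needed = sum(len(s.encode("utf-8")) for s in arr)  # 10 bytes of actual text
print(stored, needed)  # 40 10: four times more memory than necessary
```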
There was also a performance problem. Before, string operations were written in C as, effectively, a Python for-loop over the elements of the array: for each element, NumPy would create a scalar, call the string operation on that scalar, and then stuff the result into the output array. As you can imagine, that's pretty slow. By rewriting the operations to loop over the array buffer without accessing each item as a scalar, you can make them anywhere from 500 times faster for small two-element arrays to two to five times faster for longer strings.
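In spirit, the old element-by-element pattern looked something like the sketch below; `upper_scalar_loop` is a hypothetical name of mine, not the actual implementation, and `np.char.upper` stands in for NumPy's buffer-level string operations:

```python
import numpy as np

arr = np.array(["hello", "world"] * 1000)

# The old approach, roughly: box each item as a scalar, call the
# string method on it, and write the result back one element at a time.
def upper_scalar_loop(a):
    out = np.empty_like(a)
    for i, s in enumerate(a):
        out[i] = s.upper()
    return out

# np.char.upper produces the same result but operates on whole arrays,
# avoiding the per-element scalar round trip.
print((upper_scalar_loop(arr) == np.char.upper(arr)).all())  # True
```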
Another thing people have done, and what they've defaulted to because of these issues with Unicode strings in NumPy, is to use object arrays.
>>> arr = np.array(
['this is a very long string', np.nan, 'another string'],
dtype=object
)
>>> arr
array(['this is a very long string', nan, 'another string'],
dtype=object)
You can create an array in NumPy with dtype=object, and it stores the Python strings and other Python objects you put into the array directly. These are references to Python objects. If we call np.isnan on the second element of the array, we get back np.True_ because that object is np.nan; the other elements are Python strings stored directly in the array.
>>> np.isnan(arr[1])
np.True_
>>> type(arr[0])
str
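To make the reference semantics concrete, here's a small sketch (the example values are mine, not from the talk): the object-array buffer holds pointers to Python objects, not the string data itself.

```python
import numpy as np

s = "a fairly long python string"
arr = np.array([s, s], dtype=object)

# Both elements point at the very same Python object...
print(arr[0] is arr[1])    # True
# ...and each array slot is just a pointer (8 bytes on a 64-bit build).
print(arr.dtype.itemsize)
```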