Dataclasses Explained (Part 1): English Original with Chinese Translation

資料類解釋(第 1 部分)- Dataclasses Explained (Part 1)

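Preface

This is the companion article to the video A Deep Dive into Python’s Dataclasses (Part 1), published by Dr. Fred on the MathByte Academy YouTube channel. I have only translated the original text so that readers can study the two versions side by side; the original is in the GitHub blog repository here. I recommend reading the article while watching the video, as the two work better together.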
A Deep Dive into Python’s Dataclasses (Part 1)

The goal of this article/video is to explain how data classes work, not just show you how to create data classes.

本文/影片的目標是 解釋 資料類別如何運作,不僅僅只是向您展示如何建立資料類別。

In this notebook we’ll explore dataclasses and their correspondence to code we might write when implementing classes using plain old vanilla Python instead of the dataclass syntax.

在本筆記本中,我們將探索資料類別以及它們與我們在使用普通舊版 Python 而不是資料類別語法實作類別時可能編寫的程式碼的對應關係。

So what are dataclasses?

那麼究竟什麼是資料類別 (Data Class)?

Dataclasses were introduced to Python 3.7, but what are they?

它其實是在 Python 3.7 的時候所引入的資料類別物件,但它們又是什麼?

Some new type of data structure? Some new type of object?

是某種新型的資料結構呢?還是是泛指某種新類型的物件呢?

The answer to that is no.

(針對上述)答案(的推測)都是否定的。

A dataclass is simply a code generator that allows us to define custom classes using a different syntax, and allows us to generate what is often referred to as “boilerplate” code - code that is repetitive and basically always works the same way. Essentially a dataclass is a class decorator that can either monkey patch an existing class, or, when slots are involved, generates a new class based on the old one, with extra functionality injected.

資料類別(dataclass)只是一個程式碼生成器,它讓我們能以不同的語法定義自訂類別,並替我們產生通常稱為「樣板程式碼(boilerplate code)」的程式碼,也就是重複性高、運作方式基本上總是相同的程式碼。本質上,dataclass 是一個類別裝飾器(decorator):它可以直接以猴子補丁(monkey patch)的方式修改現有的類別,或者在涉及插槽(slots)時,以舊類別為基礎產生一個新的類別,並注入額外的功能。
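To make the "code generator" idea concrete, the following minimal sketch compares the names defined on a class before and after the decorator is applied. The class name `Plain` is made up for illustration, and the exact set of injected names varies between Python versions, so only the three best-known methods are checked.

```python
from dataclasses import dataclass


class Plain:
    x: int = 0
    y: int = 0


# Names defined directly on the class before decoration.
before = set(vars(Plain))

# Apply the decorator by hand instead of using the @ syntax.
Decorated = dataclass(Plain)

# Names the decorator injected into the class namespace.
added = set(vars(Decorated)) - before

# The generated __init__, __repr__ and __eq__ are among the additions.
print(sorted(added & {"__init__", "__repr__", "__eq__"}))
# ['__eq__', '__init__', '__repr__']
```

None of these methods were typed by hand; they were generated from the annotated class attributes.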

You’ve seen code generators before if you’ve worked with named tuples, either namedtuple in the collections module, or the more modern NamedTuple class in the typing module.

如果您使用過具名元組(named tuples),無論是之前存在於 collections 模組中的 ‘namedtuple’ 又或是更新的 typing 模組中的’NamedTuple’,那麼其實您之前已經見過(這些)程式碼生成器。
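As a quick reminder of what those two code generators look like, here is a small sketch. The names `PointA` and `PointB` are hypothetical; both end up as generated subclasses of `tuple`.

```python
from collections import namedtuple
from typing import NamedTuple

# Function-call syntax from the collections module.
PointA = namedtuple("PointA", ["x", "y"])


# Class syntax from the typing module.
class PointB(NamedTuple):
    x: int
    y: int


a = PointA(1, 2)
b = PointB(1, 2)

print(a)  # PointA(x=1, y=2)
print(b)  # PointB(x=1, y=2)

# Both generated classes are specializations of tuple.
print(issubclass(PointA, tuple), issubclass(PointB, tuple))  # True True
```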

Before you start using dataclasses, it is really important that you understand how to create your own classes in Python, and how to implement things like equality, hashing, ordering, etc. Although dataclasses hide all this from you, you should know how these things work in order to truly understand what dataclasses are creating for you, and avoid subtle bugs you may create by using dataclasses without understanding what’s happening under the hood.

在開始使用資料類別之前,了解如何在Python 中創建自己的類別物件,以及如何實現相等(equality)、雜湊(Hashing)、排序(ordering)等功能非常重要。雖然資料類別向您隱藏了所有這些(功能),但你應該要知道這些東西是如何被實現的,而為了真正理解資料類別正在為您創建什麼,並避免在不了解其背景執行的情況下使用資料類別時所可能導致的微妙錯誤。

The PEP for dataclasses (PEP-0557) writes this:

資料類別的 PEP (PEP-0557)文件是這樣寫的:

Although they use a very different mechanism, Data Classes can be thought of as “mutable namedtuples with defaults”. Because Data Classes use normal class definition syntax, you are free to use inheritance, metaclasses, docstrings, user-defined methods, class factories, and other Python class features.

儘管它們使用非常不同的機制,但資料類別可以被認為是「具有預設值的可變具名元組(mutable namedtuple)」。由於資料類別使用普通的類別定義語法,因此您可以自由使用繼承、元類別(meta class)、文件字串(docstrings)、自訂方法、類別工廠和其他 Python 類別功能。

and

以及

A class decorator is provided which inspects a class definition for variables with type annotations as defined in PEP 526, “Syntax for Variable Annotations”. In this document, such variables are called fields. Using these fields, the decorator adds generated method definitions to the class to support instance initialization, a repr, comparison methods, and optionally other methods as described in the Specification section. Such a class is called a Data Class, but there’s really nothing special about the class: the decorator adds generated methods to the class and returns the same class it was given.

提供了一個類別裝飾器,它會檢查類別定義中帶有 PEP 526「變數註解語法」所定義之型別註解的變數。在本文件中,此類變數稱為欄位。使用這些欄位,裝飾器會將產生的方法定義加入類別中,以支援實例初始化、repr、比較方法,以及規範章節中描述的其他可選方法。這樣的類別稱為資料類別,但該類別實際上沒有什麼特別之處:裝飾器將產生的方法加入該類別,並傳回原本傳入的同一個類別。
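The PEP's point that only annotated variables count as fields can be checked with a short sketch. The class `Example` and its `label` attribute are invented for this illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Example:
    x: int = 0        # annotated: becomes a field
    y: int = 0        # annotated: becomes a field
    label = "circle"  # no annotation: a plain class attribute, not a field


print([f.name for f in fields(Example)])  # ['x', 'y']
print(Example(1, 2))                      # Example(x=1, y=2)
print(Example.label)                      # circle
```

Because `label` has no annotation, it is ignored by the generated `__init__` and `__repr__`.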

資料類別和屬性庫 / Dataclasses and the attrs Library

A bit of history.

(補充)一點歷史:

One of the inspirations for dataclasses is the attrs library started by Hynek Schlawack in 2015.

資料類別的靈感來源之一是 attrs 函式庫,由 Hynek Schlawack 於 2015 年發起。

https://www.attrs.org/en/stable/index.html

The attrs library became very popular around 2017, and people started asking for this to be included in the canonical Python standard library.

這屬性庫在 2017 年左右變得非常流行,人們開始要求將其包含在規範的 Python 標準庫中。

Discussions started on this, prompted by Python’s then BDFL, Guido, between Hynek and Eric Smith who volunteered for the project.

在 Python 當時的 BDFL Guido 的推動下,Hynek 和自願參與該計畫的 Eric Smith 之間開始了對此的討論。

In the end, instead of rolling attrs into the standard library, a simplifying subset of attrs was added to the standard library, and became known as dataclasses.

(討論)最終,attrs 並沒有被整個併入標準庫,而是將 attrs 的一個簡化子集加入標準庫,並稱之為資料類別。

You can see the PEP for it here

你可以從這裡查到它的 PEP 提案文件

Does this mean that attrs is no longer relevant today?

這是否意味著 attrs 如今已不再重要了呢?

Not at all!

一點也不! (恰恰相反)

In fact attrs is very much under continued and very active development, and additional libraries, leveraging attrs are also under continuous development (for example the cattrs library).

事實上,attrs 仍在持續且非常活躍地開發中,而基於 attrs 的其他函式庫(例如 cattrs)也在持續開發。

Here is an interesting post by Hynek on attrs and dataclasses:
https://hynek.me/articles/import-attrs/

這裡有一篇 Hynek 發表的關於屬性和資料類別的有趣貼文:https://hynek.me/articles/import-attrs/

A big thanks to Hynek Schlawack and the attrs repo collaborators for their continued dedication to attrs!!

非常感謝 Hynek Schlawack 和 attrs repo 協同開發者在 attrs 上的持續奉獻!!

Now, let’s start digging into dataclasses.

現在,讓我們開始深入研究資料類別。

I am not going to discuss the differences/similarities of dataclasses vs named tuples vs pydantic vs attrs - I might do another post on that in the future. Simplistically, there is some overlap, but also a lot of differences between them, and they have different use cases.

我不會討論資料類別之於 具名元組 或 pydantic 以及 attrs 的同異之處 - 我(Dr. Fred)將來可能會就此發表另一篇文章。簡單地說,它們之間有一些重疊,但也有很多差異,並且它們有不同的用例。

I see a lot of click-bait articles/videos out there that spell out how one killed the other - and that’s all it is, click bait. Those people either don’t truly understand what each of those things is, or are being disingenuous, and, unfortunately, ultimately cause confusion.

我看到很多誘導點擊的文章和影片,大談其中某一個如何「消滅」了另一個,但那不過是誘導點擊的標題而已。這些作者要嘛並不真正理解這些東西各自是什麼,要嘛就是言不由衷,而不幸的是,最終都造成了讀者的混淆。

基礎 / The Basics

Note: I am using Python 3.11 for these examples. Earlier versions of Python may not have all the functionality presented here, as dataclasses are (slowly) evolving and gaining additional functionality from one Python version to the next. That’s one advantage of using attrs over dataclasses - attrs has not only a lot more functionality than dataclasses, but evolves faster since it is not bound to the Python release cycle.

註解:我在這些範例中使用的是 Python 3.11。較早版本的 Python 可能不具備此處介紹的所有功能,因為資料類別正在(緩慢地)演進,並隨著 Python 版本的更新逐步獲得新功能。這正是使用 attrs 而非資料類別的優點之一:attrs 不僅功能比資料類別多得多,演進速度也更快,因為它不受 Python 發布週期的約束。

As the name dataclasses would seem to indicate, dataclasses are classes used for data structures (similar in some ways to named tuples, except that dataclasses offer a lot more since they are regular Python classes, not specializations of the tuple class)

正如資料類別這個名字一樣,似乎表明資料類別是用於標示數據結構的類別物件(在某些方面與具名元組類似,除了資料類別提供了更多功能之外,再則它們是常規的 Python 類別,而不是特定(非常規)的元組類別)

Let’s take a look at how dataclasses can be used to generate standard Python classes, and see how much boilerplate code they eliminate.

讓我們來看看如何使用資料類別來創建標準的 Python 類別物件,並看看它們消減了多少的樣板程式碼。

As an example, let’s create a two-dimensional Circle class that needs attributes for its origin (x and y) and its radius. To avoid complications with float comparisons, I’m going to limit these attributes to integers (which is not totally insane given that integers would be just fine to draw such a circle on a screen anyway).

來看下面範例: 讓我們創建一個二維的圓形類別物件,要創建這樣(圓形)類別物件,我們會需要中心點座標(x, y)以及圓形的半徑(radius)等參數作為其類別屬性。而為了避免參數賦值因為其浮點位數的比較造成爭議(而模糊了焦點),因此我將把這些屬性限制為整數(這並不完全是無意義的,因為無論如何,(設定成)整數也同樣可以讓我們在螢幕上繪製這樣的圓形物件,並不會造成問題。)

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius
c = Circle()
c

Let’s add some functionality that we usually add (or should add) to our class.

讓我們來增加一些我們通常會在這樣類別物件中添加(或應該添加)的功能。

First, let’s have a custom __repr__

首先,是自訂的 __repr__

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
c1 = Circle(0, 0, 1)
c1
Circle(x=0, y=0, radius=1)

Now let’s see how we can do the same thing using a dataclass:

現在讓我們看看如何使用資料類別來做同樣的事情:

from dataclasses import dataclass
@dataclass
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1    
c2 = CircleD()
c1
Circle(x=0, y=0, radius=1)
c2
CircleD(x=0, y=0, radius=1)

Basically the dataclass gave us the __repr__ for “free”. We don’t have to type that code ourselves, and the less code we type the fewer bugs we are likely to introduce.

基本上,因為我們在類別物件使用 @dataclass 裝飾器宣告這個類別物件是一個資料類別物件時,這個資料類別就無條件給了我們這個類別賦予了 __repr__ 這個自帶的內建方法。我們不必自行多輸入程式碼來定義 __repr__,而輸入的程式碼越少,相應的可能引入的錯誤也就會越少。

Not only that, but dataclasses will generate that code using best practices - how many of us really use self.__class__.__qualname__? Many people (myself included I’ll confess) just use the hardcoded class name, maybe stretching it to self.__class__.__name__, when using __qualname__ is actually better. (I’ll let you do some web searches on your own to figure out why if you don’t know already).

不僅如此,資料類別還會依照最佳實踐來產生這段程式碼 - 我們當中有多少人真的會使用 self.__class__.__qualname__?許多人(我承認也包括我自己)只是把類別名稱寫死(hardcoded),頂多改用 self.__class__.__name__,然而使用 __qualname__ 其實更好。(如果您還不知道原因,就留給您自行上網搜尋了。)
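For readers who want a quick hint before searching, one case where the two attributes differ is a nested class. This is a minimal sketch with made-up class names `Outer` and `Inner`.

```python
class Outer:
    class Inner:
        def __repr__(self):
            # __qualname__ includes the enclosing class; __name__ does not.
            return f"{self.__class__.__qualname__}()"


print(Outer.Inner.__name__)      # Inner
print(Outer.Inner.__qualname__)  # Outer.Inner
print(repr(Outer.Inner()))       # Outer.Inner()
```

With `__qualname__`, the repr shows where the class actually lives, which a hardcoded name or `__name__` would not.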

Both classes work the same way as far as attribute access goes:

就屬性存取而言,這兩個類別的工作方式相同:

c1.x, c2.radius
(0, 1)
c1.x = 100
c2.radius = 100
c1, c2
(Circle(x=100, y=0, radius=1), CircleD(x=0, y=0, radius=100))

If you think back to the intro of this video, I pulled an excerpt from the PEP:

如果您回想一下該影片的介紹,我從 PEP 中摘錄了一段內容:

the decorator adds generated methods to the class and returns the same class it was given

裝飾器將生成的方法添加到類別中並返回給定的相同類別

Let’s test that out!

讓我們來測試一下!

class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1   
hex(id(CircleD))
'0x2c42d099b10'
c = CircleD()
repr(c)
'<__main__.CircleD object at 0x000002C42E2607A0>'
CircleD = dataclass(CircleD)
hex(id(CircleD))
'0x2c42d099b10'
repr(c)
'CircleD(x=0, y=0, radius=1)'

As you can see, @dataclass did not create a new object, it basically modified, at run time, a regular Python class by adding attributes and methods to the class.

如您所見,@dataclass 並沒有創建一個新的物件,它基本上是在運行時修改了一個常規的 Python 類別,並向該類別添加了屬性和方法。

There has been an update to dataclasses that allows us to use slots instead of a class instance dictionary for maintaining state - in these cases, a new class object is generated - it has to, since you cannot, once a class has been created, change whether you are using slots or not.

資料類別已經更新,允許我們使用插槽(slots)而不是類別實例的字典來維護狀態 - 在這些情況下,將生成一個新的類別物件 - 它必須這樣做,因為一旦創建了類別,就無法更改 - 是否使用插槽(slots)。

相等比較 Equality Comparisons

Something else we get for free is equality comparisons:

我們可以無償地獲得的另一個功能就是相等比較:

c3 = CircleD(1, 1, 5)
c4 = CircleD(1, 1, 5)
c3 == c4
True

Our custom class does not have that functionality. By default, custom classes will use identity (object’s id) to equality compare two instances:

我們自訂的類別物件並沒有這個功能。默認情況下,自訂的類別物件將使用身份(物件的 id)來比較兩個實例是否相等:

c1 = Circle(1, 1, 5)
c2 = Circle(1, 1, 5)
c1 == c2
False

That’s usually not what we want (it could be, but often not). This means that we need to override the __eq__ method:

這通常不是我們想要的(它可能是,但通常不是),這意味著我們(要相等比較)還需要去覆寫 __eq__ 方法:

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
        

And let’s try it out now:

現在讓我們來試試看:

c1 = Circle(0, 0, 1)
c2 = Circle(0, 0, 1)

c1 is c2, c1 == c2
(False, True)
可哈希性 Hashability

Great! But you should know that when we implement a custom __eq__ method, we often implement hashability as well (I discuss this in my deep dive series, all notebooks are freely available on github here, so I am not going to discuss the reason behind this here).

太棒了!但是您應該知道,當您實現自定義的 __eq__ 方法時,我們通常也會實現可哈希性(我在我 Deep Dive 系列課程中有討論過這一點,所有課程當時的筆記都可以在 Deep Dive Course 的 github 網站上自由的參考及使用。因此,我不打算在這裡討論其背後的原因)。

With our current implementation, the Circle class is not even hashable. (And neither is the dataclass)

根據我們目前的實現,Circle 類別甚至不可哈希(資料類別也是如此)。
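Both claims can be verified with a short sketch. The classes `WithEq` and `WithEqD` are stripped-down stand-ins for the article's Circle and CircleD, kept to one attribute for brevity.

```python
from dataclasses import dataclass


class WithEq:
    def __init__(self, x=0):
        self.x = x

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.x == other.x
        return NotImplemented


@dataclass
class WithEqD:
    x: int = 0


# Defining __eq__ without __hash__ makes Python set __hash__ to None,
# and the default (non-frozen) dataclass ends up in the same state.
print(WithEq.__hash__, WithEqD.__hash__)  # None None

for obj in (WithEq(), WithEqD()):
    try:
        hash(obj)
    except TypeError as ex:
        print(f"TypeError: {ex}")
```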

Ok, so let’s implement the __hash__ function, such that two Circle objects that are equal also have equal hashes:

好吧,讓我們實現 __hash__ 函數,以便相等的兩個 Circle 物件的哈希值也具有相等的哈希值:

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self.x = x
        self.y = y
        self.radius = radius

    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
c1 = Circle(0, 0, 1)
c2 = Circle(0, 0, 1)
c1 == c2, hash(c1) == hash(c2)
(True, True)

Much better - now we can even use instances of our Circle class as set elements or dictionary keys.

好多了 - 現在我們甚至可以將 Circle 類別的實例用作集合元素或字典鍵。

s = {Circle(), Circle()}
s
{Circle(x=0, y=0, radius=1)}

As you can see, because the Circle instances are hashable, and we have equality implemented, the set retained, as we would expect, only one unique element. Same thing with dictionaries:

如您所見,由於 Circle 實例是可哈希的,並且我們已經實現了相等性,因此集合保留了我們所期望的唯一元素。字典也是一樣的道理:

d = {Circle(): "circle"}
d
{Circle(x=0, y=0, radius=1): 'circle'}
d[Circle()] = 'custom circle'
d
{Circle(x=0, y=0, radius=1): 'custom circle'}

But there is an issue!

但是有個問題!

Normally, hashable objects should be immutable, otherwise we can run into issues - let’s see that.

通常,可哈希的物件應該是不可變的,否則我們可能會遇到問題 - 讓我們來看看。

c1 = Circle(0, 0, 1)
c2 = Circle(1, 1, 1)
d = {
    c1: "circle 1",
    c2: "circle 2"
}

d
{Circle(x=0, y=0, radius=1): 'circle 1',
 Circle(x=1, y=1, radius=1): 'circle 2'}

Now let’s mutate that second circle:

現在讓我們來改變第二個圓形物件:

c2.x = 0
c2.y = 0
c2
Circle(x=0, y=0, radius=1)

Now c1 and c2 are equal, and in fact also have equal hashes now:

現在 c1 和 c2 是相等的,實際上現在它們的哈希值也是相等的:

c1 == c2, hash(c1) == hash(c2)
(True, True)

And let’s look at our dictionary:

讓我們來看看我們的字典:

d
{Circle(x=0, y=0, radius=1): 'circle 1',
 Circle(x=0, y=0, radius=1): 'circle 1'}

Notice something weird? Looks like two duplicate entries in the dictionary - and sometimes we get other odd behavior depending on how we mutate the key objects. So, in general we really need immutability for dictionary keys.

注意到了嗎?字典中看起來有兩個重複的項目 - 有時候,我們會根據如何改變鍵物件而獲得其他奇怪的行為。因此,通常我們真的需要字典鍵的不可變性。
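Another kind of odd behavior is an entry that becomes unreachable by lookup. The sketch below uses a hypothetical one-attribute class `Key` rather than the article's Circle, so that the effect is easy to follow.

```python
class Key:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.value == other.value
        return NotImplemented

    def __hash__(self):
        return hash(self.value)


k = Key(1)
d = {k: "one"}

# Mutating the key changes its hash after it was stored in the dictionary.
k.value = 2

print(k in d)        # False: the lookup uses the new hash
print(Key(1) in d)   # False: the old hash matches, but the stored key no longer equals Key(1)
print(k in list(d))  # True: the key object is still physically in the dictionary
```

The entry is still there, but no lookup by value can find it any more.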

不可變性 Immutability

To make our custom class implementation better we need to make the attributes used in the hash, x, y, and radius immutable.

為了使我們自訂的類別物件實現更好,我們需要使哈希中使用的屬性 x、y 和 radius 不可變。

Let’s add even more boilerplate code to our class:

讓我們為我們的類別物件添加更多的樣板程式碼:

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
c1 = Circle()
c2 = Circle(1, 1, 1)
d = {
    c1: "circle 1",
    c2: "circle 2",
}

d
{Circle(x=0, y=0, radius=1): 'circle 1', Circle(x=1, y=1, radius=1): 'circle 2'}

And now, if we try to mutate c2, we’ll get an exception:

現在,如果我們嘗試改變 c2,我們將會得到一個例外:

try:
    c2.x = 0
except AttributeError as ex:
    print(f"Attribute Error: {ex}")
Attribute Error: property 'x' of 'Circle' object has no setter

Again let me ask you this. How many times have you written a class that implements equality and hashability and forgotten to make read-only properties out of the attributes that are used in your hash function?

再次問您這個問題。您有多少次寫了一個類別物件來實現相等性和可哈希性,並忘記將用於哈希函數的屬性設定為唯讀屬性?

Even if we are aware of this, and take the trouble to try and make it difficult for users of our class to inadvertently create problems by mutating the object (like using read-only properties as we did here), it’s a lot of code - and more potential for typos and bugs. More unit testing too!

即使我們知道這一點,並且不厭其煩地設法讓類別的使用者難以因改變物件而無意中造成問題(例如像這裡一樣使用唯讀屬性),這仍然需要大量的程式碼,也就有更多出現打字錯誤和臭蟲的機會。需要的單元測試也更多!

Again, dataclasses can help us here by generating all this code for us. (Note that dataclasses do not use properties to make read-only attributes; rather, they override __setattr__ and __delattr__ to achieve a similar result. Although we could replicate that approach in our code quite easily, and would avoid writing properties, it involves hardcoding the attributes we want to be “frozen” into the __setattr__ and __delattr__ methods. This means every time we add or remove a frozen attribute, we need to remember to modify that code as well. While that works just fine for a code generator that re-builds the code every time the app is re-started (and the dataclass decorator is re-evaluated), it’s not great for human developers.)

同樣,資料類別可以透過為我們產生所有這些程式碼來幫助我們。(請注意,資料類別並不是使用 property 來建立唯讀屬性,而是覆寫 __setattr__ 和 __delattr__ 以達到類似的結果。雖然我們可以很容易地在自己的程式碼中複製這種做法,並省去撰寫 property,但這需要把想要「凍結」的屬性名稱寫死(hardcoding)在 __setattr__ 和 __delattr__ 方法中。這意味著每次新增或移除一個凍結的屬性時,我們都必須記得同步修改那段程式碼。對於每次重新啟動應用程式(並重新執行 dataclass 裝飾器)時都會重新產生程式碼的程式碼生成器來說,這完全沒有問題,但對於人類開發者來說就不太理想了。)
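To illustrate the hand-written alternative described above, here is a simplified sketch. It is not the code that dataclasses actually generate (which raises `FrozenInstanceError`); the class name `FrozenCircle` and the `_frozen` tuple are assumptions made for this example.

```python
class FrozenCircle:
    # Hardcoded list of frozen names: must be kept in sync with __init__ by hand.
    _frozen = ("x", "y", "radius")

    def __init__(self, x=0, y=0, radius=1):
        # Bypass our own __setattr__ while setting the initial state.
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)
        object.__setattr__(self, "radius", radius)

    def __setattr__(self, name, value):
        if name in self._frozen:
            raise AttributeError(f"cannot assign to field {name!r}")
        super().__setattr__(name, value)

    def __delattr__(self, name):
        if name in self._frozen:
            raise AttributeError(f"cannot delete field {name!r}")
        super().__delattr__(name)


c = FrozenCircle()
try:
    c.x = 100
except AttributeError as ex:
    print(f"AttributeError: {ex}")  # AttributeError: cannot assign to field 'x'
print(c.x)  # 0
```

Adding a fourth frozen attribute means touching both `__init__` and `_frozen`, which is exactly the maintenance burden the paragraph warns about.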

@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
c3 = CircleD()
c4 = CircleD(1, 1, 1)
c5 = CircleD()

c3, c4, c5
(CircleD(x=0, y=0, radius=1),
 CircleD(x=1, y=1, radius=1),
 CircleD(x=0, y=0, radius=1))
c3 == c5, c4 == c5
(True, False)

Again, equality still works just fine. But now our dataclass CircleD is both immutable and hashable (at least as far as the attributes x, y, and radius go).

同樣,相等性仍然運作正常。但是現在我們的資料類別 CircleD 同時是不可變的和可哈希的(至少就屬性 x、y 和 radius 而言)。

hash(c3), hash(c4), hash(c5)
(-1882636517035687140, 5750192569890809213, -1882636517035687140)
from dataclasses import FrozenInstanceError

try:
    c4.x = 0
except FrozenInstanceError as ex:
    print(f"FrozenInstanceError: {ex}")
FrozenInstanceError: cannot assign to field 'x'

This means that we can also safely use it in sets and dictionary keys as well.

這意味著我們也可以安全地將其用於集合和字典鍵。

s = {c3, c4, c5}
s
{CircleD(x=0, y=0, radius=1), CircleD(x=1, y=1, radius=1)}
排序 Ordering

One other thing we do not have in our custom Circle class is ordering:

我們自訂的 Circle 類別物件中還有一件事是我們沒有實現的,那就是排序:

c1 = Circle()
c2 = Circle(1, 1, 2)

try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<' not supported between instances of 'Circle' and 'Circle'

In order to implement ordering we need to override special functions such as __lt__, __le__, __gt__, __ge__

為了實現排序,我們需要覆寫特殊函數,例如 __lt__、__le__、__gt__、__ge__

Let’s do this for our custom class. Here, we will consider ordering based on whether the tuple (x, y, radius) of one circle is smaller or larger than the same tuple for the other circle. Does not make much sense, we would probably want something more custom - maybe based on the radius only, so we’ll come back to that later.

讓我們為我們的自訂類別物件做這件事。在這裡,我們將根據一個圓形的元組(x、y、radius)是否小於或大於另一個圓形的元組來考慮排序。這沒有太多意義,我們可能想要更多自訂的東西 - 也許只基於半徑,所以我們稍後會回來討論這個。

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)

Ok, so that’s a start:

好的,這是一個開始:

c1 = Circle(0, 0, 1)
c2 = Circle(1, 1, 1)

c1 < c2
True

Of course, through reflection we also get > for “free”:

當然,通過反射,我們也可以無償獲得 >

c2 > c1
True
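The reflection mechanism can be made visible by recording which method actually runs. In this sketch, the class `Version` and the `calls` list are hypothetical names used only for the demonstration.

```python
calls = []


class Version:
    def __init__(self, number):
        self.number = number

    def __lt__(self, other):
        # Record which instance had __lt__ invoked on it.
        calls.append(self.number)
        if self.__class__ == other.__class__:
            return self.number < other.number
        return NotImplemented


v1, v2 = Version(1), Version(2)

# Version defines no __gt__, so Python falls back to the reflected
# operation and evaluates v1.__lt__(v2) instead.
result = v2 > v1

print(result)  # True
print(calls)   # [1]
```

The `calls` list shows that `__lt__` ran on `v1`, the right-hand operand of `>`.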

But we have a few issues:

但是我們有一些問題:

  • we could compare a Circle instance to another data structure inadvertently, and actually get a result

  • 我們可能會無意中將 Circle 實例與另一個資料結構進行比較,並實際獲得結果

  • we still need to implement >=, <=, etc

  • 我們仍然需要實現 >=、<= 等

try:
    c1 <= c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<=' not supported between instances of 'Circle' and 'Circle'
from typing import NamedTuple

class CircleNT(NamedTuple):
    x: int = 0
    y: int = 0
    radius: int = 1
c1 = Circle()
c2 = CircleNT(1, 1, 1)

c1, c2
(Circle(x=0, y=0, radius=1), CircleNT(x=1, y=1, radius=1))
c1 < c2
True

This is probably not what we want - we should only allow comparisons between instances of the same class. In fact, that’s how our equality function works, both in our custom class and in our dataclass.

這可能不是我們想要的 - 我們應該只允許在同一類別的實例之間進行比較。實際上,這就是我們的相等函數如何工作的(原理),無論是在我們的自訂類別中(進行比較)還是在我們的資料類別中(進行比較)。

c1 = Circle()
c2 = CircleNT()

c1, c2
(Circle(x=0, y=0, radius=1), CircleNT(x=0, y=0, radius=1))
c2 == c1
False
c1 = CircleD()
c2 = CircleNT()

c1, c2
(CircleD(x=0, y=0, radius=1), CircleNT(x=0, y=0, radius=1))
c1 == c2
False

So, again, let me ask you this question. How often do you remember to also check the type in the order methods?

所以,讓我再問你這個問題。您有多少次記得在排序方法中也檢查類型 (確認類型是否相同)?

So, let’s fix our code to account for that.

class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
c1 = Circle()
c2 = CircleNT()

c1, c2
(Circle(x=0, y=0, radius=1), CircleNT(x=0, y=0, radius=1))
c1 == c2
False

Now, we need to implement the other ordering functions.

現在,我們需要實現其他排序函數。

We could try using the total_ordering decorator available in the functools module:

我們可以嘗試使用 functools 模組中提供的 total_ordering 裝飾器:

from functools import total_ordering
@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented

Let’s try it out:

讓我們來試試看:

c1 = Circle()
c2 = Circle(1, 1, 1)
c1 < c2
True
c1 <= c2
True
c2 >= c1
True
c2 > c1
True

What about the case where we aren’t comparing the same objects (but where they have the same attribute names):

那麼在我們不比較相同的物件(但它們具有相同的屬性名稱)的情況下呢:

c1 = Circle()
c2 = CircleNT(1, 1, 1)
try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<' not supported between instances of 'Circle' and 'CircleNT'
try:
    c1 <= c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<=' not supported between instances of 'Circle' and 'CircleNT'
try:
    c1 >= c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '>=' not supported between instances of 'Circle' and 'CircleNT'

Ok, so total_ordering worked for us, and saved us writing quite a lot of boilerplate code.

好的,所以 total_ordering 為我們工作,並為我們節省了編寫大量樣板程式碼的時間。

What about dataclasses? By default, dataclasses do not implement any ordering.

資料類別呢?默認情況下,資料類別不實現任何排序。

c1 = CircleD()
c2 = CircleD(1, 1, 1)
try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<' not supported between instances of 'CircleD' and 'CircleD'

But we can easily enable this in our dataclass this way:

但是我們可以通過以下方式輕鬆啟用此功能:

@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
c1 = CircleD()
c2 = CircleD(1, 1, 1)
c1 < c2, c1 <= c2, c2 > c1, c2 >= c1
(True, True, True, True)

And it will not support comparing to a different type:

它不支持與不同類型的物件進行比較:

c1 = CircleD()
c2 = CircleNT(1, 1, 1)
try:
    c1 < c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<' not supported between instances of 'CircleD' and 'CircleNT'

The only thing to note is how dataclasses define the ordering - it basically uses a tuple made up of the attributes defined in the dataclass, in the order in which they were declared (equality is implemented the same way too).

唯一需要注意的是資料類別如何定義排序 - 它基本上使用了一個元組,該元組由資料類別中定義的屬性組成,並按照它們被宣告的順序排列(相等性也以相同的方式實現)。
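A small sketch shows how the declaration order drives the comparison. The two classes `XFirst` and `RadiusFirst` are invented for this illustration and differ only in the order of their fields.

```python
from dataclasses import dataclass


@dataclass(order=True)
class XFirst:
    x: int
    radius: int


@dataclass(order=True)
class RadiusFirst:
    radius: int
    x: int


# Compared as (x, radius): (0, 10) < (1, 1) is True.
print(XFirst(x=0, radius=10) < XFirst(x=1, radius=1))  # True

# Same values, compared as (radius, x): (10, 0) < (1, 1) is False.
print(RadiusFirst(radius=10, x=0) < RadiusFirst(radius=1, x=1))  # False
```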

This means that you are limited in terms of how to define a custom ordering by default. We’ll come back to that in a bit.

這意味著您在默認情況下將受到如何定義自定義排序的限制。我們稍後會回來討論這個問題。

針對字典與元組進行序列化 (Serializing to Dictionaries and Tuples)

Something that can be convenient sometimes, is the ability to extract the attribute values of an instance of our class into a dictionary (where the keys are the attribute names), or even into a tuple (ordered in some specific way).

有時候,能夠將類別實例的屬性值提取到字典中(其中鍵是屬性名稱),甚至提取到元組中(以某種特定的方式排序),會非常方便。

Dataclasses have that built-in as well.

資料類別也內建了這個功能。

from dataclasses import asdict, astuple
c1 = CircleD()
asdict(c1)
{'x': 0, 'y': 0, 'radius': 1}
astuple(c1)
(0, 0, 1)

If we wanted something similar in our custom class, we would have to write that code ourselves.

如果我們想在我們自訂的類別物件中實現類似的功能,我們必須自己編寫該程式碼。

Let’s do it.

讓我們來做這件事。

@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
    
    def asdict(self):
        return {
            'x': self.x,
            'y': self.y,
            'radius': self.radius
        }
    
    def astuple(self):
        return self.x, self.y, self.radius

Now this works, but is rather simplistic, and means that if we ever add/remove attributes from the class we also need to update the asdict and astuple methods - not the end of the world, but can easily lead to bugs.

現在這個方法可以運作,但是它相當簡陋,而且這意味著如果我們在類別中新增或移除屬性,也需要同步更新 asdict 和 astuple 方法 - 這並不是世界末日,但很容易導致錯誤。

c1 = Circle()
c1.asdict()
{'x': 0, 'y': 0, 'radius': 1}
c1.astuple()
(0, 0, 1)

We could certainly start writing more complicated code to be more generic, but we’ll have to deal with introspection, decide what attributes to include, etc - not simple!

我們當然可以開始編寫更複雜的程式碼來更通用,但是我們必須先處理: 欄位的自我檢查(introspection)、決定要包含哪些屬性等等的程序 - 這並不簡單!
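As a rough idea of what a more generic version might involve, here is one possible sketch that discovers the properties defined directly on the class. The mixin name `PropertyIntrospectionMixin` and its helper are assumptions of this example, and the approach ignores inheritance entirely.

```python
class PropertyIntrospectionMixin:
    @classmethod
    def _property_names(cls):
        # vars(cls) preserves definition order; inherited properties are not seen.
        return [name for name, value in vars(cls).items()
                if isinstance(value, property)]

    def asdict(self):
        return {name: getattr(self, name) for name in self._property_names()}

    def astuple(self):
        return tuple(getattr(self, name) for name in self._property_names())


class Circle(PropertyIntrospectionMixin):
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x

    @property
    def y(self):
        return self._y

    @property
    def radius(self):
        return self._radius


c = Circle()
print(c.asdict())   # {'x': 0, 'y': 0, 'radius': 1}
print(c.astuple())  # (0, 0, 1)
```

Even this short version already makes a decision about what to include: any computed property added later (an area, say) would silently show up in the output, which is the kind of choice the paragraph above refers to.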

欄位的自我檢查 (Fields Introspection)

Speaking of introspection, dataclasses come through on that front too!

說到(內部屬性或欄位的)自我檢查,資料類別也在這方面做得很好!

from dataclasses import fields
c1 = CircleD()
for field in fields(c1):
    print(field, end='\n---------------------\n')
Field(name='x',type=<class 'int'>,default=0,default_factory=<dataclasses._MISSING_TYPE object at 0x000002C42B33F0B0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
---------------------
Field(name='y',type=<class 'int'>,default=0,default_factory=<dataclasses._MISSING_TYPE object at 0x000002C42B33F0B0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
---------------------
Field(name='radius',type=<class 'int'>,default=1,default_factory=<dataclasses._MISSING_TYPE object at 0x000002C42B33F0B0>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)
---------------------
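The Field objects can also be consumed programmatically rather than printed. This sketch builds a small summary of each field's name, type, and default; the `summary` dictionary is just an illustrative structure.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1


# fields() accepts either the class or an instance.
summary = {f.name: (f.type, f.default) for f in fields(CircleD)}

for name, (type_, default) in summary.items():
    print(f"{name}: type={type_.__name__}, default={default}")
# x: type=int, default=0
# y: type=int, default=0
# radius: type=int, default=1
```

This is the kind of information the generated methods, as well as asdict and astuple, rely on.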
資料類別與自訂類別間的比較 (Comparing Dataclass to Custom Class Code So Far)

Here’s our data class:

這是我們的資料類別:

@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

And our custom class:

和我們自訂的類別物件:

@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
    
    def asdict(self):
        return {
            'x': self.x,
            'y': self.y,
            'radius': self.radius
        }
    
    def astuple(self):
        return self.x, self.y, self.radius

So, for very little effort we get quite a lot of functionality out of dataclasses. And what we have looked at so far will probably cover 80% of anything you’d ever need out of dataclasses (yes, I made that number up!).

因此,我們只需付出很少的努力,就可以從資料類別中獲得很多功能。到目前為止,我們所看到的內容可能會涵蓋您從資料類別中所需要的 80% 的功能(是的,我是隨便說的這個數字!)。
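
As a quick sanity check of that claim, this sketch exercises the three-field dataclass on each behavior the hand-written class had to implement itself. It uses only the standard library; asdict and astuple here are the module-level functions from dataclasses rather than methods on the class.

```python
from dataclasses import FrozenInstanceError, asdict, astuple, dataclass


@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1


c1 = CircleD(1, 2, 3)
c2 = CircleD(1, 2, 4)

print(c1)                                   # generated __repr__
print(c1 == CircleD(1, 2, 3))               # generated __eq__, field by field
print(c1 < c2)                              # generated ordering, compared as tuples
print(hash(c1) == hash(CircleD(1, 2, 3)))   # hashable, since frozen=True and eq=True
print(asdict(c1), astuple(c1))

try:
    c1.radius = 10                          # frozen instances reject assignment
except FrozenInstanceError as ex:
    print(f"FrozenInstanceError: {ex}")
```
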

對資料類別加入方法和屬性 Adding Methods and Properties to Dataclasses

Remember that a dataclass is just a code generator that generates a standard Python class. This means that we can add additional properties and methods, and even override special dunder methods however we want.

記住,資料類別只是一個代碼生成器,它會生成一個標準的 Python 類別。這意味著我們可以選擇添加其他屬性、方法,甚至以任何我們想要的方式覆寫特殊的 dunder 方法。

from math import pi

@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
c = CircleD()
c.area, c.circumference()
(3.141592653589793, 6.283185307179586)
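
One detail worth seeing in code: a property added this way is not a dataclass field, because only annotated class attributes become fields. The sketch below assumes the same CircleD shape as above and checks that area stays out of the field list and out of asdict.

```python
from dataclasses import asdict, dataclass, fields
from math import pi


@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

    @property
    def area(self):
        return pi * self.radius ** 2


c = CircleD(radius=2)
field_names = [f.name for f in fields(c)]

print(field_names)           # only x, y and radius are fields
print("area" in asdict(c))   # the property is not part of the dict form
print(c.area)                # but it is still available on the instance
```

So the computed value takes no part in the generated __init__, __repr__, equality or ordering.
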
自訂排序 Custom Ordering

We can even override the special dunder methods. Let’s go back to our ordering of the Circles.

我們甚至可以覆寫特殊的 dunder 方法。讓我們回到我們的圓形物件的排序。

We saw how the default ordering that dataclasses defined for our Circles was not ideal. Let’s say I really want to define ordering between circles in one of two ways:

我們看到資料類別為我們的圓形物件定義的默認排序並不理想。假設我真的想以以下兩種方式之一定義圓形物件之間的排序:

  • based on the radius only

  • 僅基於半徑

  • based on the distance from the origin

  • 基於從原點的距離

There is no way (currently, that I know of) to completely customize a sort order key function in dataclasses.

目前(據我所知),沒有辦法完全自定義資料類別中的排序鍵函數。

You can specify whether a field should be included or not in the comparison tuple though - so doing a comparison based only on the radius would be possible using the native functionality of dataclasses. The second sort option would not, and you therefore need to implement your own __lt__, __le__, etc.

但是,您可以指定是否應在比較元組中包含某個欄位 - 因此,使用資料類別的原生功能,就可以做到僅基於半徑的比較。第二個排序選項則做不到,因此您需要自己實作 __lt__、__le__ 等方法。
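
As a brief preview of the first option (ordering by radius only), here is a minimal sketch using field(compare=False) to drop x and y from the comparison tuple. The class name CircleR is hypothetical, chosen only to avoid clashing with CircleD.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True, order=True)
class CircleR:  # hypothetical name for this sketch
    x: int = field(default=0, compare=False)
    y: int = field(default=0, compare=False)
    radius: int = 1


far_but_small = CircleR(100, 100, 1)
near_but_big = CircleR(0, 0, 5)

print(far_but_small < near_but_big)          # True: only radius is compared
print(CircleR(0, 0, 1) == CircleR(9, 9, 1))  # True: == ignores x and y as well
```

The second line shows the catch: compare=False removes the field from equality too, not just from ordering, so two circles at different positions compare equal.
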

You can, however, completely customize the sort key in the more powerful attrs library (using cmp_using).

但是,您可以在功能更強大的 attrs 函式庫中完全自定義排序鍵(使用 cmp_using)。

Let’s implement a custom sort order based on the distance from the origin (we’ll circle back to the other sort order, which is easy to achieve using native dataclasses functionality).

讓我們實現一個基於從原點的距離的自定義排序(我們將回到另一個排序,該排序可以使用本機資料類別功能輕鬆實現)。

from math import pi, dist

@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented

We now should have < implemented:

我們現在應該已經實現了 <

c1 = CircleD(2, 2, 10)
c2 = CircleD(3, 3, 100)
c1 < c2
True

Of course we need to implement the other methods too:

當然,我們也需要實現其他方法:

try:
    c1 <= c2
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: '<=' not supported between instances of 'CircleD' and 'CircleD'

We can actually do this using the total_ordering decorator!

我們實際上可以使用 total_ordering 裝飾器來做到這一點!

@total_ordering
@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented
c1 = CircleD(2, 2, 10)
c2 = CircleD(3, 3, 100)

c1 <= c2
True

So, the question I have here, is in which order should we decorate with total_ordering? The way I have it here? Or this way?

所以,我在這裡的問題是,我們應該以哪種順序裝飾 total_ordering?我在這裡的方式?還是這樣?

@dataclass(frozen=True)
@total_ordering
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented

Both approaches seem to work, but because dataclasses is a code generator, and is actually opaque, I have no real idea if one way may or may not be better.

這兩種方法似乎都可以,但是因為資料類別是一個代碼生成器,並且實際上是不透明的,所以我不知道哪種方式可能更好。

My reasoning is that the second approach is preferable. Suppose I did not use that total_ordering decorator - I would want to define all the __lt__, __le__, etc. methods in my class that then gets decorated with @dataclass.

我的推理是第二種方法更好。假設我沒有使用 total_ordering 裝飾器 - 我會想在我的類別中定義所有的 __lt__、__le__ 等方法,然後再使用 @dataclass 裝飾。

So, my reasoning is that I should first apply the @total_ordering decorator, and then apply the @dataclass decorator. If you know different, please let us know in the comments!

因此,我的推理是我應該先應用 @total_ordering 裝飾器,然後再應用 @dataclass 裝飾器。如果您知道不同的方法,請在評論中讓我們知道!
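
One way to probe the question is to build both variants and compare their behavior. The sketch below uses hypothetical class names (OuterTotalOrdering, InnerTotalOrdering, Conflict) and a trimmed-down circle. It relies on my understanding that the comparisons total_ordering derives call == when the comparison runs, so the __eq__ that @dataclass adds is picked up in either order.

```python
from dataclasses import dataclass
from functools import total_ordering
from math import dist


@total_ordering
@dataclass(frozen=True)
class OuterTotalOrdering:  # total_ordering applied last
    x: int = 0
    y: int = 0
    radius: int = 1

    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented


@dataclass(frozen=True)
@total_ordering
class InnerTotalOrdering:  # total_ordering applied first
    x: int = 0
    y: int = 0
    radius: int = 1

    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented


results = {}
for cls in (OuterTotalOrdering, InnerTotalOrdering):
    a, b = cls(1, 1, 5), cls(2, 2, 1)
    results[cls.__name__] = (a <= b, a >= b, a <= cls(1, 1, 5))
print(results)  # both variants give the same three answers

# The order only starts to matter once order=True is requested:
# @dataclass then refuses to overwrite a comparison method we wrote.
try:
    @dataclass(order=True)
    class Conflict:
        x: int = 0

        def __lt__(self, other):
            return NotImplemented
except TypeError as ex:
    conflict_message = str(ex)
    print(f"TypeError: {conflict_message}")
```

With order left at its default of False, the two decorator orders look interchangeable for this class, which matches the observation that both approaches seem to work.
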

備註 NOTE:

With all these examples we looked at for customizing the sort order, there is actually a problem - and that’s because our equality definition and ordering comparisons (<, etc) are not really compatible!

在我們為自定義排序所看到的所有這些示例中,實際上存在一個問題 - 那就是我們的相等定義和排序比較(<等)實際上並不兼容!

Take a look at this example:

看看這個例子:

c1 = CircleD(1, 1, 10)
c2 = CircleD(1, 1, 20)

From a sorting perspective we would consider these two circles to be equal - but from an actual equality standpoint they are obviously not equal since the radius is different:

從排序的角度來看,我們會認為這兩個圓形是相等的 - 但從實際的相等性來看,它們顯然是不相等的,因為半徑是不同的:

c1 == c2
False
c1 <= c2
False

Hmm… That’s obviously wrong!

嗯… 這顯然是錯誤的!
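
To see where that False comes from, it helps to spell out the two ingredients that the derived <= combines. This is a sketch of the effect rather than of total_ordering's actual source: the "less than" part comes from our __lt__, and the "equal" part comes from the dataclass-generated __eq__.

```python
from dataclasses import dataclass
from functools import total_ordering
from math import dist


@total_ordering
@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1

    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return dist((0, 0), (self.x, self.y)) < dist((0, 0), (other.x, other.y))
        return NotImplemented


c1 = CircleD(1, 1, 10)
c2 = CircleD(1, 1, 20)

lt_part = c1 < c2    # False: both circles are the same distance from the origin
eq_part = c1 == c2   # False: field-by-field equality sees different radii

print(lt_part or eq_part)  # what the derived <= effectively computes
print(c1 <= c2)            # same answer
```

The two halves use different notions of "equal" (distance versus all fields), which is why neither one is true here.
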

To fix this we really should not use @total_ordering, since it uses __eq__ for the equality part of the comparisons - which means we’ll need to define all the ordering special functions ourselves. Even more code!! For simplicity, and to keep this video down to about an hour I did not do this - so here’s the actual code we would need to fix this issue (and would be needed for both the dataclass and the regular custom class of course).

要修復這個問題,我們真的不應該使用 @total_ordering,因為它使用 __eq__ 來進行比較的相等性部分 - 這意味著我們需要自己定義所有的排序特殊函數。更多的程式碼!!為了簡單起見,並且為了讓這個影片的長度保持在一個小時左右,我沒有這樣做 - 因此,這裡是我們需要修復此問題的實際程式碼(當然,資料類別和常規自訂類別都需要)。

@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
        
    @property
    def area(self):
        return pi * self.radius ** 2
    
    def circumference(self):
        return 2 * pi * self.radius
    
    @staticmethod
    def _dist_from_origin(c):
        return dist((0, 0), (c.x, c.y))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) < self._dist_from_origin(other)
        return NotImplemented
    
    def __le__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) <= self._dist_from_origin(other)
        return NotImplemented
    
    def __gt__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) > self._dist_from_origin(other)
        return NotImplemented
    
    def __ge__(self, other):
        if self.__class__ == other.__class__:
            return self._dist_from_origin(self) >= self._dist_from_origin(other)
        return NotImplemented

Now we should get more consistent results:

現在我們應該會得到更一致的結果:

c1 = CircleD(1, 1, 10)
c2 = CircleD(1, 1, 20)

The circles are still not equal:

這些圓形仍然不相等:

c1 == c2
False

But the ordering comparisons work more as expected:

但排序比較的結果更符合預期:

c1 < c2
False
c1 <= c2
True
c1 > c2
False
c1 >= c2
True
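
If the ordering is only ever needed for sorting, a key function passed to sorted() is a lighter alternative that leaves the class free of ordering dunders. This is a different technique from the one above, shown only for comparison; dist_from_origin is a hypothetical helper name.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1


def dist_from_origin(c):
    """Hypothetical helper: Euclidean distance of the circle's center from (0, 0)."""
    return dist((0, 0), (c.x, c.y))


circles = [CircleD(3, 3, 1), CircleD(1, 1, 20), CircleD(2, 2, 10)]

by_distance = sorted(circles, key=dist_from_origin)
by_radius = sorted(circles, key=lambda c: c.radius)

print(by_distance)
print(by_radius)
```

Both sort orders from the earlier list coexist here without touching equality, at the cost of not supporting c1 < c2 directly.
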

關鍵字參數初始化 (Keyword-Only Initializer Arguments)

When we write a custom class, we can customize our __init__ method to require certain arguments to be keyword-only.

當我們編寫自訂的類別物件時,我們可以自定義我們的 __init__ 方法,以要求某些引數僅為關鍵字引數。

We might want to do something like this for our circle class (not sure why we would want to do this in this example, but this is just to illustrate things):

我們可能想要為我們的圓形類別物件做這樣的事情(不確定為什麼我們要在這個例子中這樣做,但這只是為了說明事情):

@total_ordering
class Circle:
    def __init__(self, x: int = 0, y: int = 0, *, radius: int = 1):
        self._x = x
        self._y = y
        self._radius = radius

    @property
    def x(self):
        return self._x
    
    @property
    def y(self):
        return self._y
    
    @property
    def radius(self):
        return self._radius
    
    def __repr__(self):
        return f"{self.__class__.__qualname__}(x={self.x}, y={self.y}, radius={self.radius})"
    
    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) == (other.x, other.y, other.radius)
        return NotImplemented
    
    def __hash__(self):
        return hash((self.x, self.y, self.radius))
    
    def __lt__(self, other):
        if self.__class__ == other.__class__:
            return (self.x, self.y, self.radius) < (other.x, other.y, other.radius)
        return NotImplemented
    
    def asdict(self):
        return {
            'x': self.x,
            'y': self.y,
            'radius': self.radius
        }
    
    def astuple(self):
        return self.x, self.y, self.radius

So now, the only way to specify a custom radius is to pass it as a named argument:

因此,現在指定自訂半徑的唯一方法是將其作為命名引數傳遞:

c = Circle(radius=2)
c
Circle(x=0, y=0, radius=2)

And we can no longer do this:

我們也不能這樣做:

try:
    Circle(0, 0, 2)
except TypeError as ex:
    print(f"TypeError:{ex}")
TypeError:Circle.__init__() takes from 1 to 3 positional arguments but 4 were given

We can achieve the same thing in dataclasses by placing the KW_ONLY sentinel (available since Python 3.10) in our attribute declarations to mark the boundary between positional and keyword-only arguments (just like the * did in our __init__ method’s definition).

我們可以通過在屬性宣告中使用某些東西來實現相同的功能,在屬性宣告中指示位置引數和僅關鍵字引數之間的邊界(就像 __init__ 方法定義中的 * 一樣)。

from dataclasses import KW_ONLY
@dataclass(frozen=True, order=True)
class CircleD:
    x: int = 0
    y: int = 0
    _: KW_ONLY
    radius: int = 1

And now we get the same functionality:

現在我們可以獲得相同的功能:

c = CircleD(0, 0, radius=2)
try:
    CircleD(0, 0, 2)
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: CircleD.__init__() takes from 1 to 3 positional arguments but 4 were given

If we wanted to make all the arguments in our __init__ keyword-only arguments, it’s even simpler:

如果我們想要使我們 __init__ 中的所有引數都成為關鍵字引數,那就更簡單了:

@dataclass(frozen=True, order=True, kw_only=True)
class CircleD:
    x: int = 0
    y: int = 0
    radius: int = 1
c = CircleD(x=0, y=0, radius=1)
try:
    CircleD(0, y=0, radius=1)
except TypeError as ex:
    print(f"TypeError: {ex}")
TypeError: CircleD.__init__() takes 1 positional argument but 2 positional arguments (and 2 keyword-only arguments) were given

Why does this error message state that __init__ takes 1 positional argument and 2 were provided?

為什麼這個錯誤訊息說 __init__ 接受 1 個位置引數,而我們提供了 2 個呢?

It should take none, and we only gave it one.

它應該不接受任何引數,而我們只給了它一個。

Don’t forget that any method in a class always starts with one positional argument because the function is a bound method - that self argument. So it expects one (the bound object, which Python supplies for us), and since we also passed in 0 as a positional argument, it ends up with two positional arguments, when only one is allowed.

不要忘記類別中的任何方法始終以一個位置引數開始,因為該函數是一個綁定方法 - 也就是 self 引數。因此它期望一個(綁定的物件,Python 為我們提供),因為我們還將 0 作為位置引數傳遞,所以它最終有兩個位置引數,而只允許一個。

資料類別與具名元組在資源使用與效能上的比較 (Resource Utilization / Performance - Dataclasses vs NamedTuples)

I stated right in the beginning that I would not compare and contrast dataclasses, named tuples, attrs and Pydantic objects.

我在一開始就說過,我不會比較和對比資料類別、具名元組、attrs 和 Pydantic 物件。

However, I have come across people that categorically reject named tuples and will only use dataclasses.

但是,我遇到過一些絕對拒絕具名元組的人,他們只會使用資料類別。

My initial, knee-jerk reaction was along the lines of this old saying: when you have a shiny new hammer, everything looks like a nail. (I often use that saying when talking about metaclasses also!)

我的最初反應是這句老話:當你有一把閃亮的新錘子時,一切都像是釘子。(我在談論元類別時也經常使用這句話!)

My initial reaction was that named tuples are more “lightweight” and provide better performance.

我的最初反應是,具名元組更“輕量級”,並且提供更好的效能。

But this was just a (not totally unfounded) guess.

但這只是一個(不完全沒有根據的)猜測。

So, I wanted to look into that a bit more and see for myself.

所以,我想更深入地研究一下,並自己看看。

Let’s create a simple data class:

讓我們創建一個簡單的資料類別:

@dataclass
class PointD1:
    x: int = 0
    y: int = 0

And let’s create a named tuple:

緊接著讓我們創建一個具名元組:

from collections import namedtuple

PointNT1 = namedtuple("PointNT1", "x y")
from typing import NamedTuple

class PointNT2(NamedTuple):
    x: int = 0
    y: int = 0

Remember, I am looking at the use case of a callable returning multiple results.

記住,我正在研究可調用函數返回多個結果的用例。

Usually this is done by simply returning a plain tuple - but the named tuple gives us the advantage of being able to reference values by name, as well as by index.

通常這是通過返回一個普通的元組來完成的 - 但是具名元組使我們能夠通過名稱和索引引用值。

Of course this can be done using a dataclass - we could even make the dataclass immutable if we wanted to, and we could access values by index by using the astuple function - but this use case really only cares about accessing the fields by name (otherwise why not just use a plain tuple?).

當然,這可以使用資料類別來完成 - 我們甚至可以使資料類別變為不可變的,如果我們想要的話,我們可以通過使用 astuple 函數來通過索引訪問值 - 但是這個用例真的只關心通過名稱訪問欄位(否則為什麼不使用普通的元組?)。
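
The difference at the call site looks roughly like this. The names PointNT, PointDC, make_nt and make_dc are hypothetical and exist only for this sketch.

```python
from collections import namedtuple
from dataclasses import astuple, dataclass

PointNT = namedtuple("PointNT", "x y")


@dataclass(frozen=True)
class PointDC:
    x: int = 0
    y: int = 0


def make_nt():
    return PointNT(1, 2)


def make_dc():
    return PointDC(1, 2)


# Named tuple: access by name, by index, and direct unpacking all work.
nt = make_nt()
x1, y1 = nt
print(nt.x, nt[0], (x1, y1))

# Dataclass: access by name works; index-style use goes through astuple().
dc = make_dc()
x2, y2 = astuple(dc)
print(dc.x, astuple(dc)[0], (x2, y2))

try:
    dc[0]
except TypeError as ex:
    print(f"TypeError: {ex}")
```
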

Before someone accuses me of cheating, there is a way to make storage of dataclasses a bit more efficient, by using slots. I won’t get into what slots are here (I do in my deep dive course), but let’s use this as well.

在有人指責我作弊之前,有一種方法可以通過使用 slots 來使資料類別的存儲更有效率。我不會在這裡討論 slots 是什麼(我在我的深入課程中討論過),但是讓我們也使用這個。

@dataclass(slots=True)
class PointD2:
    x: int = 0
    y: int = 0

And lastly, since immutability may be something you want from a function’s return values, let’s do that variant too with data classes:

最後,由於不可變性可能是您希望從函數的返回值中獲得的,因此讓我們也使用資料類別來做這個變體:

@dataclass(frozen=True)
class PointD3:
    x: int = 0
    y: int = 0
@dataclass(frozen=True, slots=True)
class PointD4:
    x: int = 0
    y: int = 0

Ok, so to recap what we have:

好的,所以讓我們來總結一下我們所擁有的:

  • PointNT1: named tuple created using collections.namedtuple

  • PointNT1:使用 collections.namedtuple 創建的具名元組

  • PointNT2: named tuple created using the more modern (with type hints) typing.NamedTuple class

  • PointNT2:使用更現代的(帶有類型提示)typing.NamedTuple 類別創建的具名元組

  • PointD1: dataclass, mutable, no slots

  • PointD1:資料類別,可變,無 slots

  • PointD2: dataclass, mutable, slots

  • PointD2:資料類別,可變,slots

  • PointD3: dataclass, frozen, no slots

  • PointD3:資料類別,不可變,無 slots

  • PointD4: dataclass, frozen, slots

  • PointD4:資料類別,不可變,slots

Let’s create an instance of each one:

讓我們創建每個類別物件的一個實例:

pnt1 = PointNT1(1, 2)
pnt2 = PointNT2(1, 2)
pd1 = PointD1(1, 2)
pd2 = PointD2(1, 2)
pd3 = PointD3(1, 2)
pd4 = PointD4(1, 2)

Let’s look at the size of each one of those objects:

讓我們來看看這些物件的大小:

Getting the “size” of an object can be tricky, since we need to not only look at the object, but the data it contains too.

獲取物件的“大小”可能很棘手,因為我們不僅需要查看物件,還需要查看它包含的資料。

To make life simpler, I am going to use the objsize library (you’ll need to pip install it if you’re following along):

為了讓生活更簡單,我將使用 objsize 這個元件庫(如果您正在跟隨,您需要 pip 安裝它):

%pip install -U objsize
Defaulting to user installation because normal site-packages is not writeable
Collecting objsize
  Downloading objsize-0.7.0-py3-none-any.whl.metadata (12 kB)
Downloading objsize-0.7.0-py3-none-any.whl (11 kB)
Installing collected packages: objsize
Successfully installed objsize-0.7.0

And we can now use it this way:

現在我們可以這樣使用它:

from objsize import get_deep_size
get_deep_size((1, 2, 3)), get_deep_size([1, 2, 3])
(148, 172)
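
To illustrate why a "deep" size is needed at all, here is a naive standard-library sketch. sys.getsizeof reports only the container, so the contained objects have to be added in separately. This is not how objsize is implemented; it ignores shared and nested references, which is exactly the bookkeeping the library takes care of.

```python
import sys

t = (1, 2, 3)

shallow = sys.getsizeof(t)                               # the tuple object only
naive_deep = shallow + sum(sys.getsizeof(i) for i in t)  # plus its three ints

print(shallow, naive_deep)
```

The exact byte counts depend on the Python build, so treat any specific numbers as indicative only.
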

Let’s look at the memory for each variant:

讓我們來看看每個變體的記憶體:

print("NT1", get_deep_size(pnt1))
print("NT2", get_deep_size(pnt2))
print("D1", get_deep_size(pd1))
print("D2", get_deep_size(pd2))
print("D3", get_deep_size(pd3))
print("D4", get_deep_size(pd4))
NT1 112
NT2 112
D1 104
D2 104
D3 104
D4 104

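As we can observe, the dataclass instances are a bit more efficient than the named tuples when it comes to storage (in our scenario we are saving 8 bytes per instance - not exactly earth shattering, one way or the other). Note that in the run shown here the slotted and non-slotted dataclasses report the same size, so the saving we might have expected from slots does not show up in these numbers.

正如我們所觀察到的,在存儲方面,資料類別的實例比具名元組更有效率一些(在我們的情況下,每個實例節省了 8 個位元組 - 不管怎樣,這並不是什麼大不了的事)。請注意,在此處顯示的執行結果中,有 slots 與沒有 slots 的資料類別回報的大小相同,因此我們原本預期 slots 會帶來的節省並未反映在這些數字上。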

But what really surprised me was that the named tuples were not far more efficient with memory overhead than more full-featured classes.

但是真正讓我驚訝的是,具名元組的記憶體開銷並不比更完整的類別物件更有效率。

Next, what about timings for attribute access (by name)? How do the two compare?

接下來,關於屬性訪問(按名稱)的時間(效能)?這兩者如何比較?

from timeit import timeit
read_attrib_pnt1 = timeit("pnt1.x", globals=globals(), number=50_000_000)
read_attrib_pnt2 = timeit("pnt2.x", globals=globals(), number=50_000_000)
read_attrib_pd1 = timeit("pd1.x", globals=globals(), number=50_000_000)
read_attrib_pd2 = timeit("pd2.x", globals=globals(), number=50_000_000)
read_attrib_pd3 = timeit("pd3.x", globals=globals(), number=50_000_000)
read_attrib_pd4 = timeit("pd4.x", globals=globals(), number=50_000_000)
print(f"pnt1: {read_attrib_pnt1:.5f}")
print(f"pnt2: {read_attrib_pnt2:.5f}")
print(f"pd1: {read_attrib_pd1:.5f}")
print(f"pd2: {read_attrib_pd2:.5f}")
print(f"pd3: {read_attrib_pd3:.5f}")
print(f"pd4: {read_attrib_pd4:.5f}")
pnt1: 1.74130
pnt2: 1.74217
pd1: 1.79664
pd2: 1.08358
pd3: 1.92938
pd4: 1.10820

So, the interesting result here is that dataclasses do not seem to incur much, if any, memory overhead, and attribute access appears faster for slotted dataclasses than for named tuples (the non-slotted dataclasses are slightly slower in this run).

因此,這裡有趣的結果是,資料類別似乎沒有產生太多(如果有的話)記憶體開銷,而且使用 slots 的資料類別的屬性訪問速度似乎比具名元組更快(在這次執行中,沒有 slots 的資料類別則稍慢一些)。

The other thing that’s kind of important is the amount of time it takes to create a named tuple instance vs a dataclass instance.

另一件重要的事情是創建具名元組實例所需的時間與創建資料類別實例所需的時間。

create_pnt1 = timeit("PointNT1(1, 2)", globals=globals(), number=1_000_000)
create_pnt2 = timeit("PointNT2(1, 2)", globals=globals(), number=1_000_000)
create_pd1 = timeit("PointD1(1, 2)", globals=globals(), number=1_000_000)
create_pd2 = timeit("PointD2(1, 2)", globals=globals(), number=1_000_000)
create_pd3 = timeit("PointD3(1, 2)", globals=globals(), number=1_000_000)
create_pd4 = timeit("PointD4(1, 2)", globals=globals(), number=1_000_000)
print(f"pnt1: {create_pnt1:.5f}")
print(f"pnt2: {create_pnt2:.5f}")
print(f"pd1: {create_pd1:.5f}")
print(f"pd2: {create_pd2:.5f}")
print(f"pd3: {create_pd3:.5f}")
print(f"pd4: {create_pd4:.5f}")
pnt1: 0.36282
pnt2: 0.35463
pd1: 0.27498
pd2: 0.20434
pd3: 0.54290
pd4: 0.53302

So, this is interesting too - creating instances of a mutable data class is also faster than creating a named tuple instance.

因此,這也很有趣 - 創建可變資料類別的實例也比創建具名元組實例更快。

Let’s compare all the different timings in one table:

讓我們在一個表格中比較所有不同的時間:

%pip install -U tabulate
Defaulting to user installation because normal site-packages is not writeable
Collecting tabulate
  Downloading tabulate-0.9.0-py3-none-any.whl (35 kB)
Installing collected packages: tabulate
Successfully installed tabulate-0.9.0
from tabulate import tabulate, tabulate_formats
data = [
    ['Object', 'Size', 'Create', 'Read Attrib'],
    ['collections.namedtuple', get_deep_size(pnt1), create_pnt1, read_attrib_pnt1],
    ['typing.NamedTuple', get_deep_size(pnt2), create_pnt2, read_attrib_pnt2],
    ['dataclass (mutable)', get_deep_size(pd1), create_pd1, read_attrib_pd1],
    ['dataclass (mutable, slots)', get_deep_size(pd2), create_pd2, read_attrib_pd2],
    ['dataclass (frozen)', get_deep_size(pd3), create_pd3, read_attrib_pd3],
    ['dataclass (frozen, slots)', get_deep_size(pd4), create_pd4, read_attrib_pd4],
]

print(tabulate(data, headers="firstrow", tablefmt="fancy_outline"))
╒════════════════════════════╤════════╤══════════╤═══════════════╕
│ Object                     │   Size │   Create │   Read Attrib │
╞════════════════════════════╪════════╪══════════╪═══════════════╡
│ collections.namedtuple     │    112 │ 0.362822 │       1.7413  │
│ typing.NamedTuple          │    112 │ 0.354634 │       1.74217 │
│ dataclass (mutable)        │    104 │ 0.274985 │       1.79664 │
│ dataclass (mutable, slots) │    104 │ 0.20434  │       1.08358 │
│ dataclass (frozen)         │    104 │ 0.542897 │       1.92938 │
│ dataclass (frozen, slots)  │    104 │ 0.53302  │       1.1082  │
╘════════════════════════════╧════════╧══════════╧═══════════════╛
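
The raw seconds are easier to compare as ratios against collections.namedtuple. The sketch below simply redoes the arithmetic on the figures copied from the table above; the numbers come from that one run and will differ on other machines and Python versions.

```python
# (create seconds, read-attribute seconds), copied from the table above
timings = {
    "collections.namedtuple": (0.362822, 1.7413),
    "typing.NamedTuple": (0.354634, 1.74217),
    "dataclass (mutable)": (0.274985, 1.79664),
    "dataclass (mutable, slots)": (0.20434, 1.08358),
    "dataclass (frozen)": (0.542897, 1.92938),
    "dataclass (frozen, slots)": (0.53302, 1.1082),
}

base_create, base_read = timings["collections.namedtuple"]

# Ratio below 1.0 means faster than collections.namedtuple, above 1.0 means slower.
relative = {
    name: (create / base_create, read / base_read)
    for name, (create, read) in timings.items()
}

for name, (create_ratio, read_ratio) in relative.items():
    print(f"{name:28} create x{create_ratio:.2f}   read x{read_ratio:.2f}")
```

In this run the mutable slotted dataclass needs a bit over half the creation time and a bit under two thirds of the read time of the named tuple, while the frozen variants take roughly one and a half times as long to create.
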

Overall, it seems like the better option, if I were serializing something like rows from a database into some structure, or returning structured data from a callable, would be to use a mutable dataclass with slots.

總的來說,如果我要將資料庫中的資料行序列化為某種結構,或者從可調用的函數返回結構化資料,那麼使用具有 slots 的可變資料類別似乎是更好的選擇。

Not a conclusion I was expecting when I first started looking at this to be honest.

說實話,當我第一次開始研究這個問題時,我並沒有預料到這個結論。

So, am I going to switch to dataclasses instead of named tuples for function return values? Probably not, I’m set in my ways, and I do like the ability to create a named tuple using a single line of code:

那麼,我會改用資料類別來替代具名元組作為函數返回值嗎?可能不會,我已經設定好了我的方式,而且我確實喜歡使用一行程式碼來創建具名元組的能力:

Point = namedtuple('Point', 'x y z')

But I have to admit this has made me rethink named tuples vs dataclasses!

但我必須承認,這讓我重新思考了具名元組與資料類別!

Did I get something wrong with these comparisons? Let me know in the comments.

我在這些比較中有什麼錯誤嗎?在評論中讓我知道。

結論與未來的影片 (Conclusion and Future Video)

We saw how to create dataclasses, and customize them quite a bit.

我們看到了如何創建資料類別,並對它們進行了相當多的自定義。

But there are many more fine-grained customizations we can add to dataclasses by adding declarations to the fields defined in the dataclass body, a few more options in the @dataclass decorator, as well as an interesting new special method available for dataclasses, called __post_init__. We also have the option to add metadata to each field in the dataclass, as well as a few other odds and ends.

但是,我們可以通過在資料類別主體中定義的欄位中添加宣告、在 @dataclass 裝飾器中添加更多選項,以及一個有趣的新特殊方法(稱為 __post_init__)來為資料類別添加更多細粒度的自定義。我們還可以為資料類別中的每個欄位添加元資料,以及其他一些零星的東西。

For all the functionality we saw with dataclasses, keep in mind that this is basically a subset of attrs.

對於我們在資料類別中看到的所有功能,請記住,這基本上是 attrs 的一個子集。

But this is more than enough for a single video, so I’m going to leave things here and come back with a second installment on more advanced features of dataclasses in the near future. Stay tuned!

但是這已經足夠了,所以我要在這裡結束,並在不久的將來回來,介紹資料類別的更高級功能。敬請關注!
