Python 3.13 Free-Threaded Mode Performance Evaluation: Don't Use It in Production Yet!

2024-11-12

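Python 3.13 introduces three major performance-related technologies:

free-threaded mode, that is, running with the global interpreter lock (GIL) disabled
a brand-new JIT compiler
the mimalloc allocator, available out of the box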

The most eye-catching of these is free-threaded mode. So how does it actually perform? This article walks through three experiments.

Free-threaded mode

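Free-threaded mode allows CPython to run without the GIL. The GIL is a mutex that prevents multiple threads from executing Python bytecode at the same time. It was originally meant to simplify CPython's memory management and make the C API easier to use, but in the multi-core era it has become a very visible drawback.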

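To work around the GIL, you can use multiple processes, but that approach has several drawbacks:

Higher memory overhead, because each process has its own memory space
Higher communication cost, because processes cannot share memory directly and data must be serialized and deserialized when it moves between them
Slower startup, because creating a new process is much slower than creating a thread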

PageRank test example

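To test the performance of free-threaded mode, we can implement the PageRank algorithm, the algorithm used by the Google search engine. It makes a good example because it is:

Compute-intensive (matrix operations)
Able to work on large datasets (a web graph)
Well suited to parallelization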

Test approach: single-threaded

import numpy as np

DAMPING = 0.85  # assumed value; the article does not show the constant's definition


def pagerank_single(matrix: np.ndarray, num_iterations: int) -> np.ndarray:
    """Single-threaded implementation."""
    size = matrix.shape[0]
    scores = np.ones(size) / size

    for _ in range(num_iterations):
        new_scores = np.zeros(size)
        for i in range(size):
            # Get nodes that point to current node
            incoming = np.where(matrix[:, i])[0]
            for j in incoming:
                # Add node j's score, split evenly across its outgoing links
                new_scores[i] += scores[j] / np.sum(matrix[j])
        # Apply the damping factor to obtain the updated scores
        scores = (1 - DAMPING) / size + DAMPING * new_scores

    return scores

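The two most expensive parts of this code are the loop that accumulates each node's score from its incoming nodes and the final step that computes the overall scores. The first part is the better candidate for running in parallel across threads.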
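To make the update rule concrete, here is a minimal pure-Python sketch of the same iteration on a three-node cycle. The helper name pagerank_tiny and the damping value 0.85 are assumptions chosen for illustration; they are not taken from the benchmark code.

```python
DAMPING = 0.85  # assumed value; the commonly used PageRank damping factor


def pagerank_tiny(links: list[list[int]], num_iterations: int) -> list[float]:
    """PageRank on an adjacency matrix where links[j][i] == 1 means j -> i."""
    size = len(links)
    out_degree = [sum(row) for row in links]  # number of outgoing links per node
    scores = [1.0 / size] * size

    for _ in range(num_iterations):
        new_scores = [0.0] * size
        for i in range(size):
            for j in range(size):
                if links[j][i]:
                    # Node j shares its score equally among its outgoing links
                    new_scores[i] += scores[j] / out_degree[j]
        # Mix in the uniform "random jump" term
        scores = [(1 - DAMPING) / size + DAMPING * s for s in new_scores]
    return scores


# A three-node cycle: 0 -> 1 -> 2 -> 0
cycle = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
print(pagerank_tiny(cycle, num_iterations=10))
```

Because the cycle is symmetric, every node keeps a score of about 1/3 no matter how many iterations run, which makes it a convenient sanity check for any of the parallel variants.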

Test approach: multi-threaded

First, split the matrix into chunks:

chunk_size = size // num_threads
chunks = [(i, min(i + chunk_size, size)) for i in range(0, size, chunk_size)]
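One detail worth checking in the chunking step: with floor division, a size that is not a multiple of num_threads produces one more chunk than there are threads (10 nodes and 4 threads give 5 chunks), and a size smaller than num_threads makes chunk_size zero, which range() rejects. The sketch below is one possible alternative using ceiling division; the helper name make_chunks is hypothetical and not part of the original code.

```python
import math


def make_chunks(size: int, num_workers: int) -> list[tuple[int, int]]:
    """Split range(size) into at most num_workers contiguous (start, end) chunks."""
    # Ceiling division keeps the chunk count <= num_workers and never yields 0
    chunk_size = max(1, math.ceil(size / num_workers))
    return [(i, min(i + chunk_size, size)) for i in range(0, size, chunk_size)]


print(make_chunks(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(make_chunks(3, 8))   # [(0, 1), (1, 2), (2, 3)]
```

For the graph sizes used in the benchmark (100, 1000, 2000) with 8 workers, the difference only shows up for the size that 8 does not divide evenly.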

Next, each thread processes a different chunk of the matrix:

def _thread_worker(
    matrix: np.ndarray,
    scores: np.ndarray,
    new_scores: np.ndarray,
    start_idx: int,
    end_idx: int,
    lock: threading.Lock,
):
    size = matrix.shape[0]
    local_scores = np.zeros(size)

    for i in range(start_idx, end_idx):
        incoming = np.where(matrix[:, i])[0]
        for j in incoming:
            local_scores[i] += scores[j] / np.sum(matrix[j])

    with lock:
        # Merge this thread's partial result into the shared array, one thread at a time
        new_scores += local_scores

Note that updating the new_scores array must be done under a lock to prevent race conditions.
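The pattern here is to do the heavy work on thread-private data and take the lock only for the short merge step. A minimal stdlib-only sketch of that pattern follows; the function _accumulate and the sample data are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def _accumulate(values, totals, start, end, lock):
    """Sum values[start:end] into slot 0 of the shared totals list."""
    local_total = 0.0
    for k in range(start, end):
        local_total += values[k]  # private to this thread, no lock needed

    with lock:
        # Read-modify-write on shared state: only one thread at a time
        totals[0] += local_total


values = [1.0] * 1000
totals = [0.0]
lock = threading.Lock()
chunks = [(0, 250), (250, 500), (500, 750), (750, 1000)]

with ThreadPoolExecutor(max_workers=4) as executor:
    # list() consumes the results so that any worker exception is re-raised here
    list(executor.map(lambda args: _accumulate(values, totals, *args, lock), chunks))

print(totals[0])  # 1000.0
```

Keeping the locked section this small matters more without the GIL, since threads then genuinely run at the same time and would otherwise queue up behind the lock.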

Finally, run the workers across multiple threads:

new_scores = np.zeros(size)
lock = threading.Lock()
with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
    # Consume the results so that an exception raised in a worker is re-raised here
    list(
        executor.map(
            lambda args: _thread_worker(*args),  # starmap isn't available on ThreadPoolExecutor
            [
                (matrix, scores, new_scores, start_idx, end_idx, lock)
                for start_idx, end_idx in chunks
            ],
        )
    )
# Leaving the with block waits for every worker to finish
new_scores = (1 - DAMPING) / size + DAMPING * new_scores
scores = new_scores

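Test approach: multi-process

The multiprocessing version is very similar to the threading one.

Because processes cannot share memory directly, each worker now returns its local_scores array instead of updating the shared new_scores array. The local scores are then combined in the main process: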

# Combine the results
new_scores = sum(chunk_results)

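Although this should be faster than the threaded version, it still pays for inter-process communication, and that overhead can become significant for large datasets.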
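To get a feel for that overhead, the sketch below computes partial results the way a worker might and then measures how many bytes would have to be pickled per task. Everything here is an assumption for illustration: the worker name _process_chunk_sketch, the ring graph used as sample data, and the choice to pass the whole matrix with each task. The workers are called directly in one process so the sketch runs anywhere; pickle stands in for the serialization that multiprocessing performs.

```python
import pickle


def _process_chunk_sketch(links, scores, out_degree, start, end):
    """Hypothetical worker: return a full-length partial score list for one chunk."""
    size = len(links)
    local_scores = [0.0] * size
    for i in range(start, end):
        for j in range(size):
            if links[j][i]:
                local_scores[i] += scores[j] / out_degree[j]
    return local_scores


size = 200
# Ring graph used only as sample data: node j links to node (j + 1) % size
links = [[1 if i == (j + 1) % size else 0 for i in range(size)] for j in range(size)]
out_degree = [sum(row) for row in links]
scores = [1.0 / size] * size
chunks = [(0, 100), (100, 200)]

chunk_results = [_process_chunk_sketch(links, scores, out_degree, s, e) for s, e in chunks]
new_scores = [sum(column) for column in zip(*chunk_results)]  # element-wise merge

# With real processes, each argument tuple and each result crosses a pickle boundary
args_bytes = len(pickle.dumps((links, scores, out_degree, 0, 100)))
result_bytes = len(pickle.dumps(chunk_results[0]))
print(f"per task: {args_bytes} bytes sent, {result_bytes} bytes returned")
```

The argument payload grows with the square of the graph size while the useful work per chunk does not grow as fast, which is one plausible reason the multiprocessing variant struggles in this benchmark.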

Here multiprocessing.Pool takes the place of ThreadPoolExecutor:

with multiprocessing.Pool(processes=num_processes) as pool: 
    chunk_results = pool.starmap(_process_chunk, chunks) 
    new_scores = sum(chunk_results)
    new_scores = (1 - DAMPING) / size + DAMPING * new_scores
    scores = new_scores

Evaluating performance

def create_test_graph(size: int) -> np.ndarray:
    # Fixed seed
    np.random.seed(0)
    # Each entry is a link with probability 5/size, so about 5 outgoing links per node
    matrix = np.random.choice([0, 1], size=(size, size), p=[1 - 5/size, 5/size])
    # Give every node without outgoing links one random target to avoid dividing by zero
    zero_outdegree = ~matrix.any(axis=1)
    zero_indices = np.where(zero_outdegree)[0]
    if len(zero_indices) > 0:
        random_targets = np.random.randint(0, size, size=len(zero_indices))
        matrix[zero_indices, random_targets] = 1

    return matrix

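The test uses a fixed random seed so that every run is reproducible, which is essential when comparing the performance of different implementations.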
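The effect of seeding can be checked directly. The following stdlib-only sketch mirrors the graph generator with Python's random module instead of NumPy; the function name create_test_graph_sketch and its seed parameter are assumptions for illustration.

```python
import random


def create_test_graph_sketch(size: int, seed: int = 0) -> list[list[int]]:
    """Random sparse adjacency matrix with about 5 outgoing links per node."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    p = min(1.0, 5 / size)
    matrix = [[1 if rng.random() < p else 0 for _ in range(size)] for _ in range(size)]
    for row in matrix:
        if not any(row):
            # Give dangling nodes one random outgoing link to avoid dividing by zero
            row[rng.randrange(size)] = 1
    return matrix


first = create_test_graph_sketch(50)
second = create_test_graph_sketch(50)
print(first == second)  # True: the same seed yields the same graph
```

Using a private generator object rather than seeding the global one is a design choice: it keeps the benchmark input stable even if other code in the same process also draws random numbers.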

Next, the pytest plugin pytest-codspeed is used to measure how the different implementations perform across different parameters and multiple CPython builds.

@pytest.mark.parametrize(
    "pagerank",
    [
        pagerank_single,
        partial(pagerank_multiprocess, num_processes=8),
        partial(pagerank_multithread, num_threads=8),
    ],
    ids=["single", "8-processes", "8-threads"],
)
@pytest.mark.parametrize(
    "graph",
    [
        create_test_graph(100),
        create_test_graph(1000),
        create_test_graph(2000),
    ],
    ids=["XS", "L", "XL"],
)
def test_pagerank(
    benchmark: BenchmarkFixture,
    pagerank: PagerankFunc,
    graph: np.ndarray,
):
    benchmark(pagerank, graph, num_iterations=10)

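Four configurations are benchmarked: Python 3.12.7 (with the GIL), Python 3.13.0 (with the GIL), Python 3.13.0t (the free-threaded build with the GIL still enabled), and Python 3.13.0t with no GIL (the free-threaded build with the GIL disabled).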
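When comparing these configurations it helps to confirm which one a given interpreter actually is. The sketch below is one way to do that, assuming only the standard library: the Py_GIL_DISABLED build variable identifies a free-threaded build, and sys._is_gil_enabled(), added in Python 3.13, reports whether the GIL is active at runtime. The helper name describe_build is hypothetical.

```python
import sys
import sysconfig


def describe_build() -> dict[str, bool]:
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; older versions always have a GIL
    is_gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": bool(is_gil_enabled()),
    }


print(sys.version)
print(describe_build())
```

On a free-threaded build the GIL can be switched back on with the PYTHON_GIL=1 environment variable or the -X gil=1 command-line option, which is how the same 3.13.0t binary can cover both of the last two configurations.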

Test results

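(1) Python 3.12.7 and Python 3.13.0 perform similarly, and the multi-process version is even slower than the multi-threaded one, which shows how large the inter-process communication overhead is.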

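(2) With the GIL disabled, the multi-threaded version on 3.13t with no GIL performs best, which shows that removing the GIL lets threads truly run in parallel.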

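(3) In the current free-threaded build, performance drops substantially whether or not the GIL is enabled. The main reason is that the free-threaded build has to turn off the specializing adaptive interpreter, which slows execution noticeably. Python 3.14 is expected to improve this.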

Conclusion: free-threaded mode is not yet fully ready for production use, but the next release may resolve these issues.

Source: 虞大胆 (虞大胆的叽叽喳喳)
