7 Essential Python Itertools for Feature Engineering

Repost information

Original link: https://machinelearningmastery.com/7-essential-python-itertools-for-feature-engineering/

Original author: Bala Priya C

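In this article, you will learn how to use Python's itertools module to simplify common feature engineering tasks with concise, efficient patterns.

Topics we will cover include:

Generating interaction, polynomial, and cumulative features with itertools.
Building lookup grids, lag windows, and grouped aggregations to support structured-data workflows.
Using iterator-based tools to write cleaner, more composable feature engineering code.

Let's get started.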


Introduction

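Feature engineering is where most of the practical work in machine learning happens. A good feature often improves a model more than switching algorithms does. Yet this step tends to produce messy code full of nested loops, manual indexing, and hand-built combinations.

Python's itertools module is a standard library toolkit that most data scientists know exists but rarely reach for when building features. That is a missed opportunity, because itertools is designed for working with iterators efficiently. Much of feature engineering is, at its core, structured iteration over pairs of variables, sliding windows, grouped sequences, or every possible subset of a feature set.

In this article, you will work through seven itertools functions that solve common feature engineering problems. We will use simulated e-commerce data and cover interaction features, lag windows, category combinations, and more. By the end, you will have a set of patterns you can drop straight into your own feature engineering pipelines.

You can get the code on GitHub.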

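1. Generating Interaction Features with `combinations`

Interaction features capture the relationship between two variables, something neither variable can express on its own. Listing every pair by hand in a multi-column dataset is tedious. combinations from the itertools module does it in one line of code.

Let's write an example that creates interaction features with combinations: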

import itertools
import pandas as pd

df = pd.DataFrame({
    "avg_order_value": [142.5, 89.0, 210.3, 67.8, 185.0],
    "discount_rate": [0.10, 0.25, 0.05, 0.30, 0.15],
    "days_since_signup": [120, 45, 380, 12, 200],
    "items_per_order": [3.2, 1.8, 5.1, 1.2, 4.0],
    "return_rate": [0.05, 0.18, 0.02, 0.22, 0.08],
})

numeric_cols = df.columns.tolist()

for col_a, col_b in itertools.combinations(numeric_cols, 2):
    feature_name = f"{col_a}_x_{col_b}"
    df[feature_name] = df[col_a] * df[col_b]

interaction_cols = [c for c in df.columns if "_x_" in c]
print(df[interaction_cols].head())

Output (truncated to six of the ten interaction columns):

   avg_order_value_x_discount_rate  avg_order_value_x_days_since_signup
0                           14.250                             17100.0
1                           22.250                              4005.0
2                           10.515                             79914.0
3                           20.340                               813.6
4                           27.750                             37000.0

   avg_order_value_x_items_per_order  avg_order_value_x_return_rate
0                             456.00                          7.125
1                             160.20                         16.020
2                            1072.53                          4.206
3                              81.36                         14.916
4                             740.00                         14.800

   days_since_signup_x_return_rate  items_per_order_x_return_rate
0                               6.00                          0.160
1                               8.10                          0.324
2                               7.60                          0.102
3                               2.64                          0.264
4                              16.00                          0.320

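combinations(numeric_cols, 2) yields each unique pair exactly once, with no repeats. For 5 columns there are 10 pairs; for 10 columns there are 45. The same loop keeps working as you add columns, although the number of pairs grows quadratically, as n(n - 1) / 2 for n columns.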
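To make the growth in pair count concrete, the following sketch counts the pairs that combinations yields and compares the result with the closed form. The helper name count_pairs and the column counts are illustrative choices, not part of the original example.

```python
import itertools
import math


def count_pairs(n_cols):
    """Count the unordered column pairs that combinations(..., 2) yields."""
    return sum(1 for _ in itertools.combinations(range(n_cols), 2))


# The counts should match n * (n - 1) / 2, which is math.comb(n, 2).
pair_counts = {n: count_pairs(n) for n in (5, 10, 50)}
closed_form = {n: math.comb(n, 2) for n in (5, 10, 50)}

print(pair_counts)
print(pair_counts == closed_form)
```

With 50 columns the loop would already create more than a thousand interaction columns, so in practice it is worth restricting numeric_cols to the columns you expect to interact.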

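2. Building Cross-Category Feature Grids with `product`

itertools.product gives you the Cartesian product of two or more iterables: every possible combination that takes one element from each of them, with each value reused across as many tuples as needed.

In the e-commerce example we are working with, this is useful when you want to build a feature matrix that spans customer segments and product categories.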

import itertools

import numpy as np
import pandas as pd

customer_segments = ["new", "returning", "vip"]
product_categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]

# All segment × category × channel combinations
combos = list(itertools.product(customer_segments, product_categories, channels))
grid_df = pd.DataFrame(combos, columns=["segment", "category", "channel"])

# Simulate a conversion-rate lookup for each combination
np.random.seed(7)
grid_df["avg_conversion_rate"] = np.round(
    np.random.uniform(0.02, 0.18, size=len(grid_df)),
    3
)

print(grid_df.head(12))
print(f"\nTotal combinations: {len(grid_df)}")

Output:

      segment     category  channel  avg_conversion_rate
0         new  electronics   mobile                0.032
1         new  electronics  desktop                0.145
2         new      apparel   mobile                0.090
3         new      apparel  desktop                0.136
4         new   home_goods   mobile                0.176
5         new   home_goods  desktop                0.106
6         new       beauty   mobile                0.100
7         new       beauty  desktop                0.032
8   returning  electronics   mobile                0.063
9   returning  electronics  desktop                0.100
10  returning      apparel   mobile                0.129
11  returning      apparel  desktop                0.149

Total combinations: 24

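This grid can then be merged back into your main transaction dataset as a lookup feature, so that each row receives the expected conversion rate for its specific segment × category × channel. product guarantees that no valid combination is missed when the grid is built.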
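The merge step described above could look roughly like the sketch below. The transactions frame, its order_id column, and the placeholder conversion rates are all hypothetical; only the three key columns follow the grid from the example.

```python
import itertools

import pandas as pd

segments = ["new", "returning", "vip"]
categories = ["electronics", "apparel", "home_goods", "beauty"]
channels = ["mobile", "desktop"]

grid_df = pd.DataFrame(
    list(itertools.product(segments, categories, channels)),
    columns=["segment", "category", "channel"],
)
# Placeholder rates (hypothetical): 0.01, 0.02, ... one per grid row.
grid_df["avg_conversion_rate"] = [round(0.01 * (i + 1), 2) for i in range(len(grid_df))]

# Hypothetical transactions to enrich with the lookup feature.
transactions = pd.DataFrame({
    "order_id": [101, 102, 103],
    "segment": ["vip", "new", "returning"],
    "category": ["beauty", "electronics", "apparel"],
    "channel": ["desktop", "mobile", "mobile"],
})

# A left merge keeps every transaction; validate checks that each key
# combination appears at most once in the grid.
enriched = transactions.merge(
    grid_df,
    on=["segment", "category", "channel"],
    how="left",
    validate="many_to_one",
)
print(enriched)
```

Because the grid covers every combination, no transaction should end up with a missing rate; a NaN after the merge would point to an unexpected category value in the transaction data.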

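3. Flattening Multi-Source Feature Sets with `chain`

In most pipelines, features come from several sources: a customer profile table, a product metadata table, and a browsing history table. You often need to flatten them into a single feature list for column selection or validation.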

import itertools

customer_features = [
    "customer_age",
    "days_since_signup",
    "lifetime_value",
    "total_orders",
    "avg_order_value"
]

product_features = [
    "category",
    "brand_tier",
    "avg_rating",
    "review_count",
    "is_sponsored"
]

behavioral_features = [
    "pages_viewed_last_7d",
    "search_queries_last_7d",
    "cart_abandonment_rate",
    "wishlist_size"
]

# Flatten all feature groups into a single list
all_features = list(itertools.chain(
    customer_features,
    product_features,
    behavioral_features
))

print(f"Total features: {len(all_features)}")
print(all_features)

Output:

Total features: 14
['customer_age', 'days_since_signup', 'lifetime_value', 'total_orders', 'avg_order_value', 'category', 'brand_tier', 'avg_rating', 'review_count', 'is_sponsored', 'pages_viewed_last_7d', 'search_queries_last_7d', 'cart_abandonment_rate', 'wishlist_size']

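itertools.chain joins several lists into a single iterator lazily, yielding elements one at a time instead of building intermediate concatenated lists. This is useful for preparing a known feature list to pass to a feature selection algorithm.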
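When the feature groups live in a dictionary rather than in separate variables, chain.from_iterable does the same flattening. The sketch below also checks the flattened list for duplicate names, which is one form of the validation mentioned above. The feature_groups dictionary and its deliberately repeated column are made up for illustration.

```python
import itertools
from collections import Counter

# Hypothetical feature groups keyed by source table.
feature_groups = {
    "customer": ["customer_age", "days_since_signup", "avg_order_value"],
    "product": ["category", "avg_rating"],
    "behavioral": ["pages_viewed_last_7d", "avg_order_value"],  # deliberate duplicate
}

# chain.from_iterable flattens an iterable of iterables in one call.
all_features = list(itertools.chain.from_iterable(feature_groups.values()))

# Names that appear in more than one group.
duplicates = [name for name, count in Counter(all_features).items() if count > 1]

print(len(all_features))
print(duplicates)
```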

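4. Slicing Iterators with `islice`

When working with very large datasets or infinite generators, you may only need the first N elements, or you may want to skip the first M elements and then take the next N. itertools.islice lets you slice any iterator much as you would slice a list, without loading all the data into memory at once.

For example, suppose we have a function that generates an infinite sequence and we only want to look at one slice of it: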

import itertools

def count_infinite():
    num = 0
    while True:
        yield num
        num += 1

# Take 5 elements starting at index 10 (indices 10 through 14)
slicer = itertools.islice(count_infinite(), 10, 15)

print(list(slicer))

Output:

[10, 11, 12, 13, 14]

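This is useful when you need to sample from a large file or stream, or when you only want to process a subset of the data for debugging or a first-pass analysis.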
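One way to apply this to a large file is to read it in fixed-size batches. The sketch below is a minimal version of that idea; the helper iter_batches is a hypothetical name, and io.StringIO stands in for a real file object so the example runs on its own.

```python
import io
import itertools


def iter_batches(iterable, batch_size):
    """Yield lists of up to batch_size items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# io.StringIO stands in for a large file opened with open(...).
fake_file = io.StringIO(
    "order_id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(7))
)

header = next(fake_file).strip()  # consume the header line first
batches = list(iter_batches((line.strip() for line in fake_file), 3))

print(header)
print([len(batch) for batch in batches])
```

Each call to islice resumes where the previous one stopped, because all calls share the same underlying iterator. On Python 3.12 and later, itertools.batched offers similar behavior out of the box.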

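5. Grouped Aggregations with `groupby`

itertools.groupby groups consecutive items that share a key, which lets you run an aggregation over each group. In feature engineering, this can be used to compute a per-group mean, count, sum, or other statistic and add it back to the data as a new feature.

Important: groupby only groups consecutive items with the same key, so the input must first be sorted by the grouping key.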
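The effect of the sorting requirement is easy to see on a small list. In this sketch, the customer_ids list is made up; the same IDs are grouped once as they arrive and once after sorting.

```python
import itertools

customer_ids = [1, 2, 1, 3, 2, 1]

# Without sorting, each run of equal keys becomes its own group.
unsorted_groups = [
    (key, len(list(group))) for key, group in itertools.groupby(customer_ids)
]

# After sorting, every key appears in exactly one group.
sorted_groups = [
    (key, len(list(group))) for key, group in itertools.groupby(sorted(customer_ids))
]

print(unsorted_groups)
print(sorted_groups)
```

On unsorted input, customer 1 shows up as three separate groups of one item each, so any per-customer total computed this way would be wrong without raising an error.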

Suppose we have a list of customer transactions and want to compute the total order value and the number of orders for each customer:

import itertools
import pandas as pd

# Simulated transaction data
transactions = [
    {"customer_id": 1, "order_value": 100},
    {"customer_id": 2, "order_value": 50},
    {"customer_id": 1, "order_value": 150},
    {"customer_id": 3, "order_value": 200},
    {"customer_id": 2, "order_value": 75},
    {"customer_id": 1, "order_value": 120},
]

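# Sort by customer_id (groupby needs equal keys to be adjacent)
transactions.sort(key=lambda x: x["customer_id"])

# Use groupby to compute each customer's total order value and order count
agg_features = []
for customer_id, group in itertools.groupby(transactions, key=lambda x: x["customer_id"]):
    group_list = list(group)  # materialize the group so it can be read more than once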
    total_order_value = sum(item["order_value"] for item in group_list)
    order_count = len(group_list)
    agg_features.append({
        "customer_id": customer_id,
        "total_order_value": total_order_value,
        "order_count": order_count
    })

agg_df = pd.DataFrame(agg_features)
print(agg_df)

Output:

   customer_id  total_order_value  order_count
0            1                370            3
1            2                125            2
2            3                200            1

You can then merge agg_df back into the original data to add each customer's total spend and purchase count as new features.
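A minimal sketch of that merge, assuming the transactions are held in a pandas DataFrame, is shown below. The frame name orders and the result name orders_with_features are illustrative.

```python
import itertools

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 3, 2, 1],
    "order_value": [100, 50, 150, 200, 75, 120],
})

# Convert rows to dicts and sort by the grouping key before calling groupby.
records = sorted(orders.to_dict("records"), key=lambda r: r["customer_id"])

rows = []
for customer_id, group in itertools.groupby(records, key=lambda r: r["customer_id"]):
    values = [r["order_value"] for r in group]
    rows.append({
        "customer_id": customer_id,
        "total_order_value": sum(values),
        "order_count": len(values),
    })
agg_df = pd.DataFrame(rows)

# Attach the per-customer aggregates to every order row.
orders_with_features = orders.merge(agg_df, on="customer_id", how="left")
print(orders_with_features)
```

When the data already sits in a DataFrame, pandas' own DataFrame.groupby computes the same aggregates without a sort step; the itertools version is most useful for plain Python records or streamed data.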

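6. Duplicating Iterators with `tee`

Sometimes you need to traverse the same iterator more than once, but a standard iterator is exhausted after a single pass. itertools.tee creates several independent iterators from one source, and each of them can be consumed separately.

For example, if you want to compute both the mean and the median of a sequence without generating the data twice: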

import itertools
import statistics

data = iter([1, 2, 3, 4, 5])

# Create two independent iterators; do not use `data` directly after this
mean_iter, median_iter = itertools.tee(data)

# Compute the mean in a single pass, tracking the sum and the count together
# (calling sum() and then len(list(...)) on the same iterator would fail,
# because sum() already exhausts it)
total = count = 0
for value in mean_iter:
    total += value
    count += 1
mean_val = total / count

# Compute the median from the second copy
median_val = statistics.median(median_iter)

print(f"Mean: {mean_val}")
print(f"Median: {median_val}")

Output:

Mean: 3.0
Median: 3

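tee is useful when you need several independent passes to compute different statistics, for example when deriving a range of aggregate features from the same raw data. Keep in mind that tee buffers every item one copy has read and another has not, so if one copy is fully consumed before the next one starts, converting the data to a list is usually just as efficient.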
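tee pays off most when the copies advance together. A common pattern is to offset one copy by a single step and zip the two, which pairs each element with its successor. The sketch below uses that pattern to build step-to-step differences; pairwise_diffs and daily_order_counts are hypothetical names.

```python
import itertools


def pairwise_diffs(values):
    """Return values[i + 1] - values[i] using two tee copies offset by one step."""
    current, following = itertools.tee(values)
    next(following, None)  # advance the second copy by one element
    return [b - a for a, b in zip(current, following)]


# Works on a one-shot iterator, not only on lists.
daily_order_counts = iter([12, 15, 11, 18, 18])
diffs = pairwise_diffs(daily_order_counts)
print(diffs)
```

Because the two copies stay one element apart, tee only needs to buffer a single item at a time. On Python 3.10 and later, itertools.pairwise provides the same pairing directly.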

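7. Cumulative Calculations with `accumulate`

itertools.accumulate performs a running calculation over the elements of an iterator. By default it produces a running sum. You can pass a function to accumulate something else, such as a running product or a running maximum.

In feature engineering, this is useful for creating cumulative metrics, such as a customer's cumulative spend or the running total of a time series.

Computing cumulative sums and products: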

import itertools

data = [1, 2, 3, 4, 5]

# Running sum
cumulative_sum = list(itertools.accumulate(data))
print(f"Cumulative sum: {cumulative_sum}")

# Running product
cumulative_product = list(itertools.accumulate(data, lambda x, y: x * y))
print(f"Cumulative product: {cumulative_product}")

Output:

Cumulative sum: [1, 3, 6, 10, 15]
Cumulative product: [1, 2, 6, 24, 120]

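accumulate is an effective way to build features that evolve over time, such as the cumulative return of a stock price or the running count of actions in a user session.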
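As a sketch of how this applies to the e-commerce setting, the example below derives a few running features from one customer's orders in chronological order. The order_values list and the feature names are made up for illustration; the initial keyword requires Python 3.8 or later.

```python
import itertools

# Hypothetical order values for one customer, oldest first.
order_values = [100, 150, 120, 300, 80]

# Running total including the current order.
cumulative_spend = list(itertools.accumulate(order_values))

# Running maximum: the largest order seen so far.
largest_order_so_far = list(itertools.accumulate(order_values, max))

# Spend before each order, excluding the current one. Starting from
# initial=0 adds one leading element, so the last total is dropped.
prior_spend = list(itertools.accumulate(order_values, initial=0))[:-1]

# Running average derived from the running total.
running_avg = [total / n for n, total in enumerate(cumulative_spend, start=1)]

print(cumulative_spend)
print(largest_order_so_far)
print(prior_spend)
print([round(avg, 2) for avg in running_avg])
```

The prior_spend variant is often the safer feature when the current order's value is related to the prediction target, since it only uses information available before that order.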

Conclusion

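Python's itertools module provides tools that help you write feature engineering code that is cleaner and more efficient. From generating combinations and Cartesian products to flattening lists, slicing and duplicating iterators, aggregating by group, and running cumulative calculations, itertools offers a compact solution for each.

By folding these itertools patterns into your feature engineering workflow, you can cut boilerplate and make your code easier to read and maintain, which in turn makes it easier to iterate on the features your models depend on.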
