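DeepFunding's Mini-contest #1 and #2 have attracted about 200 participants, several of whom shared approaches worth studying. This article analyzes the code and strategies of two standout participants, davidgasquez and Allan Niemerg.

What is DeepFunding? See: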

https://mp.weixin.qq.com/s?__biz=MzI2NzExNTczMw%3D%3D&mid=2653294746&idx=1&sn=3323532273f52a64da043255a7ace184&scene=21#wechat_redirect

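The sections below use davidgasquez's and Allan Niemerg's code to walk through how to enter the DeepFunding contests and how each of them approached the allocation problem.

Repos: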

https://github.com/davidgasquez/predictive-funding-challenge

https://github.com/aniemerg/mini-contest

Data resources

Mini-contest #1 training set: https://github.com/deepfunding/mini-contest/blob/main/dataset.csv
Mini-contest #2 training set: https://cryptopond.xyz/modelfactory/detail/306250?tab=1
Open-source repo dependency graph: https://cosmograph.app/run/?data=https://raw.githubusercontent.com/opensource-observer/insights/refs/heads/main/community/deep_funder/data/unweighted_graph.csv&source=seed_repo_name&target=package_repo_name&gravity=0.25&repulsion=1&repulsionTheta=1&linkSpring=1&linkDistance=10&friction=0.1&renderLabels=true&renderHoveredLabel=true&renderLinks=true&linkArrows=true&curvedLinks=true&nodeSizeScale=0.5&linkWidthScale=1&linkArrowsSizeScale=1&nodeSize=size-default&nodeColor=color-outgoing links&linkWidth=width-number of data records&linkColor=color-number of data records&

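Training workflow

Both participants follow the same four steps:

Data preparation
Feature engineering
Feature selection and parameter optimization
Model training

Data preprocessing

Although the two workflows are broadly the same, each participant has a distinct approach. davidgasquez relies mostly on "traditional" features: basic project metadata (stars, forks, watchers, and so on) plus time-decay and ratio features derived from it. Allan Niemerg uses the traditional data as well, but additionally has GPT-4 analyze each project's documentation to extract features, which is why he fetches files from every participating repository.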

https://github.com/davidgasquez/predictive-funding-challenge/blob/main/src/data.py

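In this step, davidgasquez starts from the basic features provided by the contest organizers and pulls additional ones through the official GitHub API, for example:

Time features
Project age: age_days
Time since last update: days_since_update
Time-decay features: computed with several decay rates (0.0001, 0.001, 0.01)

Time-decay feature example
stars_decay = stars * exp(-rate * days_since_update)

Ratio features
Relative metrics between the two projects: stars_ratio, forks_ratio, etc.
Time-normalized metrics: stars_per_day, forks_per_day
Log-transformed features: log_stars, log_forks, etc.

Interaction features
Project activity: stars_issues_interaction
User engagement: engagement_score

Together, these features capture different aspects of a project's actual state: its size, its age, and how recently it has been active.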
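To make the feature families above concrete, here is a minimal sketch in plain Python. Only the decay formula and the feature names come from the article; the function names, the `open_issues` input, the product used for `stars_issues_interaction`, and the smoothing constant in `ratio` are assumptions for illustration, and the actual definitions in `src/data.py` may differ.

```python
import math


def derive_features(stars, forks, open_issues, age_days, days_since_update,
                    decay_rates=(0.0001, 0.001, 0.01)):
    """Return derived features for a single project as a dict."""
    feats = {}
    # Time decay: a project updated long ago gets its star count discounted.
    for rate in decay_rates:
        feats[f"stars_decay_{rate}"] = stars * math.exp(-rate * days_since_update)
    # Time-normalized metrics; guard against division by zero for new repos.
    days = max(age_days, 1)
    feats["stars_per_day"] = stars / days
    feats["forks_per_day"] = forks / days
    # log1p compresses heavy-tailed counts and is defined at zero.
    feats["log_stars"] = math.log1p(stars)
    feats["log_forks"] = math.log1p(forks)
    # Assumed definition: a simple product of two counts.
    feats["stars_issues_interaction"] = stars * open_issues
    return feats


def ratio(value_a, value_b, eps=1.0):
    """Relative metric between project A and project B (smoothed, assumed form)."""
    return (value_a + eps) / (value_b + eps)


project = derive_features(stars=1000, forks=100, open_issues=10,
                          age_days=500, days_since_update=100)
print(round(project["stars_decay_0.01"], 1))
print(ratio(9, 4))
```

With a rate of 0.01, a repository untouched for 100 days keeps about 37% of its stars in the decayed feature, while a rate of 0.0001 barely discounts it, which is why several rates are computed side by side.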

Allan Niemerg fetches the content of every repository and submits it to GPT-4 for scoring (the function shown here passes in the README content). The scoring covers these dimensions: technical complexity, Web3 focus, developer tooling, project maturity, community size, enterprise readiness, community engagement, documentation quality, code quality, project prestige, corporate vs. community, security, innovation, performance, modularity, and accessibility.

This is Allan's prompt:

def analyze_project(readme_content, client):
    # client: an OpenAI-style client exposing chat.completions.create
    prompt = (
        'Rate this open source project on each dimension (1-5). '
        'Being decisive is better than being neutral. '
        'Use 3 only when there is no signal. '
        'Return JSON format: '
        '{"technical_complexity": N, "web3_focus": N, "developer_tool": N, '
        '"project_maturity": N, "community_size": N, "enterprise_ready": N, '
        '"community_engagement": N, "documentation": N, "code_quality": N, '
        '"status": N, "corporate": N, "security": N, "innovation": N, '
        '"performance": N, "modularity": N, "accessibility": N, '
        '"key_features": ["1-3 key features"]} '
        'Scale: 1=Very Low, 2=Low, 3=Medium/No Signal, 4=High, 5=Very High. '
        'Project Documentation: ' + readme_content
    )
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model="gpt-4",
        temperature=0.7,
    )
    # Only malformed responses are caught here; API errors still propagate.
    try:
        return response.choices[0].message.content
    except (AttributeError, IndexError) as e:
        return f"Error analyzing project: {str(e)}"
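The function above returns the model's reply as raw text, so it still has to be turned into numeric features before training. The sketch below shows one way that could be done; `scores_from_response`, `DIMENSIONS`, and the fallback policy are hypothetical and not taken from Allan's repository. Missing, non-numeric, or out-of-range values fall back to 3, which the prompt itself defines as "no signal".

```python
import json

# The 16 scored keys requested by the prompt ("key_features" is free text).
DIMENSIONS = [
    "technical_complexity", "web3_focus", "developer_tool", "project_maturity",
    "community_size", "enterprise_ready", "community_engagement", "documentation",
    "code_quality", "status", "corporate", "security", "innovation",
    "performance", "modularity", "accessibility",
]
NEUTRAL = 3  # the prompt's "Medium/No Signal" value


def scores_from_response(text):
    """Extract the dimension scores from a model reply as a dict of ints."""
    # The reply may wrap the JSON in extra prose, so cut out the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    data = {}
    if start != -1 and end > start:
        try:
            data = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            data = {}
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            value = NEUTRAL
        scores[dim] = int(round(value))
    return scores


reply = 'Sure: {"technical_complexity": 5, "web3_focus": 9, "security": "high"}'
print(scores_from_response(reply))
```

Because the error path of `analyze_project` returns a plain string starting with "Error analyzing project", that case also ends up as an all-neutral row rather than crashing the feature pipeline.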

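Feature engineering and training

davidgasquez augments the training set by swapping projects A and B in every pair. This mirroring doubles the number of training samples and pushes the model toward being insensitive to the order in which the two projects appear.

# Mirror the training data (assumes: import polars as pl)
df_train = pl.concat(
    [
        df_train,
        df_train.select(
            "id",
            pl.col("project_b").alias("project_a"),
            pl.col("project_a").alias("project_b"),
            pl.col("weight_b").alias("weight_a"),
            pl.col("weight_a").alias("weight_b"),
        ),
    ]
)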
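For readers who do not use polars, the same mirroring idea can be sketched with plain dictionaries. The column names match the snippet above; the helper name `mirror_pairs` and the sample row are made up for illustration.

```python
def mirror_pairs(rows):
    """Return the original rows plus a copy of each with A and B swapped."""
    mirrored = [
        {
            "id": row["id"],
            "project_a": row["project_b"],
            "project_b": row["project_a"],
            "weight_a": row["weight_b"],
            "weight_b": row["weight_a"],
        }
        for row in rows
    ]
    return rows + mirrored


rows = [
    {"id": 1, "project_a": "org/x", "project_b": "org/y",
     "weight_a": 0.7, "weight_b": 0.3},
]
augmented = mirror_pairs(rows)
print(len(augmented))
print(augmented[1])
```

One design point to keep in mind: a mirrored row carries the same `id` and the same information as its original, so when splitting data for validation it is safer to keep both copies in the same fold. Otherwise the validation score can look better than it really is.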

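With the data prepared, feature engineering begins, adding each group of features in turn:

df_train_full = (
    df_train
    .pipe(add_github_projects_data)   # basic GitHub features
    .pipe(extract_temporal_features)  # time-related features
    .pipe(extract_activity_features)  # activity features
    .pipe(add_target_encoding)        # target-encoding features
    .pipe(extract_ratio_features)     # ratio features
)

Once the features are in place, LightGBM feature importance is used to keep the most valuable features and reduce model complexity.

# Get the initial feature list
features = get_features()
X = df_train_full.select(features).to_numpy()
y = df_train_full.get_column("weight_a").to_numpy()

# Feature selection
selected_features = select_features(X, y, features)
X = df_train_full.select(selected_features).to_numpy()

# See the source code for the logic of these helper functions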
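The article does not show what `select_features` does internally. As a rough illustration of importance-based selection in general, the sketch below keeps the top-ranked features until they cover a chosen share of the total importance. The function name, the cumulative-share rule, and the sample numbers are all assumptions; the importance values themselves would come from a fitted model, for example the `feature_importances_` attribute of LightGBM's scikit-learn style estimators.

```python
def select_by_importance(names, importances, keep_fraction=0.95):
    """Keep the most important features that together reach keep_fraction
    of the total importance. Returns names in descending importance order."""
    total = float(sum(importances))
    if total <= 0:
        # No usable signal: keep everything rather than guess.
        return list(names)
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, importance in ranked:
        if cumulative >= keep_fraction * total:
            break
        selected.append(name)
        cumulative += importance
    return selected


names = ["log_stars", "stars_ratio", "age_days", "forks_per_day"]
importances = [50, 30, 15, 5]  # made-up values for illustration
print(select_by_importance(names, importances, keep_fraction=0.9))
```

A cumulative-share rule adapts to how concentrated the importance is, whereas a fixed top-k cutoff does not. Either choice is a judgment call that is best checked against cross-validation error.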

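Next, Optuna is used for Bayesian hyperparameter optimization. The tuned parameters include:

Tree structure (num_leaves)
Learning rate (learning_rate)
Sampling (feature_fraction, bagging_fraction)
Regularization (reg_alpha, reg_lambda)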
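To show what a search over these parameters looks like without requiring any extra packages, here is a dependency-free sketch. It deliberately swaps in plain random search for Optuna's sampler, so it is not Bayesian optimization. The ranges in `SEARCH_SPACE` and the `toy_objective` are illustrative assumptions, not the author's settings. In a real run the objective would return the cross-validated MSE of a LightGBM model trained with the sampled parameters.

```python
import math
import random

# name -> (low, high, kind); ranges are illustrative assumptions.
SEARCH_SPACE = {
    "num_leaves": (16, 256, "int"),
    "learning_rate": (1e-3, 0.3, "log"),
    "feature_fraction": (0.5, 1.0, "uniform"),
    "bagging_fraction": (0.5, 1.0, "uniform"),
    "reg_alpha": (1e-8, 10.0, "log"),
    "reg_lambda": (1e-8, 10.0, "log"),
}


def sample_params(rng):
    """Draw one parameter set; "log" samples uniformly in log space."""
    params = {}
    for name, (low, high, kind) in SEARCH_SPACE.items():
        if kind == "int":
            params[name] = rng.randint(low, high)
        elif kind == "log":
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            params[name] = rng.uniform(low, high)
    return params


def random_search(objective, n_trials=50, seed=0):
    """Return the parameter set with the lowest objective value."""
    rng = random.Random(seed)
    best_params, best_score = None, math.inf
    for _ in range(n_trials):
        params = sample_params(rng)
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    """Stand-in for cross-validated MSE (hypothetical, for demonstration only)."""
    lr_term = (math.log10(params["learning_rate"]) + 2) ** 2
    leaves_term = (params["num_leaves"] - 64) ** 2 / 1e4
    return lr_term + leaves_term


best_params, best_score = random_search(toy_objective, n_trials=200)
print(best_params["num_leaves"], best_score)
```

With Optuna, the same space would typically be declared inside the objective through `trial.suggest_int` and `trial.suggest_float` (with `log=True` for the learning rate and regularization terms), and the library would choose each new trial based on the results of earlier ones instead of sampling blindly.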

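After tuning, the model is validated with 5-fold cross-validation to check its stability and generalization.

# Cross-validation evaluation
mean_mse, std_mse = train_and_evaluate(X, y, best_params)

The mean MSE across folds indicates how well the model generalizes, and the standard deviation indicates how stable it is from fold to fold. The final model is then trained on the data and used to produce the final predictions.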
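The internals of `train_and_evaluate` are not shown in the article. The sketch below illustrates the general mechanics of k-fold evaluation with a mean and standard deviation of MSE, using only the standard library. The helper names and the trivial mean-predictor baseline are assumptions, and any model with a fit and predict step could be plugged in instead.

```python
import random
import statistics


def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(X, y, fit, predict, n_splits=5):
    """Return (mean, sample standard deviation) of per-fold MSE."""
    fold_mse = []
    for train_idx, val_idx in kfold_indices(len(y), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in val_idx])
        errors = [(p - y[i]) ** 2 for p, i in zip(preds, val_idx)]
        fold_mse.append(sum(errors) / len(errors))
    return statistics.mean(fold_mse), statistics.stdev(fold_mse)


# Baseline "model": always predict the mean training target.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, X_val):
    return [model] * len(X_val)


X = [[i] for i in range(20)]
y = [0.5 + 0.02 * (i % 5) for i in range(20)]
mean_mse, std_mse = cross_validate(X, y, fit_mean, predict_mean)
print(mean_mse, std_mse)
```

A baseline like the mean predictor is useful as a reference point: a tuned model whose mean MSE is not clearly below the baseline, or whose standard deviation is large relative to the gap, has not demonstrated much.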

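Allan Niemerg uses an XGBoost model to predict project preference. These are the key parameters he uses:

from xgboost import XGBRegressor

xgb_improved = XGBRegressor(
    max_depth=6,           # maximum tree depth; lets the model capture more complex patterns
    min_child_weight=1,    # allows smaller leaf nodes
    gamma=0,               # minimum loss reduction required to make a split
    subsample=0.8,         # each tree uses 80% of the rows, reducing overfitting
    colsample_bytree=0.8,  # each tree uses 80% of the features
    reg_alpha=0,           # L1 regularization
    reg_lambda=1,          # L2 regularization
    learning_rate=0.05,    # a lower learning rate makes training more stable
    n_estimators=200,      # number of trees; more trees can improve performance
    random_state=42,       # random seed, for reproducible results
)

As with davidgasquez's model, stability and generalization are the key criteria: the mean and standard deviation of the validation error are used to judge stability before the final model is trained. The parameters trade model capacity against overfitting. max_depth=6 bounds tree complexity; subsample and colsample_bytree are both 0.8, so each tree sees 80% of the rows and 80% of the features, which helps generalization; learning_rate=0.05 keeps each boosting step small; and n_estimators=200 sets the number of trees, balancing accuracy against compute cost. Regularization stays at reg_alpha=0 (no L1 penalty) and reg_lambda=1 (L2 penalty), and random_state=42 makes the run reproducible. Note that max_depth, min_child_weight, gamma, reg_alpha, and reg_lambda here match XGBoost's default values, so the configuration differs from the defaults mainly in the sampling ratios, the learning rate, and the number of trees.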

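Algorithm comparison

Allan Niemerg uses XGBoost, while davidgasquez uses LightGBM.

XGBoost and LightGBM are both gradient boosted decision tree (GBDT) algorithms, widely used for classification and regression.

XGBoost, created by Tianqi Chen, is known for efficiency, accuracy, and flexibility, and has been especially prominent in the Kaggle community. It builds trees sequentially, with each new tree fitted to reduce the remaining prediction error, and it supports L1 and L2 regularization to limit overfitting.

LightGBM, released by Microsoft, focuses on efficient handling of large datasets. It grows trees leaf-wise and uses histogram-based split finding, which substantially improves training speed and memory use, and it has built-in handling of sparse data and categorical features.

Each has strengths: XGBoost is valued for its accuracy and flexibility, while LightGBM tends to be more efficient on large datasets. Choose between them according to data size and task requirements.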
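Both libraries share the same core loop: fit a small tree to the current errors, add a scaled copy of it to the ensemble, and repeat. The toy sketch below shows that loop for squared-error loss on a single feature, using depth-1 trees (stumps). It is a teaching aid under simplifying assumptions and not how either library is implemented: there are no second-order gradients, no regularization, no histograms, and no leaf-wise growth. All function names are made up, and the input must contain at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosting(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict_boosting(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = list(range(10))
ys = [x ** 2 for x in xs]
for n in (0, 5, 50):
    print(n, mse(fit_boosting(xs, ys, n_estimators=n), xs, ys))
```

The training error shrinks as trees are added, which is the behavior that `n_estimators` and `learning_rate` control in both XGBoost and LightGBM: a smaller learning rate needs more trees to reach the same fit but usually overfits less abruptly.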

Other recommended models

For similar prediction tasks, consider:

CatBoost
Matrix Factorization
Deep Factorization Machine (DeepFM)
Neural Collaborative Filtering (NCF)

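Conclusion

Both approaches show the value of thorough feature engineering and careful model selection. davidgasquez's approach is strong on traditional metrics and data augmentation, while Allan Niemerg adds a new dimension to project evaluation by integrating GPT-4 analysis. If these two solutions have given you an idea of your own, you can join DeepFunding's mini-contest through the links below: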

https://huggingface.co/spaces/DeepFunding/PredictiveFundingChallengeforOpenSourceDependencies

https://cryptopond.xyz/modelfactory/detail/306250?tab=0

You are also welcome to join DeepFunding 中文力量, the Chinese-language community jointly launched by LXDAO and ETHPanda, to discuss Deep Funding topics:

https://t.me/deepfundingcn

