116 changes: 116 additions & 0 deletions TODO_NEXT_VERSION.md
@@ -0,0 +1,116 @@
# PaperMind Version Planning TODO

## Next Version (v1.1.0) - Key Features

### 🎯 Core Feature: IEEE Article Retrieval

**Background**: PaperMind currently only supports fetching articles from arXiv; it needs to be extended to retrieve articles from the IEEE Xplore platform.

---

### 📋 Task List

#### 1. IEEE API Onboarding
- [ ] Apply for an IEEE Xplore API Key (contact `onlinesupport@ieee.org`)
- [ ] Confirm institutional subscription status (if any)
- [ ] Read and understand the [IEEE API Terms of Service](https://developer.ieee.org/API_Terms_of_Use2)
- [ ] Test API connectivity and basic queries

#### 2. Technical Implementation
- [ ] Design a multi-source architecture (one unified interface covering arXiv, IEEE, etc.)
- [ ] Implement the IEEE API client module
  - [ ] Metadata search
  - [ ] Open Access full-text download
  - [ ] DOI resolution
- [ ] Integrate third-party open resources
  - [ ] Unpaywall API (open full-text retrieval)
  - [ ] Semantic Scholar API (supplementary metadata)
  - [ ] TechRxiv preprint search
- [ ] Unified data model (compatible with paper formats from different sources)

#### 3. Compliance and Risk Control
- [ ] Implement request rate limiting (to avoid IP bans)
- [ ] Add detection of the user's subscription status
- [ ] Separate the handling logic for Open Access vs. paywalled articles
- [ ] Write compliance usage documentation
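The rate-limiting item could start from a minimal sliding-window limiter like the one below. This is a single-threaded sketch only: `max_calls` and `period` are placeholders (the real quota must be taken from the IEEE API terms), and an async variant would be needed inside the FastAPI service:

```python
import time
from collections import deque


class RateLimiter:
    """Sliding-window limiter: at most `max_calls` calls per `period` seconds."""

    def __init__(self, max_calls: int, period: float) -> None:
        self.max_calls = max_calls
        self.period = period
        self._calls: deque[float] = deque()

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            time.sleep(max(0.0, self.period - (now - self._calls[0])))
            self._calls.popleft()
        self._calls.append(time.monotonic())
```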

#### 4. Testing and Validation
- [ ] Unit tests (API client)
- [ ] Integration tests (end-to-end flow)
- [ ] Manual tests (downloading real IEEE articles)
- [ ] Performance tests (bulk-query scenarios)

#### 5. Documentation Updates
- [ ] User docs: how to configure an IEEE API Key
- [ ] Developer docs: multi-source architecture design notes
- [ ] Update the README.md feature list
- [ ] Write an FAQ

---

### 📊 Technical Options Compared

| Option | Description | Feasibility | Priority |
|------|------|--------|--------|
| IEEE official API (metadata) | Fetch article info via the official API | ⭐⭐⭐⭐⭐ | P0 |
| IEEE API + Open Access | Download Open Access full texts | ⭐⭐⭐⭐ | P0 |
| Unpaywall integration | Look up open versions by DOI | ⭐⭐⭐⭐ | P1 |
| Institutional subscription access | Download papers via the user's own subscription | ⭐⭐⭐ | P1 |
| Preprint search | TechRxiv/arXiv preprints | ⭐⭐⭐ | P2 |

---

### ⚠️ Risks and Caveats

1. **Legal compliance**:
   - No bulk downloading of paywalled articles
   - No redistribution of IEEE content
   - Non-commercial use only

2. **Technical risks**:
   - API Key applications may take time to be approved
   - Full-text access is limited without an institutional subscription
   - Anti-scraping measures must be handled (if unofficial routes are used)

3. **Managing user expectations**:
   - Make it clear that users need their own institutional subscription
   - The share of Open Access articles is limited (roughly 10-20%)
   - Suggest alternatives (preprints, open resources)

---

### 🔗 References

- [IEEE Xplore API documentation](https://developer.ieee.org/docs/read/IEEE_Xplore_Metadata_API_Overview)
- [IEEE API allowed uses](https://developer.ieee.org/Allowed_API_Uses)
- [Unpaywall API](https://unpaywall.org/products/api)
- [Semantic Scholar API](https://www.semanticscholar.org/product/api)
- [TechRxiv preprint platform](https://www.techrxiv.org/)

---

### 📅 Estimated Timeline

| Phase | Time | Milestone |
|------|------|--------|
| API application and research | Week 1 | API Key obtained, technical validation done |
| Core development | Week 2-3 | IEEE client finished, basic features working |
| Integration and testing | Week 4 | Multi-source integration complete, tests passing |
| Docs and release | Week 5 | Documentation finalized, version released |

---

### 📝 Research Summary

See the research notes (2026-03-03) for details:
- arXiv vs. IEEE access model comparison
- IEEE API technical details and limits
- Legal compliance analysis
- Recommended implementation plan

**Key conclusion**: technically entirely feasible; prefer the compliant approach of the official API plus multiple open resources, and avoid brute-force scraping.

---

*Last updated: 2026-03-03*
*Created by: 老白*
160 changes: 157 additions & 3 deletions apps/api/routers/papers.py
@@ -88,6 +88,90 @@ def recommended_papers(top_k: int = Query(default=10, ge=1, le=50)) -> dict:
    return {"items": RecommendationService().recommend(top_k=top_k)}


@router.post("/papers/search-multi")
async def search_multi(
    query: str,
    channels: list[str] = Query(default=["arxiv"]),
    max_results_per_channel: int = Query(default=50, ge=1, le=100),
    topic_id: str | None = Query(default=None),
) -> dict:
    """Search for papers across multiple channels in parallel."""
    import asyncio
    import logging

    from packages.integrations.aggregator import ResultAggregator
    from packages.integrations.registry import ChannelRegistry

    logger = logging.getLogger(__name__)

    ChannelRegistry.register_default_channels()

    async def fetch_channel(ch: str) -> tuple[str, list, dict]:
        try:
            channel = ChannelRegistry.get(ch)
            if not channel:
                return ch, [], {"error": "channel not found"}
            papers = await asyncio.to_thread(channel.fetch, query, max_results_per_channel)
            return ch, papers, {"total": len(papers)}
        except Exception as exc:  # noqa: BLE001
            logger.warning("Channel %s failed: %s", ch, exc)
            return ch, [], {"error": str(exc)}

    tasks = [fetch_channel(ch) for ch in channels]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    aggregator = ResultAggregator()
    channel_stats: dict[str, dict[str, int | str]] = {}

    for result in results:
        if isinstance(result, Exception):
            logger.error("Channel task failed: %s", result)
            continue
        ch, papers, meta = result
        channel_stats[ch] = {"total": 0, "new": 0, "duplicates": 0}
        if "error" in meta:
            channel_stats[ch]["error"] = meta["error"]
        else:
            channel_stats[ch]["total"] = meta.get("total", 0)
            aggregator.add_results(ch, papers, meta)

    aggregated = aggregator.get_sorted_results()

    return {
        "papers": [
            {
                "id": f"temp-{i}",
                "title": r.paper.title,
                "authors": r.paper.metadata.get("authors", []),
                "year": r.paper.publication_date.year if r.paper.publication_date else None,
                "venue": r.paper.metadata.get("venue"),
                "abstract": r.paper.abstract,
                "sources": r.sources,
            }
            for i, r in enumerate(aggregated)
        ],
        "channel_stats": channel_stats,
    }


@router.get("/papers/suggest-channels")
def suggest_channels(query: str) -> dict:
    """Suggest suitable channels for the given keywords."""
    from packages.integrations.registry import ChannelRegistry
    from packages.worker.smart_router import suggest_channels as get_suggestion

    ChannelRegistry.register_default_channels()
    available = ChannelRegistry.list_channels()

    recommended, alternatives, reasoning = get_suggestion(query, available)

    return {
        "recommended": recommended,
        "alternatives": alternatives,
        "reasoning": reasoning,
    }


@router.get("/papers/proxy-arxiv-pdf/{arxiv_id:path}")
async def proxy_arxiv_pdf(arxiv_id: str):
    """Proxy access to arXiv PDFs (works around CORS)."""
@@ -119,10 +203,10 @@ async def proxy_arxiv_pdf(arxiv_id: str):
                "Cache-Control": "public, max-age=3600",
            },
        )
-    except httpx.TimeoutException:
-        raise HTTPException(status_code=504, detail="arXiv request timed out")
+    except httpx.TimeoutException as err:
+        raise HTTPException(status_code=504, detail="arXiv request timed out") from err
     except httpx.RequestError as exc:
-        raise HTTPException(status_code=500, detail=f"arXiv access failed: {str(exc)}")
+        raise HTTPException(status_code=500, detail=f"arXiv access failed: {str(exc)}") from exc


@router.get("/papers/{paper_id}")
@@ -429,3 +513,73 @@ def paper_reasoning(paper_id: UUID) -> dict:
    except ValueError as exc:
        raise HTTPException(status_code=404, detail=str(exc)) from exc
    return ReasoningService().analyze(paper_id)


# ========== IEEE channel routes (new in the MVP phase) ==========


@router.post("/papers/ingest/ieee")
def ingest_ieee_papers(
    query: str = Query(..., min_length=1, max_length=500, description="IEEE search keywords"),
    max_results: int = Query(default=20, ge=1, le=100, description="Maximum number of results"),
    topic_id: str | None = Query(default=None, description="Optional topic ID"),
) -> dict:
    """
    [MVP] IEEE paper ingestion endpoint.

    Notes:
    - Requires an IEEE API Key (set IEEE_API_KEY in .env)
    - Triggered manually; does not affect the existing arXiv flow
    - IEEE PDF download is not supported yet

    Args:
        query: IEEE search keywords
        max_results: maximum number of results (default 20)
        topic_id: optional topic ID

    Returns:
        dict: {status, total_fetched, inserted_ids, new_count}

    Example:
    ```bash
    curl -X POST "http://localhost:8002/papers/ingest/ieee?query=deep+learning&max_results=10"
    ```
    """
    import logging

    from packages.ai.pipelines import PaperPipelines
    from packages.domain.enums import ActionType

    logger = logging.getLogger(__name__)
    pipelines = PaperPipelines()

    try:
        total, inserted_ids, new_count = pipelines.ingest_ieee(
            query=query,
            max_results=max_results,
            topic_id=topic_id,
            action_type=ActionType.manual_collect,
        )

        return {
            "status": "success",
            "total_fetched": total,
            "inserted_ids": inserted_ids,
            "new_count": new_count,
            "message": f"✅ IEEE ingestion complete: {new_count} new papers",
        }

    except RuntimeError as exc:
        # IEEE API Key not configured
        logger.error("IEEE ingestion failed: %s", exc)
        raise HTTPException(
            status_code=503,
            detail=f"IEEE service unavailable: {str(exc)}. Set the IEEE_API_KEY environment variable in .env.",
        ) from exc

    except Exception as exc:
        logger.error("IEEE ingestion failed: %s", exc)
        raise HTTPException(
            status_code=500,
            detail=f"IEEE ingestion failed: {str(exc)}",
        ) from exc