Cron Job Failures in GAE Python

华科云商小徐

Published 2024-12-02 11:45:26

Included in the column: 小徐学爬虫

On Google App Engine (GAE), a Cron Job in a Python application can fail for several reasons. The steps below walk through diagnosing and fixing one such failure.

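1. Background

A Google App Engine application has a script that cron.yaml runs every 20 minutes. The script works locally and when its URL is requested by hand, but when cron.yaml starts it in production it never runs to completion.

The logs show no errors, only two debug messages: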

D 2013-07-23 06:00:08.449
type(soup): <class 'bs4.BeautifulSoup'> END type(soup)

D 2013-07-23 06:00:11.246
type(soup): <class 'bs4.BeautifulSoup'> END type(soup)

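2. Solution

2.1 Analyzing the problem

The script contains two nested for loops, which is a likely source of the trouble. A cron job runs as an ordinary request on an App Engine instance. Instances are short-lived and every request has a deadline, so a long-running request can be cut off before the script finishes.

When an instance is shut down, everything it was running stops with it, including the cron job in progress. The script never completes, and the run is recorded as failed.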
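One way to check this hypothesis is to log progress inside the nested loops and stop cleanly once a time budget is used up, so the logs show how far each run got. The following is a minimal Python 3 sketch under that assumption; run_with_budget, links_for, process and the budget value are hypothetical names for illustration and are not part of the original script.

```python
import logging
import time


def run_with_budget(tickers, links_for, process, budget_seconds, clock=time.monotonic):
    """Run the nested ticker/link loops, stopping once the time budget is used up.

    Returns (done, remaining): the (ticker, link) pairs that were processed and
    the tickers that were not fully processed.
    """
    start = clock()
    done = []
    for index, ticker in enumerate(tickers):
        for link in links_for(ticker):
            # Check the budget before each unit of work, not after the whole loop.
            if clock() - start >= budget_seconds:
                logging.warning("budget exhausted at ticker %s", ticker)
                return done, list(tickers[index:])
            process(ticker, link)
            done.append((ticker, link))
            logging.debug("processed %s %s", ticker, link)
    return done, []
```

The remaining tickers can be handed to a later run or to a queue. The clock parameter exists only so the function can be exercised with a fake clock.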

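2.2 A better approach

To fix this, the work has to finish before the instance is shut down. One way is to hand the work to Cloud Tasks. Cloud Tasks is a fully managed service for queuing tasks and dispatching them to handlers such as App Engine instances. The code in this article uses google.appengine.api.taskqueue, the Task Queue API bundled with the first-generation Python 2.7 runtime and the predecessor of Cloud Tasks.

2.3 Using Cloud Tasks

The steps below schedule the script as a task: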

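In queue.yaml, next to app.yaml, define the queue. The rate limits how fast tasks are dispatched (1/s here is an example value); the 20-minute schedule stays in cron.yaml, and the target URL is set when the task is added:

queue:
- name: scrape-task
  rate: 1/s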

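In your script, add the following code:

def scrape():
    # /scrape-worker is a hypothetical URL: route it to the handler that does the scraping.
    # It must differ from the cron URL /scrape, or each task would enqueue another.
    taskqueue.add(queue_name='scrape-task', url='/scrape-worker', method='GET')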

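Deploy your application.

Now, when the cron job fires, it only enqueues a task that runs your script. The task runs in its own request on an App Engine instance, and the queue retries it if it fails, so the work can still complete when one attempt is cut short.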
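The hand-off works best when each task is small. Below is a rough Python 3 sketch of splitting the nested loops into one task per ticker. InMemoryQueue is only a stand-in used for illustration and is not an App Engine API; the /scrape-worker URL and the ticker parameter name are likewise assumptions.

```python
from collections import deque


class InMemoryQueue:
    """Stand-in for a task queue; it only records tasks and hands them back in order."""

    def __init__(self):
        self._tasks = deque()

    def add(self, url, params):
        self._tasks.append({"url": url, "params": dict(params)})

    def pop(self):
        return self._tasks.popleft() if self._tasks else None

    def __len__(self):
        return len(self._tasks)


def enqueue_scrape_tasks(queue, tickers, worker_url="/scrape-worker"):
    """What the cron handler would do: enqueue one short task per ticker and return."""
    for ticker in tickers:
        queue.add(worker_url, {"ticker": ticker})
    return len(tickers)


def drain(queue, handle):
    """What the workers would do: process tasks one at a time until the queue is empty."""
    results = []
    task = queue.pop()
    while task is not None:
        results.append(handle(task["params"]["ticker"]))
        task = queue.pop()
    return results
```

With this split the cron request returns almost immediately, and a failure while scraping one ticker affects only that ticker's task.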

2.4 Code example

Below is the modified scrape.py script:

import jinja2, webapp2, urllib2, re

from bs4 import BeautifulSoup as bs
from google.appengine.api import memcache, taskqueue
from google.appengine.ext import db

class Article(db.Model):
    content = db.TextProperty()
    datetime = db.DateTimeProperty(auto_now_add=True)
    companies = db.ListProperty(db.Key)
    url = db.StringProperty()

class Company(db.Model):
    name = db.StringProperty()
    ticker = db.StringProperty()

    @property
    def articles(self):
        return Article.gql("WHERE companies = :1", self.key())

def companies_key(companies_name=None):
    return db.Key.from_path('Companies', companies_name or 'default_companies')

def articles_key(articles_name=None):
    return db.Key.from_path('Articles', articles_name or 'default_articles')

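def scrape():
    # /scrape-worker is a hypothetical URL: route it to the handler that does the scraping.
    # It must differ from the cron URL /scrape, or each task would enqueue another.
    taskqueue.add(queue_name='scrape-task', url='/scrape-worker', method='GET')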

def fetch(link):
    # Download a page and return its visible text, or the string "None" on failure.
    try:
        # The original passed an undefined name `url` here; the parameter is `link`.
        html = urllib2.urlopen(link).read()
        soup = bs(html)
    except Exception:
        return "None"
    # get_text() already returns unicode, so no encode/decode round trip is needed.
    return soup.get_text()

def links(ticker):
    # Collect the first http:// link from each news section of the ticker's page.
    url = "https://www.google.com/finance/company_news?q=NASDAQ:" + ticker + "&start=10&num=10"
    html = urllib2.urlopen(url).read()
    soup = bs(html)
    div_class = re.compile("^g-section.*")
    divs = soup.find_all("div", {"class": div_class})
    found = []
    for div in divs:
        a = div.find('a', attrs={'href': re.compile("^http://")})
        # Read the href attribute directly instead of regex-matching the tag's HTML;
        # sections without a matching link still yield the string "None".
        found.append(a['href'] if a is not None else "None")

    return found

...and the script's handler in main:
class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        scrape.scrape()
        self.redirect("/")
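The listing above targets the Python 2.7 runtime (urllib2, webapp2). For readers on Python 3, the link extraction done in links can be tried offline with the standard library alone. This is a sketch, not a port of the author's code: SectionLinkParser and extract_links are hypothetical names, it matches the class attribute by prefix rather than through bs4, and unlike the original it skips sections that have no matching link instead of appending "None".

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect the first http:// link inside each <div class="g-section...">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0       # nesting depth inside the current g-section div
        self._found = False   # whether the current section already produced a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._depth > 0:
                self._depth += 1
            elif (attrs.get("class") or "").startswith("g-section"):
                self._depth = 1
                self._found = False
        elif tag == "a" and self._depth > 0 and not self._found:
            href = attrs.get("href") or ""
            if href.startswith("http://"):
                self.links.append(href)
                self._found = True

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1


def extract_links(html):
    """Return the first http:// link of every g-section div in the given HTML string."""
    parser = SectionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because it takes an HTML string rather than a URL, the function can be run against a saved copy of the page without any network access.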

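2.5 Other considerations

Cloud Tasks is not the only option. For example, you can use Cloud Scheduler to trigger the job on a schedule, or run the scraper on Cloud Run as a serverless service.

Choose whichever approach fits your application best.

With the steps above, Cron Job problems on GAE can usually be resolved quickly.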

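Original-work statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, contact cloudcommunity@tencent.com to request removal.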
