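Information Content Security course project: Search Engine
Course project for the network information content acquisition unit of the Information Content Security course
Platform: cross-platform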
JDK 1.8.0
Elasticsearch 7.4.0
Python 3.6 or later
> pip install paddlepaddle numpy elasticsearch
> pip install requests bs4
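Download Elasticsearch from https://www.elastic.co/cn/downloads/elasticsearch and extract it; search online for detailed steps.
Install the IK Chinese analyzer from https://github.com/medcl/elasticsearch-analysis-ik/releases (use the release that matches your Elasticsearch version); search online for detailed steps.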
Create the index (Elasticsearch must be running; 9200 is its default HTTP port)
PUT http://127.0.0.1:9200/page
{
  "settings": {
    "number_of_shards": "5",
    "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "weight": {
        "type": "double"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "content_type": {
        "type": "text"
      },
      "url": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "update_date": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
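The same request can be sent from Python instead of a REST client. The following is a minimal sketch using only the standard library. It assumes Elasticsearch listens on its default HTTP port 9200; the names `build_index_body` and `create_index` are hypothetical helpers and not part of this project.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"  # assumption: default Elasticsearch HTTP port
INDEX = "page"


def build_index_body():
    """Return the settings and mappings shown above as a Python dict."""
    text_ik = {"type": "text", "analyzer": "ik_max_word"}
    return {
        "settings": {"number_of_shards": "5", "number_of_replicas": "0"},
        "mappings": {
            "properties": {
                "title": dict(text_ik),
                "weight": {"type": "double"},
                "content": dict(text_ik),
                "content_type": {"type": "text"},
                "url": dict(text_ik),
                "update_date": {
                    "type": "date",
                    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis",
                },
            }
        },
    }


def create_index(es_url=ES_URL, index=INDEX):
    """Send PUT <es_url>/<index> with the JSON body and return the parsed reply."""
    request = urllib.request.Request(
        url=f"{es_url}/{index}",
        data=json.dumps(build_index_body()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Uncomment once Elasticsearch is running and the IK plugin is installed:
# print(create_index())
```

If the IK plugin is missing, Elasticsearch is expected to reject the request because the `ik_max_word` analyzer is unknown, so a failure here usually points back to the plugin installation step.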
Start Elasticsearch by running bin/elasticsearch in bash,
or bin\elasticsearch.bat in Windows cmd or PowerShell
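Elasticsearch can take some time to start, and the later steps need it to be reachable. A small check like the one below may help; it is only a sketch, assuming the default address http://127.0.0.1:9200, and `elasticsearch_is_up` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_is_up(base_url="http://127.0.0.1:9200", timeout=2.0):
    """Return True if GET <base_url> answers with JSON containing a "version" key."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            info = json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, malformed URL, or a non-JSON reply
        return False
    return "version" in info
```

Calling `elasticsearch_is_up()` in a loop with a short sleep is one way to wait before creating the index or starting the crawler.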
> cd WebApp
> java -jar *.jar
> cd DataCrawler
> python crawler.py
> cd DataProcess
> python PageRank.py
> cd DataProcess/Text_Classification
> python Classify.py
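This README does not show what PageRank.py does internally. As a rough illustration of the kind of computation such a step usually performs, here is a minimal power-iteration sketch of PageRank over a link graph. The function name, the damping factor of 0.85, the iteration count, and the idea that the result would be stored in the `weight` field are all assumptions, not taken from the project.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping every page to a score; scores sum to 1.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph used only for illustration
example = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this example page "c" ends up ranked above page "b", since it receives links from both "a" and "b".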
Visit http://127.0.0.1:80