- Scheduler - Interface in us.codecraft.webmagic.scheduler
-
Scheduler is the component that manages URLs.
Implement the Scheduler interface to:
manage the URLs to fetch
remove duplicate URLs
- scheduler - Variable in class us.codecraft.webmagic.Spider
-
- scheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
-
Deprecated.
- select(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
-
- select(Selector) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
-
- select(String) - Method in class us.codecraft.webmagic.selector.AndSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
-
- select(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
-
- select(Element) - Method in interface us.codecraft.webmagic.selector.ElementSelector
-
Extract a single result in text.
If there is more than one result, only the first is chosen.
- select(Selector) - Method in class us.codecraft.webmagic.selector.HtmlNode
-
- select(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
-
- select(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.OrSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- select(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
-
- select(Selector) - Method in interface us.codecraft.webmagic.selector.Selectable
-
Extract by custom selector.
- select(String) - Method in interface us.codecraft.webmagic.selector.Selector
-
Extract a single result in text.
If there is more than one result, only the first is chosen.
- select(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
-
- select(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
-
- Selectable - Interface in us.codecraft.webmagic.selector
-
Selectable text.
- selectDocument(Selector) - Method in class us.codecraft.webmagic.selector.Html
-
- selectDocumentForList(Selector) - Method in class us.codecraft.webmagic.selector.Html
-
- selectElement(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
-
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
-
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
-
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
-
- selectElement(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
-
- selectElements(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
-
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
-
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
-
- selectElements(BaseElementSelector) - Method in class us.codecraft.webmagic.selector.HtmlNode
-
Select elements.
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
-
- selectElements(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
-
- selectGroup(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- selectGroupList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- selectList(Selector, List<String>) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
-
- selectList(Selector) - Method in class us.codecraft.webmagic.selector.AbstractSelectable
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.AndSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.BaseElementSelector
-
- selectList(Element) - Method in class us.codecraft.webmagic.selector.CssSelector
-
- selectList(Element) - Method in interface us.codecraft.webmagic.selector.ElementSelector
-
Extract all results in text.
- selectList(Selector) - Method in class us.codecraft.webmagic.selector.HtmlNode
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.JsonPathSelector
-
- selectList(Element) - Method in class us.codecraft.webmagic.selector.LinksSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.OrSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.RegexSelector
-
- selectList(String) - Method in class us.codecraft.webmagic.selector.ReplaceSelector
-
- selectList(Selector) - Method in interface us.codecraft.webmagic.selector.Selectable
-
Extract by custom selector.
- selectList(String) - Method in interface us.codecraft.webmagic.selector.Selector
-
Extract all results in text.
- selectList(String) - Method in class us.codecraft.webmagic.selector.SmartContentSelector
-
- selectList(Element) - Method in class us.codecraft.webmagic.selector.XpathSelector
-
- Selector - Interface in us.codecraft.webmagic.selector
-
Selector (extractor) for text.
- Selectors - Class in us.codecraft.webmagic.selector
-
Convenience methods for selectors.
- Selectors() - Constructor for class us.codecraft.webmagic.selector.Selectors
-
- setAcceptStatCode(Set<Integer>) - Method in class us.codecraft.webmagic.Site
-
Set acceptStatCode.
When the status code of the HTTP response is in acceptStatCodes, the response will be processed.
{200} by default.
It does not need to be set.
- setBinaryContent(boolean) - Method in class us.codecraft.webmagic.Request
-
- setBody(byte[]) - Method in class us.codecraft.webmagic.model.HttpRequestBody
-
- setBytes(byte[]) - Method in class us.codecraft.webmagic.Page
-
- setCharset(String) - Method in class us.codecraft.webmagic.Page
-
- setCharset(String) - Method in class us.codecraft.webmagic.Request
-
- setCharset(String) - Method in class us.codecraft.webmagic.Site
-
Set the charset of the page manually.
When the charset is not set or is set to null, it can be auto-detected from the HTTP header.
- setContentType(String) - Method in class us.codecraft.webmagic.model.HttpRequestBody
-
- setCycleRetryTimes(int) - Method in class us.codecraft.webmagic.Site
-
Set cycleRetryTimes, the number of cycle retries when a download fails; 0 by default.
- setDisableCookieManagement(boolean) - Method in class us.codecraft.webmagic.Site
-
The downloader is supposed to store response cookies.
- setDomain(String) - Method in class us.codecraft.webmagic.Site
-
Set the domain of the site.
- setDownloader(Downloader) - Method in class us.codecraft.webmagic.Spider
-
Set the downloader of the spider.
- setDownloadSuccess(boolean) - Method in class us.codecraft.webmagic.Page
-
- setDuplicateRemover(DuplicateRemover) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
-
- setEmptySleepTime(int) - Method in class us.codecraft.webmagic.Spider
-
Set the wait time when no URL is polled.
- setEncoding(String) - Method in class us.codecraft.webmagic.model.HttpRequestBody
-
- setExecutorService(ExecutorService) - Method in class us.codecraft.webmagic.Spider
-
- setExecutorService(ExecutorService) - Method in class us.codecraft.webmagic.thread.CountableThreadPool
-
- setExitWhenComplete(boolean) - Method in class us.codecraft.webmagic.Spider
-
Exit when complete.
- setExtras(Map<String, Object>) - Method in class us.codecraft.webmagic.Request
-
- setHeaders(Map<String, List<String>>) - Method in class us.codecraft.webmagic.Page
-
- setHtml(Html) - Method in class us.codecraft.webmagic.Page
-
- setHttpClientContext(HttpClientContext) - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
-
- setHttpUriRequest(HttpUriRequest) - Method in class us.codecraft.webmagic.downloader.HttpClientRequestContext
-
- setHttpUriRequestConverter(HttpUriRequestConverter) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
-
- setMethod(String) - Method in class us.codecraft.webmagic.Request
-
- setPath(String) - Method in class us.codecraft.webmagic.utils.FilePersistentBase
-
- setPipelines(List<Pipeline>) - Method in class us.codecraft.webmagic.Spider
-
Set the pipelines for the Spider.
- setPoolSize(int) - Method in class us.codecraft.webmagic.downloader.HttpClientGenerator
-
- setPriority(long) - Method in class us.codecraft.webmagic.Request
-
Set the priority of the request for sorting.
Requires a scheduler that supports priority.
- setProxyProvider(ProxyProvider) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
-
- setRawText(String) - Method in class us.codecraft.webmagic.Page
-
- setRequest(Request) - Method in class us.codecraft.webmagic.Page
-
- setRequest(Request) - Method in class us.codecraft.webmagic.ResultItems
-
- setRequestBody(HttpRequestBody) - Method in class us.codecraft.webmagic.Request
-
- setRetrySleepTime(int) - Method in class us.codecraft.webmagic.Site
-
Set the retry sleep time when a download fails, 1000 by default.
- setRetryTimes(int) - Method in class us.codecraft.webmagic.Site
-
Set the number of retries when a download fails, 0 by default.
- setScheduler(Scheduler) - Method in class us.codecraft.webmagic.Spider
-
Set the scheduler for the Spider.
- setSkip(boolean) - Method in class us.codecraft.webmagic.Page
-
- setSkip(boolean) - Method in class us.codecraft.webmagic.ResultItems
-
Set whether to skip the result.
A skipped result will not be processed by the Pipeline.
- setSleepTime(int) - Method in class us.codecraft.webmagic.Site
-
Set the interval between the processing of two pages.
Time unit is milliseconds.
- setSpawnUrl(boolean) - Method in class us.codecraft.webmagic.Spider
-
Whether to add extracted URLs for download.
When true, extracted URLs are added for download; when false, only the seed URLs are downloaded.
- setSpiderListeners(List<SpiderListener>) - Method in class us.codecraft.webmagic.Spider
-
- setStatusCode(int) - Method in class us.codecraft.webmagic.Page
-
- setThread(int) - Method in interface us.codecraft.webmagic.downloader.Downloader
-
Tell the downloader how many threads the spider uses.
- setThread(int) - Method in class us.codecraft.webmagic.downloader.HttpClientDownloader
-
- setTimeOut(int) - Method in class us.codecraft.webmagic.Site
-
Set the timeout for the downloader, in milliseconds.
- setUrl(Selectable) - Method in class us.codecraft.webmagic.Page
-
- setUrl(String) - Method in class us.codecraft.webmagic.Request
-
- setUseGzip(boolean) - Method in class us.codecraft.webmagic.Site
-
Whether to use gzip.
- setUserAgent(String) - Method in class us.codecraft.webmagic.Site
-
Set the user agent.
- setUUID(String) - Method in class us.codecraft.webmagic.Spider
-
Set a UUID for the spider.
The default UUID is the domain of the site.
- shouldReserved(Request) - Method in class us.codecraft.webmagic.scheduler.DuplicateRemovedScheduler
-
- shutdown() - Method in class us.codecraft.webmagic.thread.CountableThreadPool
-
- SimplePageProcessor - Class in us.codecraft.webmagic.processor
-
A simple PageProcessor.
- SimplePageProcessor(String) - Constructor for class us.codecraft.webmagic.processor.SimplePageProcessor
-
- SimpleProxyProvider - Class in us.codecraft.webmagic.proxy
-
A simple ProxyProvider.
- SimpleProxyProvider(List<Proxy>) - Constructor for class us.codecraft.webmagic.proxy.SimpleProxyProvider
-
- Site - Class in us.codecraft.webmagic
-
Object containing the settings for the crawler.
- Site() - Constructor for class us.codecraft.webmagic.Site
-
- site - Variable in class us.codecraft.webmagic.Spider
-
- sleep(int) - Method in class us.codecraft.webmagic.Spider
-
- smartContent() - Method in class us.codecraft.webmagic.selector.HtmlNode
-
- smartContent() - Method in class us.codecraft.webmagic.selector.PlainText
-
- smartContent() - Method in interface us.codecraft.webmagic.selector.Selectable
-
Select smart content with the Readability algorithm.
- smartContent() - Static method in class us.codecraft.webmagic.selector.Selectors
-
- SmartContentSelector - Class in us.codecraft.webmagic.selector
-
Borrowed from https://code.google.com/p/cx-extractor/
- SmartContentSelector() - Constructor for class us.codecraft.webmagic.selector.SmartContentSelector
-
- sourceTexts - Variable in class us.codecraft.webmagic.selector.PlainText
-
- spawnUrl - Variable in class us.codecraft.webmagic.Spider
-
- Spider - Class in us.codecraft.webmagic
-
Entry point of a crawler.
A spider contains four modules: Downloader, Scheduler, PageProcessor and
Pipeline.
Every module is a field of Spider.
- Spider(PageProcessor) - Constructor for class us.codecraft.webmagic.Spider
-
Create a spider with a pageProcessor.
- Spider.Status - Enum in us.codecraft.webmagic
-
- SpiderListener - Interface in us.codecraft.webmagic
-
Listener of Spider on page processing.
- start() - Method in class us.codecraft.webmagic.Spider
-
- startRequest(List<Request>) - Method in class us.codecraft.webmagic.Spider
-
Set the startUrls of the Spider.
Takes priority over the startUrls of the Site.
- startRequests - Variable in class us.codecraft.webmagic.Spider
-
- startUrls(List<String>) - Method in class us.codecraft.webmagic.Spider
-
Set the startUrls of the Spider.
Takes priority over the startUrls of the Site.
- stat - Variable in class us.codecraft.webmagic.Spider
-
- STAT_INIT - Static variable in class us.codecraft.webmagic.Spider
-
- STAT_RUNNING - Static variable in class us.codecraft.webmagic.Spider
-
- STAT_STOPPED - Static variable in class us.codecraft.webmagic.Spider
-
- StatusCode() - Constructor for class us.codecraft.webmagic.utils.HttpConstant.StatusCode
-
- stop() - Method in class us.codecraft.webmagic.Spider
-