Robots.txt - Everything SEOs Need to Know - Deepcrawl
Rachel Costello
26 min read

In this section of our guide to robots directives, we'll go into more detail about the robots.txt text file and how it can be used to instruct search engine web crawlers. This file is especially useful for managing crawl budget and making sure search engines are spending their time on your site efficiently, crawling only the important pages.

What is a robots.txt file used for?

The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your website. This is important to help them avoid crawling low quality pages, or getting stuck in crawl traps where an infinite number of URLs could potentially be created, for example, a calendar section which creates a new URL for every day.

As Google explains in their robots.txt specifications guide, the file format should be plain text encoded in UTF-8. The file's records (or lines) should be separated by CR, CR/LF or LF.

You should be mindful of the size of a robots.txt file, as search engines have their own maximum file size limits. The maximum size for Google is 500KB.

Where should the robots.txt exist?

The robots.txt should always exist on the root of the domain, for example: https://www.example.com/robots.txt

This file is specific to the protocol and full domain, so the robots.txt on https://www.example.com does not impact the crawling of http://www.example.com or https://subdomain.example.com; these should have their own robots.txt files.

When should you use robots.txt rules?

In general, websites should try to utilise the robots.txt as little as possible to control crawling. Improving your website's architecture and making it clean and accessible for crawlers is a much better solution. However, using robots.txt where necessary to prevent crawlers from accessing low quality sections of the site is recommended if these problems cannot be fixed in the short term.

Google recommends only using robots.txt when server issues are being caused or for crawl efficiency issues, such as Googlebot spending a lot of time crawling a non-indexable section of a site.

Some examples of pages which you may not want to be crawled are:

- Category pages with non-standard sorting, as this generally creates duplication with the primary category page
- User-generated content that cannot be moderated
- Pages with sensitive information
- Internal search pages, as there can be an infinite number of these result pages, which provides a poor user experience and wastes crawl budget

When shouldn't you use robots.txt?

The robots.txt file is a useful tool when used correctly; however, there are instances where it isn't the best solution. Here are some examples of when not to use robots.txt to control crawling:

1. Blocking JavaScript/CSS

Search engines need to be able to access all resources on your site to correctly render pages, which is a necessary part of maintaining good rankings. JavaScript files which dramatically change the user experience, but are disallowed from crawling by search engines, may result in manual or algorithmic penalties.

For instance, if you serve an ad interstitial or redirect users with JavaScript that a search engine cannot access, this may be seen as cloaking and the rankings of your content may be adjusted accordingly.

2. Blocking URL parameters

You can use robots.txt to block URLs containing specific parameters, but this isn't always the best course of action. It is better to handle these in Google Search Console, as there are more parameter-specific options there to communicate preferred crawling methods to Google.

You could also place the information in a URL fragment (/page#sort=price), as search engines do not crawl this. Additionally, if a URL parameter must be used, the links to it could contain the rel=nofollow attribute to prevent crawlers from trying to access it.
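If you do decide to block a parameter via robots.txt, a wildcard rule is usually enough. Here is a minimal sketch, assuming a hypothetical "sort" parameter:

[code]
User-agent: *
# Block any URL whose query string contains "sort="
Disallow: /*?*sort=
[/code]

Bear in mind that this matches "sort=" anywhere after the "?", so it would also catch query strings containing parameters such as "resort=".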
3. Blocking URLs with backlinks

Disallowing URLs within the robots.txt prevents link equity from passing through to the website. This means that if search engines are unable to follow links from other websites because the target URL is disallowed, your website will not gain the authority that those links are passing, and as a result, you may not rank as well overall.

4. Getting indexed pages deindexed

Using Disallow doesn't get pages deindexed, and even if the URL is blocked and search engines have never crawled the page, disallowed pages may still get indexed. This is because the crawling and indexing processes are largely separate.

5. Setting rules which ignore social network crawlers

Even if you don't want search engines to crawl and index pages, you may want social networks to be able to access those pages so that a page snippet can be built. For example, Facebook will attempt to visit every page that gets posted on the network, so that they can serve a relevant snippet. Keep this in mind when setting robots.txt rules.

6. Blocking access from staging or dev sites

Using the robots.txt to block an entire staging site isn't best practice. Google recommends noindexing the pages but allowing them to be crawled, but in general it is better to render the site inaccessible from the outside world.

7. When you have nothing to block

Some websites with a very clean architecture have no need to block crawlers from any pages. In this situation it's perfectly acceptable not to have a robots.txt file, and to return a 404 status when it's requested.

Robots.txt Syntax and Formatting

Now that we've learnt what robots.txt is and when it should and shouldn't be used, let's take a look at the standardised syntax and formatting rules that should be adhered to when writing a robots.txt file.

Comments

Comments are lines that are completely ignored by search engines and start with a #. They exist to allow you to write notes about what each line of your robots.txt does, why it exists, and when it was added. In general, it is advised to document the purpose of every line of your robots.txt file, so that it can be removed when it is no longer necessary and is not modified while it is still essential.

Specifying User-agent

A block of rules can be applied to specific user agents using the "User-agent" directive. For instance, if you wanted certain rules to apply to Google, Bing, and Yandex, but not Facebook and ad networks, this can be achieved by specifying a user agent token that a set of rules applies to.

Each crawler has its own user-agent token, which is used to select the matching blocks.

Crawlers will follow the most specific user agent rules set for them, with the name separated by hyphens, and will then fall back to more generic rules if an exact match isn't found. For example, Googlebot News will look for a match of 'googlebot-news', then 'googlebot', then '*'.
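To illustrate this fallback behaviour, here is a minimal sketch with hypothetical paths. Googlebot News matches the first block, all other Google crawlers match the second, and every remaining bot falls back to the third:

[code]
User-agent: googlebot-news
Disallow: /drafts/

User-agent: googlebot
Disallow: /archive/

User-agent: *
Disallow: /internal/
[/code]

Note that a crawler only follows its most specific matching block, so in this sketch Googlebot News would still be allowed to crawl /archive/.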
Here are some of the most common user agent tokens you'll come across:

- * – The rules apply to every bot, unless there is a more specific set of rules
- Googlebot – All Google crawlers
- Googlebot-News – Crawler for Google News
- Googlebot-Image – Crawler for Google Images
- Mediapartners-Google – Google Adsense crawler
- Bingbot – Bing's crawler
- Yandex – Yandex's crawler
- Baiduspider – Baidu's crawler
- Facebot – Facebook's crawler
- Twitterbot – Twitter's crawler

This list of user agent tokens is by no means exhaustive, so to learn more about some of the crawlers out there, take a look at the documentation published by Google, Bing, Yandex, Baidu, Facebook and Twitter.

The matching of a user agent token to a robots.txt block is not case sensitive, e.g. 'googlebot' will match Google's user agent token 'Googlebot'.

Pattern Matching URLs

You might have a particular URL string you want to block from being crawled, as this is much more efficient than including a full list of complete URLs to be excluded in your robots.txt file.

To help you refine your URL paths, you can use the * and $ symbols. Here's how they work:

- * – This is a wildcard and represents any amount of any character. It can be at the start or in the middle of a URL path, but isn't required at the end. You can use multiple wildcards within a URL string, for example, "Disallow: */products?*sort=". Rules with full paths should not start with a wildcard.
- $ – This character signifies the end of a URL string, so "Disallow: */dress$" will match only URLs ending in "/dress", and not "/dress?parameter".

It's worth noting that robots.txt rules are case sensitive, meaning that if you disallow URLs with the parameter "search" (e.g. "Disallow: *?search="), robots might still crawl URLs with different capitalisation, such as "?Search=anything".

The directive rules match against URL paths only, and can't include a protocol or hostname. A slash at the start of a directive matches against the start of the URL path, e.g. "Disallow: /starts" would match www.example.com/starts.

Unless you start a directive match with a / or *, it will not match anything, e.g. "Disallow: starts" would never match anything.

To help visualise the way different URL rules work, we've put together some examples:
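These sketches use hypothetical paths; the comments describe what each rule would and wouldn't match:

[code]
# Matches /dress and /summer/dress, but not /dress?colour=blue
Disallow: */dress$

# Matches /products?sort=price and /products?colour=red&sort=price
Disallow: */products?*sort=

# Matches /search, /search/shirts and /searching
Disallow: /search
[/code]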
Robots.txt Sitemap Link

The sitemap directive in a robots.txt file tells search engines where to find the XML sitemap, which helps them to discover all the URLs on the website. To learn more about sitemaps, take a look at our guide on sitemap audits and advanced configuration.

When including sitemaps in a robots.txt file, you should use absolute URLs (e.g. https://www.example.com/sitemap.xml) instead of relative URLs (e.g. /sitemap.xml). It's also worth noting that sitemaps don't have to sit on your own root domain; they can also be hosted on an external domain.

Search engines will discover and may crawl the sitemaps listed in your robots.txt file; however, these sitemaps will not appear in Google Search Console or Bing Webmaster Tools without manual submission.

Robots.txt Blocks

The "disallow" rule in the robots.txt file can be used in a number of ways for different user agents. In this section, we'll cover some of the different ways you can format combinations of blocks.

It's important to remember that directives in the robots.txt file are only instructions. Malicious crawlers will ignore your robots.txt file and crawl any part of your site that is public, so disallow should not be used in place of robust security measures.

Multiple User-agent blocks

You can match a block of rules to multiple user agents by listing them before a set of rules. For example, the following disallow rule will apply to both Googlebot and Bing:

User-agent: googlebot
User-agent: bing
Disallow: /a

Spacing between blocks of directives

Google will ignore spaces between directives and blocks. In this first example, the second rule will be picked up, even though there is a space separating the two parts of the rule:

[code]
User-agent: *
Disallow: /disallowed/

Disallow: /test1/robots_excluded_blank_line
[/code]

In this second example, Googlebot-mobile would inherit the same rules as Bingbot:

[code]
User-agent: googlebot-mobile

User-agent: bing
Disallow: /test1/deepcrawl_excluded
[/code]

Separate blocks combined

Multiple blocks with the same user agent are combined. So in the example below, the top and bottom blocks would be combined and Googlebot would be disallowed from crawling "/b" and "/a":

User-agent: googlebot
Disallow: /b

User-agent: bing
Disallow: /a

User-agent: googlebot
Disallow: /a

Robots.txt Allow

The robots.txt "allow" rule explicitly gives permission for certain URLs to be crawled. While this is the default for all URLs, this rule can be used to overwrite a disallow rule. For example, if "/locations" is disallowed, you could allow the crawling of "/locations/london" by having the specific rule of "Allow: /locations/london".

Robots.txt Prioritisation

When several allow and disallow rules apply to a URL, the longest matching rule is the one that is applied. Let's look at what would happen for the URL "/home/search/shirts" with the following rules:

Disallow: /home
Allow: *search/*
Disallow: *shirts

In this case, the URL is allowed to be crawled because the allow rule has 9 characters, whereas the disallow rule has only 7. If you need a specific URL path to be allowed or disallowed, you can utilise * to make the string longer. For example:

Disallow: *******************/shirts

When a URL matches both an allow rule and a disallow rule, but the rules are the same length, the disallow will be followed. For example, the URL "/search/shirts" will be disallowed in the following scenario:

Disallow: /search
Allow: *shirts

Robots.txt Directives

Page level directives (which we'll cover later on in this guide) are great tools, but the issue with them is that search engines must crawl a page before being able to read these instructions, which can consume crawl budget.

Robots.txt directives can help to reduce the strain on crawl budget because you can add directives directly into your robots.txt file, rather than waiting for search engines to crawl pages before taking action on them. This solution is much quicker and easier to manage.

The following robots.txt directives work in the same way as the allow and disallow directives, in that you can specify wildcards (*) and use the $ symbol to denote the end of a URL string.

Robots.txt NoIndex

Robots.txt noindex is a useful tool for managing search engine indexing without using up crawl budget. Disallowing a page in robots.txt doesn't mean it is removed from the index, so the noindex directive is much more effective to use for this purpose.

Google doesn't officially support robots.txt noindex, and you shouldn't rely on it because, although it works today, it may not do so tomorrow. This tool can be helpful though, and should be used as a short-term fix in combination with other longer-term index controls, but not as a mission-critical directive. Take a look at the tests run by ohgm and Stone Temple, which both prove that the feature works effectively.

Here's an example of how you would use robots.txt noindex:

[code]
User-agent: *
NoIndex: /directory
NoIndex: /*?*sort=
[/code]

As well as noindex, Google currently unofficially obeys several other indexing directives when they're placed within the robots.txt. It is important to note that not all search engines and crawlers support these directives, and the ones which do may stop supporting them at any time, so you shouldn't rely on these working consistently.
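Putting the pieces together, here is a minimal sketch of a complete robots.txt file; the paths and the domain are hypothetical:

[code]
# Block all bots from internal search results,
# but keep one curated page crawlable
User-agent: *
Disallow: /search/
Allow: /search/popular-products

# Bingbot follows only this block, not the * block above
User-agent: bingbot
Disallow: /beta/

# Absolute URL to the XML sitemap
Sitemap: https://www.example.com/sitemap.xml
[/code]

Under the prioritisation logic described above, /search/popular-products remains crawlable because the allow rule is longer than the disallow rule. Note also that bingbot, having its own block, would ignore the User-agent: * rules entirely.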
Common Robots.txt Issues

There are some key issues and considerations for the robots.txt file and the impact it can have on a site's performance. We've taken the time to list some of the key points to consider with robots.txt, as well as some of the most common issues which you can hopefully avoid.

- Have a fallback block of rules for all bots – Using blocks of rules for specific user agent strings without having a fallback block of rules for every other bot means that your website will eventually encounter a bot which does not have any rule sets to follow (see the sketch after this list).
- It is important that robots.txt is kept up to date – A relatively common problem occurs when the robots.txt is set during the initial development phase of a website, but is not updated as the website grows, meaning that potentially useful pages are disallowed.
- Be aware of redirecting search engines through disallowed URLs – For example, /product > /disallowed > /category
- Case sensitivity can cause a lot of problems – Webmasters may expect a section of a website not to be crawled, but those pages may be crawled because of alternate casings, i.e. "Disallow: /admin" exists, but search engines crawl "/ADMIN".
- Don't disallow backlinked URLs – This prevents PageRank from flowing to your site from others that are linking to you.
- Crawl delay can cause search issues – The "crawl-delay" directive forces crawlers to visit your website slower than they would have liked, meaning that your important pages may be crawled less often than is optimal. This directive is not obeyed by Google or Baidu, but is supported by Bing and Yandex.
- Make sure the robots.txt only returns a 5xx status code if the whole site is down – Returning a 5xx status code for /robots.txt indicates to search engines that the website is down for maintenance. This typically means that they will try to crawl the website again later.
- Robots.txt disallow overrides the parameter removal tool – Be mindful that your robots.txt rules may override parameter handling and any other indexation hints that you may have given to search engines.
- Sitelinks Search Box markup will work with internal search pages blocked – Internal search pages on a site do not need to be crawlable for the Sitelinks Search Box markup to work.
- Disallowing a migrated domain will impact the success of the migration – If you disallow a migrated domain, search engines won't be able to follow any of the redirects from the old site to the new one, so the migration is unlikely to be a success.
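To illustrate the first point, here is a minimal sketch of a fallback block, using hypothetical paths; any bot without a more specific matching user agent token will follow the User-agent: * rules:

[code]
User-agent: googlebot
Disallow: /not-for-google/

# Fallback for every other bot
User-agent: *
Disallow: /not-for-bots/
[/code]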
Testing & Auditing Robots.txt

Considering just how harmful a robots.txt file can be if the directives within it aren't handled correctly, there are a few different ways you can test it to make sure it has been set up properly. Take a look at this guide on how to audit URLs blocked by robots.txt, as well as these examples:

- Use DeepCrawl – The Disallowed Pages and Disallowed URLs (Uncrawled) reports can show you which pages are being blocked from search engines by your robots.txt file.
- Use Google Search Console – With the GSC robots.txt tester tool you can see the latest cached version of a page, as well as use the Fetch and Render tool to see renders from the Googlebot user agent and the browser user agent. Things to note: GSC only works for Google user agents, and only single URLs can be tested.

Try combining the insights from both tools by spot-checking disallowed URLs that DeepCrawl has flagged within the GSC robots.txt tester tool, to clarify the specific rules which are resulting in a disallow.

Monitoring Robots.txt Changes

When there are lots of people working on a site, and given the issues that can be caused if even one character is out of place in a robots.txt file, constantly monitoring your robots.txt is crucial. Here are some ways in which you can check for any issues:

- Check Google Search Console to see the current robots.txt which Google is using. Sometimes robots.txt can be delivered conditionally based on user agents, so this is the only method to see exactly what Google is seeing.
- Check the size of the robots.txt file if you have noticed significant changes, to make sure it is under Google's 500KB size limit.
- Go to the Google Search Console Index Status report in advanced mode to cross-check robots.txt changes with the number of disallowed and allowed URLs on your site.
- Schedule regular crawls with DeepCrawl to see the number of disallowed pages on your site on an ongoing basis, so you can track changes.

Next: URL-level Robots Directives

Author

Rachel Costello is a Former Technical SEO & Content Manager at Deepcrawl. You'll most often find her writing and speaking about all things SEO.