Robots.txt - Everything SEOs Need to Know - Deepcrawl


Robots.txt

Rachel Costello – 26 min read

In this section of our guide to robots directives, we'll go into more detail about the robots.txt text file and how it can be used to instruct search engine web crawlers. This file is especially useful for managing crawl budget and making sure search engines are spending their time on your site efficiently and crawling only the important pages.

What is a robots.txt file used for?

The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your website. This is important to help them avoid crawling low quality pages, or getting stuck in crawl traps where an infinite number of URLs could potentially be created, for example, a calendar section which creates a new URL for every day.

As Google explains in their robots.txt specifications guide, the file format should be plain text encoded in UTF-8. The file's records (or lines) should be separated by CR, CR/LF or LF.

You should be mindful of the size of a robots.txt file, as search engines have their own maximum file size limits. The maximum size for Google is 500KB.

Where should the robots.txt exist?

The robots.txt should always exist on the root of the domain, for example: https://www.example.com/robots.txt

This file is specific to the protocol and full domain, so the robots.txt on https://www.example.com does not impact the crawling of http://www.example.com or https://subdomain.example.com; these should have their own robots.txt files.

When should you use robots.txt rules?

In general, websites should try to utilise the robots.txt as little as possible to control crawling. Improving your website's architecture and making it clean and accessible for crawlers is a much better solution. However, using robots.txt where necessary to prevent crawlers from accessing low quality sections of the site is recommended if these problems cannot be fixed in the short term.

Google recommends only using robots.txt when server issues are being caused, or for crawl efficiency issues such as Googlebot spending a lot of time crawling a non-indexable section of a site.

Some examples of pages which you may not want to be crawled are:

Category pages with non-standard sorting, as this generally creates duplication with the primary category page
User-generated content that cannot be moderated
Pages with sensitive information
Internal search pages, as there can be an infinite amount of these result pages, which provides a poor user experience and wastes crawl budget

When shouldn't you use robots.txt?

The robots.txt file is a useful tool when used correctly, however, there are instances where it isn't the best solution. Here are some examples of when not to use robots.txt to control crawling:

1. Blocking JavaScript/CSS

Search engines need to be able to access all resources on your site to correctly render pages, which is a necessary part of maintaining good rankings. JavaScript files which dramatically change the user experience, but are disallowed from crawling by search engines, may result in manual or algorithmic penalties.
For instance, if you serve an ad interstitial or redirect users with JavaScript that a search engine cannot access, this may be seen as cloaking and the rankings of your content may be adjusted accordingly.

2. Blocking URL parameters

You can use robots.txt to block URLs containing specific parameters, but this isn't always the best course of action. It is better to handle these in Google Search Console, as there are more parameter-specific options there to communicate preferred crawling methods to Google.

You could also place the information in a URL fragment (/page#sort=price), as search engines do not crawl this. Additionally, if a URL parameter must be used, the links to it could contain the rel="nofollow" attribute to prevent crawlers from trying to access it.

3. Blocking URLs with backlinks

Disallowing URLs within the robots.txt prevents link equity from passing through to the website. This means that if search engines are unable to follow links from other websites because the target URL is disallowed, your website will not gain the authority that those links are passing, and as a result, you may not rank as well overall.

4. Getting indexed pages deindexed

Using Disallow doesn't get pages deindexed, and even if the URL is blocked and search engines have never crawled the page, disallowed pages may still get indexed. This is because the crawling and indexing processes are largely separate.

5. Setting rules which ignore social network crawlers

Even if you don't want search engines to crawl and index pages, you may want social networks to be able to access those pages so that a page snippet can be built. For example, Facebook will attempt to visit every page that gets posted on the network, so that they can serve a relevant snippet. Keep this in mind when setting robots.txt rules.

6. Blocking access from staging or dev sites

Using the robots.txt to block an entire staging site isn't best practice. Google recommends noindexing the pages but allowing them to be crawled, but in general it is better to render the site inaccessible from the outside world.

7. When you have nothing to block

Some websites with a very clean architecture have no need to block crawlers from any pages. In this situation it's perfectly acceptable not to have a robots.txt file, and to return a 404 status when it's requested.

Robots.txt Syntax and Formatting

Now that we've learnt what robots.txt is and when it should and shouldn't be used, let's take a look at the standardised syntax and formatting rules that should be adhered to when writing a robots.txt file.

Comments

Comments are lines that are completely ignored by search engines and start with a #. They exist to allow you to write notes about what each line of your robots.txt does, why it exists, and when it was added. In general, it is advised to document the purpose of every line of your robots.txt file, so that it can be removed when it is no longer necessary and is not modified while it is still essential.

Specifying User-agent

A block of rules can be applied to specific user agents using the "User-agent" directive. For instance, if you wanted certain rules to apply to Google, Bing, and Yandex, but not Facebook and ad networks, this can be achieved by specifying a user agent token that a set of rules applies to.

Each crawler has its own user-agent token, which is used to select the matching blocks.

Crawlers will follow the most specific user agent rules set for them, with the name separated by hyphens, and will then fall back to more generic rules if an exact match isn't found. For example, Googlebot News will look for a match of 'googlebot-news', then 'googlebot', then '*'.
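To make that fallback concrete, here is a minimal sketch of a file with three blocks; the paths are hypothetical and purely illustrative. Under these rules, Googlebot News would follow only the first block, other Google crawlers would fall back to the second, and every remaining bot would use the third:

[code]
# Hypothetical example – paths are for illustration only
User-agent: googlebot-news
Disallow: /archive/

User-agent: googlebot
Disallow: /internal-search/

User-agent: *
Disallow: /internal-search/
Disallow: /archive/
[/code]

Note that only the most specific matching block is applied; it isn't combined with the more generic blocks, which is why any rules that should apply to every crawler need to be repeated in each block.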
Here are some of the most common user agent tokens you'll come across:

* – The rules apply to every bot, unless there is a more specific set of rules
Googlebot – All Google crawlers
Googlebot-News – Crawler for Google News
Googlebot-Image – Crawler for Google Images
Mediapartners-Google – Google AdSense crawler
Bingbot – Bing's crawler
Yandex – Yandex's crawler
Baiduspider – Baidu's crawler
Facebot – Facebook's crawler
Twitterbot – Twitter's crawler

This list of user agent tokens is by no means exhaustive, so to learn more about some of the crawlers out there, take a look at the documentation published by Google, Bing, Yandex, Baidu, Facebook and Twitter.

The matching of a user agent token to a robots.txt block is not case sensitive, e.g. 'googlebot' will match Google's user agent token 'Googlebot'.

Pattern Matching URLs

You might want to block a particular URL pattern from being crawled, as this is much more efficient than including a full list of complete URLs to be excluded in your robots.txt file.

To help you refine your URL paths, you can use the * and $ symbols. Here's how they work:

* – This is a wildcard and represents any amount of any character. It can be at the start or in the middle of a URL path, but isn't required at the end. You can use multiple wildcards within a URL string, for example, "Disallow: */products?*sort=". Rules with full paths should not start with a wildcard.

$ – This character signifies the end of a URL string, so "Disallow: */dress$" will match only URLs ending in "/dress", and not "/dress?parameter".

It's worth noting that robots.txt rules are case sensitive, meaning that if you disallow URLs with the parameter "search" (e.g. "Disallow: *?search="), robots might still crawl URLs with different capitalisation, such as "?Search=anything".

The directive rules match against URL paths only, and can't include a protocol or hostname. A slash at the start of a directive matches against the start of the URL path, e.g. "Disallow: /starts" would match www.example.com/starts.

Unless a directive match starts with a / or *, it will not match anything, e.g. "Disallow: starts" would never match anything.

To help visualise how different URL rules work: "Disallow: /" would block the entire site, "Disallow: /*.pdf$" would block any URL path ending in ".pdf", and "Disallow: *?filter=" would block any URL containing the "?filter=" parameter.

Robots.txt Sitemap Link

The sitemap directive in a robots.txt file tells search engines where to find the XML sitemap, which helps them to discover all the URLs on the website. To learn more about sitemaps, take a look at our guide on sitemap audits and advanced configuration.

When including sitemaps in a robots.txt file, you should use absolute URLs (i.e. https://www.example.com/sitemap.xml) instead of relative URLs (i.e. /sitemap.xml). It's also worth noting that sitemaps don't have to sit on one root domain; they can also be hosted on an external domain.

Search engines will discover and may crawl the sitemaps listed in your robots.txt file, however, these sitemaps will not appear in Google Search Console or Bing Webmaster Tools without manual submission.

Robots.txt Blocks

The "disallow" rule in the robots.txt file can be used in a number of ways for different user agents. In this section, we'll cover some of the different ways you can format combinations of blocks.

It's important to remember that directives in the robots.txt file are only instructions. Malicious crawlers will ignore your robots.txt file and crawl any part of your site that is public, so disallow should not be used in place of robust security measures.
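Before looking at those combinations, here is a minimal sketch of the shape a single block takes; the paths and sitemap URL are placeholder assumptions rather than recommendations. A block is made up of one or more User-agent lines followed by the rules that apply to those crawlers:

[code]
# A single block of rules for all crawlers – paths are hypothetical
User-agent: *
Disallow: /checkout/
Disallow: /*?sessionid=

Sitemap: https://www.example.com/sitemap.xml
[/code]

The Sitemap line sits outside the block because it applies to the file as a whole rather than to a particular user agent.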
Multiple User-agent blocks

You can match a block of rules to multiple user agents by listing them before a set of rules. For example, the following disallow rule will apply to both Googlebot and Bing:

User-agent: googlebot
User-agent: bing
Disallow: /a

Spacing between blocks of directives

Google will ignore spaces between directives and blocks. In this first example, the second rule will be picked up, even though there is a space separating the two parts of the rule:

[code]
User-agent: *
Disallow: /disallowed/

Disallow: /test1/robots_excluded_blank_line
[/code]

In this second example, Googlebot-mobile would inherit the same rules as Bingbot:

[code]
User-agent: googlebot-mobile

User-agent: bing
Disallow: /test1/deepcrawl_excluded
[/code]

Separate blocks combined

Multiple blocks with the same user agent are combined. So in the example below, the top and bottom blocks would be combined and Googlebot would be disallowed from crawling "/b" and "/a".

User-agent: googlebot
Disallow: /b

User-agent: bing
Disallow: /a

User-agent: googlebot
Disallow: /a

Robots.txt Allow

The robots.txt "allow" rule explicitly gives permission for certain URLs to be crawled. While this is the default for all URLs, this rule can be used to overwrite a disallow rule. For example, if "/locations" is disallowed, you could allow the crawling of "/locations/london" by having the specific rule of "Allow: /locations/london".

Robots.txt Prioritisation

When several allow and disallow rules apply to a URL, the longest matching rule is the one that is applied. Let's look at what would happen for the URL "/home/search/shirts" with the following rules:

Disallow: /home
Allow: *search/*
Disallow: *shirts

In this case, the URL is allowed to be crawled because the Allow rule has 9 characters, whereas the disallow rule has only 7. If you need a specific URL path to be allowed or disallowed, you can utilise * to make the string longer. For example:

Disallow: *******************/shirts

When a URL matches both an allow rule and a disallow rule of the same length, the outcome is less clear-cut. Google's robots.txt specification says that the least restrictive rule (the allow) is used when conflicting rules are equally specific, but other crawlers may resolve the tie differently, so it's safest not to rely on rules of equal length. For example, the URL "/search/shirts" matches both of the following rules equally:

Disallow: /search
Allow: *shirts

Robots.txt Directives

Page level directives (which we'll cover later on in this guide) are great tools, but the issue with them is that search engines must crawl a page before being able to read these instructions, which can consume crawl budget.

Robots.txt directives can help to reduce the strain on crawl budget, because you can add directives directly into your robots.txt file rather than waiting for search engines to crawl pages before taking action on them. This solution is much quicker and easier to manage.

The following robots.txt directives work in the same way as the allow and disallow directives, in that you can specify wildcards (*) and use the $ symbol to denote the end of a URL string.

Robots.txt NoIndex

Robots.txt noindex is a useful tool for managing search engine indexing without using up crawl budget. Disallowing a page in robots.txt doesn't mean it is removed from the index, so the noindex directive is much more effective to use for this purpose.

Google doesn't officially support robots.txt noindex, and you shouldn't rely on it because although it may work today, it may not do so tomorrow. (Google has since announced that unsupported rules, including noindex, would stop working in robots.txt from 1 September 2019.) This tool can be helpful as a short term fix in combination with other longer-term index controls, but it should not be treated as a mission-critical directive. Take a look at the tests run by ohgm and Stone Temple, which both show that the feature has worked effectively.

Here's an example of how you would use robots.txt noindex:

[code]
User-agent: *
NoIndex: /directory
NoIndex: /*?*sort=
[/code]

As well as noindex, Google has unofficially obeyed several other indexing directives when they're placed within the robots.txt. It is important to note that not all search engines and crawlers support these directives, and the ones which do may stop supporting them at any time – you shouldn't rely on these working consistently.
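Pulling the syntax above together, here is a hedged sketch of how these pieces might combine in a single file for a hypothetical site; every path, parameter name, and URL is an assumption for illustration rather than a recommendation:

[code]
# Hypothetical example combining user agent blocks, wildcards, allow and a sitemap
User-agent: *
Disallow: /internal-search/        # infinite result pages waste crawl budget
Disallow: /*?*sort=                # non-standard sorting duplicates category pages
Allow: /internal-search/popular$   # longer matching rule, so this exact URL stays crawlable

User-agent: Mediapartners-Google
Disallow:                          # an empty Disallow means nothing is blocked for this crawler

Sitemap: https://www.example.com/sitemap.xml
[/code]

Because the Allow rule is longer than the Disallow rule it overlaps with, the longest-match prioritisation described above keeps "/internal-search/popular" crawlable while the rest of that section stays blocked.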
Common Robots.txt Issues

There are some key issues and considerations for the robots.txt file and the impact it can have on a site's performance. We've taken the time to list some of the key points to consider with robots.txt, as well as some of the most common issues which you can hopefully avoid.

Have a fallback block of rules for all bots – Using blocks of rules for specific user agent strings without having a fallback block of rules for every other bot means that your website will eventually encounter a bot which does not have any rulesets to follow.

It is important that robots.txt is kept up to date – A relatively common problem occurs when the robots.txt is set during the initial development phase of a website, but is not updated as the website grows, meaning that potentially useful pages are disallowed.

Be aware of redirecting search engines through disallowed URLs – For example, /product > /disallowed > /category.

Case sensitivity can cause a lot of problems – Webmasters may expect a section of a website not to be crawled, but those pages may be crawled because of alternate casings, i.e. "Disallow: /admin" exists, but search engines crawl "/ADMIN".

Don't disallow backlinked URLs – This prevents PageRank from flowing to your site from others that are linking to you.

Crawl Delay can cause search issues – The "crawl-delay" directive forces crawlers to visit your website more slowly than they would have liked, meaning that your important pages may be crawled less often than is optimal. This directive is not obeyed by Google or Baidu, but is supported by Bing and Yandex.

Make sure the robots.txt only returns a 5xx status code if the whole site is down – Returning a 5xx status code for /robots.txt indicates to search engines that the website is down for maintenance, which typically means that they will try to crawl the website again later.

Robots.txt disallow overrides the parameter removal tool – Be mindful that your robots.txt rules may override parameter handling and any other indexation hints that you may have given to search engines.

Sitelinks Search Box markup will work with internal search pages blocked – Internal search pages on a site do not need to be crawlable for the Sitelinks Search Box markup to work.

Disallowing a migrated domain will impact the success of the migration – If you disallow a migrated domain, search engines won't be able to follow any of the redirects from the old site to the new one, so the migration is unlikely to be a success.

Testing & Auditing Robots.txt

Considering just how harmful a robots.txt file can be if the directives within it aren't handled correctly, there are a few different ways you can test it to make sure it has been set up properly. Take a look at this guide on how to audit URLs blocked by robots.txt, as well as these examples:

Use DeepCrawl – The Disallowed Pages and Disallowed URLs (Uncrawled) reports can show you which pages are being blocked from search engines by your robots.txt file.

Use Google Search Console – With the GSC robots.txt tester tool you can see the latest cached version of a page, as well as use the Fetch and Render tool to see renders from the Googlebot user agent and the browser user agent. Things to note: GSC only works for Google user agents, and only single URLs can be tested.

Try combining the insights from both tools by spot-checking disallowed URLs that DeepCrawl has flagged within the GSC robots.txt tester tool, to clarify the specific rules which are resulting in a disallow.

Monitoring Robots.txt Changes

When there are lots of people working on a site, and with the issues that can be caused if even one character is out of place in a robots.txt file, constantly monitoring your robots.txt is crucial. Here are some ways in which you can check for any issues:

Check Google Search Console to see the current robots.txt which Google is using. Sometimes robots.txt can be delivered conditionally based on user agents, so this is the only method to see exactly what Google is seeing.

Check the size of the robots.txt file if you have noticed significant changes, to make sure it is under Google's 500KB size limit.
Go to the Google Search Console Index Status report in advanced mode to cross-check robots.txt changes with the number of disallowed and allowed URLs on your site.

Schedule regular crawls with DeepCrawl to see the number of disallowed pages on your site on an ongoing basis, so you can track changes.

Next: URL-level Robots Directives

Author

Rachel Costello

Rachel Costello is a Former Technical SEO & Content Manager at Deepcrawl. You'll most often find her writing and speaking about all things SEO.


