Robots.txt - Everything SEOs Need to Know - Deepcrawl


Robots.txt

Rachel Costello – 26 min read

In this section of our guide to robots directives, we'll go into more detail about the robots.txt text file and how it can be used to instruct search engine web crawlers. This file is especially useful for managing crawl budget and making sure search engines are spending their time on your site efficiently and crawling only the important pages.

What is a robots.txt file used for?

The robots.txt file is there to tell crawlers and robots which URLs they should not visit on your website. This is important to help them avoid crawling low quality pages, or getting stuck in crawl traps where an infinite number of URLs could potentially be created, for example, a calendar section which creates a new URL for every day.

As Google explains in their robots.txt specifications guide, the file format should be plain text encoded in UTF-8. The file's records (or lines) should be separated by CR, CR/LF or LF.

You should be mindful of the size of a robots.txt file, as search engines have their own maximum file size limits. The maximum size for Google is 500KB.

Where should the robots.txt exist?

The robots.txt should always exist on the root of the domain, for example: https://www.example.com/robots.txt

This file is specific to the protocol and full domain, so the robots.txt on https://www.example.com does not impact the crawling of http://www.example.com or https://subdomain.example.com; these should have their own robots.txt files.

When should you use robots.txt rules?

In general, websites should try to utilise the robots.txt as little as possible to control crawling. Improving your website's architecture and making it clean and accessible for crawlers is a much better solution. However, using robots.txt where necessary to prevent crawlers from accessing low quality sections of the site is recommended if these problems cannot be fixed in the short term.

Google recommends only using robots.txt when server issues are being caused, or for crawl efficiency issues such as Googlebot spending a lot of time crawling a non-indexable section of a site.

Some examples of pages which you may not want to be crawled are:

Category pages with non-standard sorting, as this generally creates duplication with the primary category page
User-generated content that cannot be moderated
Pages with sensitive information
Internal search pages, as there can be an infinite amount of these result pages, which provides a poor user experience and wastes crawl budget

When shouldn't you use robots.txt?

The robots.txt file is a useful tool when used correctly, however, there are instances where it isn't the best solution. Here are some examples of when not to use robots.txt to control crawling:

1. Blocking JavaScript/CSS

Search engines need to be able to access all resources on your site to correctly render pages, which is a necessary part of maintaining good rankings. JavaScript files which dramatically change the user experience, but are disallowed from crawling by search engines, may result in manual or algorithmic penalties.
For instance, if you serve an ad interstitial or redirect users with JavaScript that a search engine cannot access, this may be seen as cloaking and the rankings of your content may be adjusted accordingly.

2. Blocking URL parameters

You can use robots.txt to block URLs containing specific parameters, but this isn't always the best course of action. It is better to handle these in Google Search Console, as there are more parameter-specific options there to communicate preferred crawling methods to Google.

You could also place the information in a URL fragment (/page#sort=price), as search engines do not crawl this. Additionally, if a URL parameter must be used, the links to it could contain the rel="nofollow" attribute to prevent crawlers from trying to access it.

3. Blocking URLs with backlinks

Disallowing URLs within the robots.txt prevents link equity from passing through to the website. This means that if search engines are unable to follow links from other websites because the target URL is disallowed, your website will not gain the authority that those links are passing, and as a result, you may not rank as well overall.

4. Getting indexed pages deindexed

Using Disallow doesn't get pages deindexed, and even if the URL is blocked and search engines have never crawled the page, disallowed pages may still get indexed. This is because the crawling and indexing processes are largely separate.

5. Setting rules which ignore social network crawlers

Even if you don't want search engines to crawl and index pages, you may want social networks to be able to access those pages so that a page snippet can be built. For example, Facebook will attempt to visit every page that gets posted on the network, so that they can serve a relevant snippet. Keep this in mind when setting robots.txt rules.

6. Blocking access from staging or dev sites

Using the robots.txt to block an entire staging site isn't best practice. Google recommends noindexing the pages but allowing them to be crawled, but in general it is better to render the site inaccessible from the outside world.

7. When you have nothing to block

Some websites with a very clean architecture have no need to block crawlers from any pages. In this situation it's perfectly acceptable not to have a robots.txt file, and to return a 404 status when it's requested.

Robots.txt Syntax and Formatting

Now that we've learnt what robots.txt is and when it should and shouldn't be used, let's take a look at the standardised syntax and formatting rules that should be adhered to when writing a robots.txt file.

Comments

Comments are lines that are completely ignored by search engines and start with a #. They exist to allow you to write notes about what each line of your robots.txt does, why it exists, and when it was added. In general, it is advised to document the purpose of every line of your robots.txt file, so that it can be removed when it is no longer necessary and is not modified while it is still essential.

Specifying User-agent

A block of rules can be applied to specific user agents using the "User-agent" directive. For instance, if you wanted certain rules to apply to Google, Bing, and Yandex, but not Facebook and ad networks, this can be achieved by specifying a user agent token that a set of rules applies to.

Each crawler has its own user-agent token, which is used to select the matching blocks.

Crawlers will follow the most specific user agent rules set for them, with the name separated by hyphens, and will then fall back to more generic rules if an exact match isn't found. For example, Googlebot News will look for a match of 'googlebot-news', then 'googlebot', then '*'.
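To make that fallback concrete, here is a minimal sketch of a file with three blocks; the paths are hypothetical and purely illustrative. Under these rules, Googlebot News would follow only the first block, other Google crawlers would fall back to the second, and every remaining bot would use the third:

[code]
# Hypothetical example – paths are for illustration only
User-agent: googlebot-news
Disallow: /archive/

User-agent: googlebot
Disallow: /internal-search/

User-agent: *
Disallow: /internal-search/
Disallow: /archive/
[/code]

Note that only the most specific matching block is applied; it isn't combined with the more generic blocks, which is why any rules that should apply to every crawler need to be repeated in each block.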
Here are some of the most common user agent tokens you'll come across:

* – The rules apply to every bot, unless there is a more specific set of rules
Googlebot – All Google crawlers
Googlebot-News – Crawler for Google News
Googlebot-Image – Crawler for Google Images
Mediapartners-Google – Google AdSense crawler
Bingbot – Bing's crawler
Yandex – Yandex's crawler
Baiduspider – Baidu's crawler
Facebot – Facebook's crawler
Twitterbot – Twitter's crawler

This list of user agent tokens is by no means exhaustive, so to learn more about some of the crawlers out there, take a look at the documentation published by Google, Bing, Yandex, Baidu, Facebook and Twitter.

The matching of a user agent token to a robots.txt block is not case sensitive, e.g. 'googlebot' will match Google's user agent token 'Googlebot'.

Pattern Matching URLs

You might want to block a particular URL pattern from being crawled, as this is much more efficient than including a full list of complete URLs to be excluded in your robots.txt file.

To help you refine your URL paths, you can use the * and $ symbols. Here's how they work:

* – This is a wildcard and represents any amount of any character. It can be at the start or in the middle of a URL path, but isn't required at the end. You can use multiple wildcards within a URL string, for example, "Disallow: */products?*sort=". Rules with full paths should not start with a wildcard.

$ – This character signifies the end of a URL string, so "Disallow: */dress$" will match only URLs ending in "/dress", and not "/dress?parameter".

It's worth noting that robots.txt rules are case sensitive, meaning that if you disallow URLs with the parameter "search" (e.g. "Disallow: *?search="), robots might still crawl URLs with different capitalisation, such as "?Search=anything".

The directive rules match against URL paths only, and can't include a protocol or hostname. A slash at the start of a directive matches against the start of the URL path, e.g. "Disallow: /starts" would match www.example.com/starts.

Unless a directive match starts with a / or *, it will not match anything, e.g. "Disallow: starts" would never match anything.

To help visualise how different URL rules work: "Disallow: /" would block the entire site, "Disallow: /*.pdf$" would block any URL path ending in ".pdf", and "Disallow: *?filter=" would block any URL containing the "?filter=" parameter.

Robots.txt Sitemap Link

The sitemap directive in a robots.txt file tells search engines where to find the XML sitemap, which helps them to discover all the URLs on the website. To learn more about sitemaps, take a look at our guide on sitemap audits and advanced configuration.

When including sitemaps in a robots.txt file, you should use absolute URLs (i.e. https://www.example.com/sitemap.xml) instead of relative URLs (i.e. /sitemap.xml). It's also worth noting that sitemaps don't have to sit on one root domain; they can also be hosted on an external domain.

Search engines will discover and may crawl the sitemaps listed in your robots.txt file, however, these sitemaps will not appear in Google Search Console or Bing Webmaster Tools without manual submission.

Robots.txt Blocks

The "disallow" rule in the robots.txt file can be used in a number of ways for different user agents. In this section, we'll cover some of the different ways you can format combinations of blocks.

It's important to remember that directives in the robots.txt file are only instructions. Malicious crawlers will ignore your robots.txt file and crawl any part of your site that is public, so disallow should not be used in place of robust security measures.
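Before looking at those combinations, here is a minimal sketch of the shape a single block takes; the paths and sitemap URL are placeholder assumptions rather than recommendations. A block is made up of one or more User-agent lines followed by the rules that apply to those crawlers:

[code]
# A single block of rules for all crawlers – paths are hypothetical
User-agent: *
Disallow: /checkout/
Disallow: /*?sessionid=

Sitemap: https://www.example.com/sitemap.xml
[/code]

The Sitemap line sits outside the block because it applies to the file as a whole rather than to a particular user agent.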
Multiple User-agent blocks

You can match a block of rules to multiple user agents by listing them before a set of rules. For example, the following disallow rule will apply to both Googlebot and Bing:

User-agent: googlebot
User-agent: bing
Disallow: /a

Spacing between blocks of directives

Google will ignore spaces between directives and blocks. In this first example, the second rule will be picked up, even though there is a space separating the two parts of the rule:

[code]
User-agent: *
Disallow: /disallowed/

Disallow: /test1/robots_excluded_blank_line
[/code]

In this second example, Googlebot-mobile would inherit the same rules as Bingbot:

[code]
User-agent: googlebot-mobile

User-agent: bing
Disallow: /test1/deepcrawl_excluded
[/code]

Separate blocks combined

Multiple blocks with the same user agent are combined. So in the example below, the top and bottom blocks would be combined and Googlebot would be disallowed from crawling "/b" and "/a".

User-agent: googlebot
Disallow: /b

User-agent: bing
Disallow: /a

User-agent: googlebot
Disallow: /a

Robots.txt Allow

The robots.txt "allow" rule explicitly gives permission for certain URLs to be crawled. While this is the default for all URLs, this rule can be used to overwrite a disallow rule. For example, if "/locations" is disallowed, you could allow the crawling of "/locations/london" by having the specific rule of "Allow: /locations/london".

Robots.txt Prioritisation

When several allow and disallow rules apply to a URL, the longest matching rule is the one that is applied. Let's look at what would happen for the URL "/home/search/shirts" with the following rules:

Disallow: /home
Allow: *search/*
Disallow: *shirts

In this case, the URL is allowed to be crawled because the Allow rule has 9 characters, whereas the disallow rule has only 7. If you need a specific URL path to be allowed or disallowed, you can utilise * to make the string longer. For example:

Disallow: *******************/shirts

When a URL matches both an allow rule and a disallow rule of the same length, the outcome is less clear-cut. Google's robots.txt specification says that the least restrictive rule (the allow) is used when conflicting rules are equally specific, but other crawlers may resolve the tie differently, so it's safest not to rely on rules of equal length. For example, the URL "/search/shirts" matches both of the following rules equally:

Disallow: /search
Allow: *shirts

Robots.txt Directives

Page level directives (which we'll cover later on in this guide) are great tools, but the issue with them is that search engines must crawl a page before being able to read these instructions, which can consume crawl budget.

Robots.txt directives can help to reduce the strain on crawl budget, because you can add directives directly into your robots.txt file rather than waiting for search engines to crawl pages before taking action on them. This solution is much quicker and easier to manage.

The following robots.txt directives work in the same way as the allow and disallow directives, in that you can specify wildcards (*) and use the $ symbol to denote the end of a URL string.

Robots.txt NoIndex

Robots.txt noindex is a useful tool for managing search engine indexing without using up crawl budget. Disallowing a page in robots.txt doesn't mean it is removed from the index, so the noindex directive is much more effective to use for this purpose.

Google doesn't officially support robots.txt noindex, and you shouldn't rely on it because although it may work today, it may not do so tomorrow. (Google has since announced that unsupported rules, including noindex, would stop working in robots.txt from 1 September 2019.) This tool can be helpful as a short term fix in combination with other longer-term index controls, but it should not be treated as a mission-critical directive. Take a look at the tests run by ohgm and Stone Temple, which both show that the feature has worked effectively.

Here's an example of how you would use robots.txt noindex:

[code]
User-agent: *
NoIndex: /directory
NoIndex: /*?*sort=
[/code]

As well as noindex, Google has unofficially obeyed several other indexing directives when they're placed within the robots.txt. It is important to note that not all search engines and crawlers support these directives, and the ones which do may stop supporting them at any time – you shouldn't rely on these working consistently.
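Pulling the syntax above together, here is a hedged sketch of how these pieces might combine in a single file for a hypothetical site; every path, parameter name, and URL is an assumption for illustration rather than a recommendation:

[code]
# Hypothetical example combining user agent blocks, wildcards, allow and a sitemap
User-agent: *
Disallow: /internal-search/        # infinite result pages waste crawl budget
Disallow: /*?*sort=                # non-standard sorting duplicates category pages
Allow: /internal-search/popular$   # longer matching rule, so this exact URL stays crawlable

User-agent: Mediapartners-Google
Disallow:                          # an empty Disallow means nothing is blocked for this crawler

Sitemap: https://www.example.com/sitemap.xml
[/code]

Because the Allow rule is longer than the Disallow rule it overlaps with, the longest-match prioritisation described above keeps "/internal-search/popular" crawlable while the rest of that section stays blocked.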
Common Robots.txt Issues

There are some key issues and considerations for the robots.txt file and the impact it can have on a site's performance. We've taken the time to list some of the key points to consider with robots.txt, as well as some of the most common issues which you can hopefully avoid.

Have a fallback block of rules for all bots – Using blocks of rules for specific user agent strings without having a fallback block of rules for every other bot means that your website will eventually encounter a bot which does not have any rulesets to follow.

It is important that robots.txt is kept up to date – A relatively common problem occurs when the robots.txt is set during the initial development phase of a website, but is not updated as the website grows, meaning that potentially useful pages are disallowed.

Be aware of redirecting search engines through disallowed URLs – For example, /product > /disallowed > /category.

Case sensitivity can cause a lot of problems – Webmasters may expect a section of a website not to be crawled, but those pages may be crawled because of alternate casings, i.e. "Disallow: /admin" exists, but search engines crawl "/ADMIN".

Don't disallow backlinked URLs – This prevents PageRank from flowing to your site from others that are linking to you.

Crawl Delay can cause search issues – The "crawl-delay" directive forces crawlers to visit your website more slowly than they would have liked, meaning that your important pages may be crawled less often than is optimal. This directive is not obeyed by Google or Baidu, but is supported by Bing and Yandex.

Make sure the robots.txt only returns a 5xx status code if the whole site is down – Returning a 5xx status code for /robots.txt indicates to search engines that the website is down for maintenance, which typically means that they will try to crawl the website again later.

Robots.txt disallow overrides the parameter removal tool – Be mindful that your robots.txt rules may override parameter handling and any other indexation hints that you may have given to search engines.

Sitelinks Search Box markup will work with internal search pages blocked – Internal search pages on a site do not need to be crawlable for the Sitelinks Search Box markup to work.

Disallowing a migrated domain will impact the success of the migration – If you disallow a migrated domain, search engines won't be able to follow any of the redirects from the old site to the new one, so the migration is unlikely to be a success.

Testing & Auditing Robots.txt

Considering just how harmful a robots.txt file can be if the directives within it aren't handled correctly, there are a few different ways you can test it to make sure it has been set up properly. Take a look at this guide on how to audit URLs blocked by robots.txt, as well as these examples:

Use DeepCrawl – The Disallowed Pages and Disallowed URLs (Uncrawled) reports can show you which pages are being blocked from search engines by your robots.txt file.

Use Google Search Console – With the GSC robots.txt tester tool you can see the latest cached version of a page, as well as use the Fetch and Render tool to see renders from the Googlebot user agent and the browser user agent. Things to note: GSC only works for Google user agents, and only single URLs can be tested.

Try combining the insights from both tools by spot-checking disallowed URLs that DeepCrawl has flagged within the GSC robots.txt tester tool, to clarify the specific rules which are resulting in a disallow.

Monitoring Robots.txt Changes

When there are lots of people working on a site, and with the issues that can be caused if even one character is out of place in a robots.txt file, constantly monitoring your robots.txt is crucial. Here are some ways in which you can check for any issues:

Check Google Search Console to see the current robots.txt which Google is using. Sometimes robots.txt can be delivered conditionally based on user agents, so this is the only method to see exactly what Google is seeing.

Check the size of the robots.txt file if you have noticed significant changes, to make sure it is under Google's 500KB size limit.
Go to the Google Search Console Index Status report in advanced mode to cross-check robots.txt changes with the number of disallowed and allowed URLs on your site.

Schedule regular crawls with DeepCrawl to see the number of disallowed pages on your site on an ongoing basis, so you can track changes.

Next: URL-level Robots Directives

Author

Rachel Costello

Rachel Costello is a Former Technical SEO & Content Manager at Deepcrawl. You'll most often find her writing and speaking about all things SEO.


