Knowledge Graphs With Machine Learning [Guide] - Neptune.ai


You need to get some information online, for example a few paragraphs about Usain Bolt. You can copy and paste the information from Wikipedia; it won't be much work.

But what if you needed information about all the competitions that Usain Bolt has taken part in, along with all the related stats about him and his competitors? And what if you wanted to do that for all sports, not just running?

Machine learning engineers often need to build complex datasets like the example above to train their models. Web scraping is a very useful method for collecting the necessary data, but it comes with some challenges.

In this article, I'm going to explain how to scrape publicly available data and build knowledge graphs from scraped data, along with some key concepts from Natural Language Processing (NLP).

What is web scraping?

Web scraping (or web harvesting) is data scraping used to extract data from websites. The term typically refers to collecting data with a bot or web crawler. It's a form of copying in which specific data is gathered and copied from the web, typically into a local database or spreadsheet for later use or analysis.

You can do web scraping with online services or APIs, or you can write your own code to do it.

There are two key elements to web scraping:

- Crawler: The crawler is an algorithm that browses the web to search for particular data by following links across the internet.
- Scraper: The scraper extracts data from websites. The design of scrapers can vary a lot, depending on the complexity and scope of the project. Ultimately, it has to extract the data quickly and accurately.
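To make the crawler/scraper split concrete, here is a minimal sketch that uses only Python's standard library on an in-memory HTML string. The sample HTML and the `LinkAndHeadingParser` class are made up for illustration: the collected links are what a crawler would follow next, and the heading text is the kind of data a scraper would keep.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this HTML first.
SAMPLE_HTML = """
<html><body>
  <h1>Usain Bolt</h1>
  <p>See <a href="/wiki/100_metres">100 metres</a>
  and <a href="/wiki/200_metres">200 metres</a>.</p>
</body></html>
"""

class LinkAndHeadingParser(HTMLParser):
    """Collects link targets (crawler input) and <h1> text (scraped data)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h1> element.
        if self._in_h1 and data.strip():
            self.headings.append(data.strip())

parser = LinkAndHeadingParser()
parser.feed(SAMPLE_HTML)
print(parser.links)     # links a crawler could visit next
print(parser.headings)  # data the scraper extracted
```

In practice you would rarely hand-roll this; ready-made libraries handle downloading, parsing, and output formats for you.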
A good example of a ready-made library is the Wikipedia scraper library. It does a lot of the heavy lifting for you. You provide URLs that contain the required data, and it loads all the HTML from those sites. The scraper takes the data you need from this HTML code and outputs it in your chosen format. This can be an Excel spreadsheet, a CSV file, or a format like JSON.

Knowledge graph

The amount of content available on the web is already incredible, and it's expanding at an increasingly fast rate. Billions of web pages are linked on the World Wide Web, and search engines can go through those links and serve useful information with great precision and speed. This is in part thanks to knowledge graphs.

Different organizations have different knowledge graphs. For example, the Google Knowledge Graph is a knowledge base used by Google and its services to enhance search engine results with information gathered from a variety of sources. Facebook and Amazon use similar techniques in their products to improve the user experience and to store and retrieve useful information.

There's no formal definition of a knowledge graph (KG). Broadly speaking, a KG is a kind of semantic network with added constraints. Its scope, structure and characteristics, and even its uses aren't fully realized in the process of development.

Bringing knowledge graphs and machine learning (ML) together can systematically improve the accuracy of systems and extend the range of machine learning capabilities. Thanks to knowledge graphs, results inferred from machine learning models can be easier to explain and to trust.

Bringing knowledge graphs and ML together also creates some interesting opportunities. In cases where we might have insufficient data, KGs can be used to augment training data. One of the major challenges with ML models is explaining the predictions they make. Knowledge graphs can help overcome this issue by mapping explanations to the proper nodes in the graph and summarizing the decision-making process.
Another way to look at it is that a knowledge graph stores the data that results from an information extraction task. Many implementations of KGs make use of a concept called a triplet: a set of three items (a subject, a predicate, and an object) that we can use to store information about something.

Yet another explanation: knowledge graphs are a data science tool that deals with interconnected entities (organizations, people, events, places). Entities are nodes connected via edges. KGs have entity pairs that can be traversed to uncover meaningful connections in unstructured data.

Node A and Node B are 2 different entities. These nodes are connected by an edge that represents their relationship. This is the smallest KG we can build, also known as a triple. Knowledge graphs come in a variety of shapes and sizes.

Web scraping, computational linguistics, NLP algorithms, and graph theory (with Python code)

Phew, that's a wordy heading. Anyway, to build knowledge graphs from text, it's important to help our machine understand natural language. We do this with NLP techniques such as sentence segmentation, dependency parsing, part-of-speech (POS) tagging, and entity recognition.

The first step in building a KG is to collect your sources, so let's crawl the web for some information. Wikipedia will be our source (always check the sources of your data; a lot of information online is false).

For this blog, we'll be using the Wikipedia API, a direct Python wrapper. We'll also use Neptune to manage model-building metadata in a single place: log, store, display, organize, compare and query all your MLOps metadata.

Installation and setup

Install dependencies and scrape data:

    !pip install wikipedia-api neptune-client neptune-notebooks pandas spacy networkx scipy

For installing and setting up Neptune in your notebook, see Neptune's getting-started guide and the Neptune Jupyter extension guide.

The below function searches Wikipedia for a given topic and extracts information from the target page and its internal links.
    import wikipediaapi  # pip install wikipedia-api
    import pandas as pd
    import concurrent.futures
    from tqdm import tqdm

The below function lets you fetch the articles based on the topic you provide as an input to the function.

    def scrape_wikipedia(name_topic, verbose=True):
        def link_to_wikipedia(link):
            # Fetch one linked page; return None if it is missing or the request fails
            try:
                page = api_wikipedia.page(link)
                if page.exists():
                    return {'page': link, 'text': page.text, 'link': page.fullurl,
                            'categories': list(page.categories.keys())}
            except Exception:
                return None

        # Note: recent wikipedia-api releases may also require a user_agent argument here
        api_wikipedia = wikipediaapi.Wikipedia(language='en',
                                               extract_format=wikipediaapi.ExtractFormat.WIKI)
        name_of_page = api_wikipedia.page(name_topic)
        if not name_of_page.exists():
            print('Page {} is not present'.format(name_topic))
            return

        links_to_page = list(name_of_page.links.keys())
        procceed = tqdm(desc='Scraped links', unit='', total=len(links_to_page)) if verbose else None
        origin = [{'page': name_topic, 'text': name_of_page.text, 'link': name_of_page.fullurl,
                   'categories': list(name_of_page.categories.keys())}]

        # Download the linked pages in parallel
        with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
            links_future = {executor.submit(link_to_wikipedia, link): link for link in links_to_page}
            for future in concurrent.futures.as_completed(links_future):
                info = future.result()
                if info:
                    origin.append(info)
                if verbose:
                    procceed.update(1)
        if verbose:
            procceed.close()

        namespaces = ('Wikipedia', 'Special', 'Talk', 'LyricWiki', 'File', 'MediaWiki',
                      'Template', 'Help', 'User', 'Category talk', 'Portal talk')
        origin = pd.DataFrame(origin)
        # Keep pages with more than 20 characters of text that are not in a meta namespace
        origin = origin[(origin['text'].str.len() > 20)
                        & ~(origin['page'].str.startswith(namespaces, na=True))]
        # Strip the leading 'Category:' (9 characters) from each category name
        origin['categories'] = origin.categories.apply(lambda a: [b[9:] for b in a])
        origin['topic'] = name_topic
        print('Scraped pages', len(origin))
        return origin

Let's test the function on the topic “COVID-19”.
    wiki_data = scrape_wikipedia('COVID 19')

Output:

    Links Scraped: 100%|██████████| 1965/1965 [04:30<00:00, 7.25/s]
    Pages scraped: 1749

Save the data to CSV:

    wiki_data.to_csv('scraped_data.csv')

Import libraries:

    import spacy
    import pandas as pd
    import requests
    from spacy import displacy
    # import en_core_web_sm
    nlp = spacy.load('en_core_web_sm')
    from spacy.tokens import Span
    from spacy.matcher import Matcher
    import matplotlib.pyplot as plot
    from tqdm import tqdm
    import networkx as ntx
    import neptune.new as neptune
    %matplotlib inline

    run = neptune.init(api_token="your API key",
                       project="aravindcr/KnowledgeGraphs")

Upload data to Neptune:

    run["data"].upload("scraped_data.csv")

Load the data (it is also available on Neptune):

    data = pd.read_csv('scraped_data.csv')

View the data at the 10th row:

    data['text'][10]

Output:

    The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure
    developed by the UK Rapid Test Consortium and manufactured by Abingdon
    Health. It uses a lateral flow test to determine whether a person has IgG
    antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single
    drop of blood obtained from a finger prick and yields results in 20 minutes.

Sentence segmentation

The first step of building a knowledge graph is to split the text document or article into sentences. Then we limit our examples to simple sentences with one subject and one object.

    # Let's take part of the above extracted article
    docu = nlp('''The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by
    the UK Rapid Test Consortium and manufactured by Abingdon Health. It uses a lateral flow test to determine
    whether a person has IgG antibodies to the SARS-CoV-2 virus that causes COVID-19. The test uses a single
    drop of blood obtained from a finger prick and yields results in 20 minutes.\n\nSee also\nCOVID-19 rapid
    antigen test''')

    # Print each token with its dependency label
    for tokn in docu:
        print(tokn.text, "---", tokn.dep_)

If the pre-trained spaCy model used by spacy.load above isn't installed yet, download it first:

    python -m spacy download en_core_web_sm

The spaCy pipeline assigns word vectors, context-specific token vectors, part-of-speech tags, dependency parses, and named entities. By extending spaCy's pipeline of annotations you can also resolve coreferences (explained after the code, further below).
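The sentence-splitting step described above can be approximated without any model. The sketch below is a naive regular-expression stand-in, not spaCy's segmenter: `split_sentences` is a made-up helper that breaks text after `.`, `!` or `?` followed by whitespace, so it will mis-split abbreviations such as "e.g. this". With a loaded spaCy pipeline, iterating over `doc.sents` is the more reliable route.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

# Shortened sample based on the extract shown above.
sample = (
    "The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure. "
    "It uses a lateral flow test to determine whether a person has IgG antibodies. "
    "The test yields results in 20 minutes."
)

sentences = split_sentences(sample)
for sentence in sentences:
    print(sentence)
```

Each resulting sentence can then be passed to the entity and relation extraction steps one at a time.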
Knowledge graphs can be automatically constructed from part-of-speech tags and dependency parses. Extracting entity pairs from grammatical patterns is fast and scales to large amounts of text with the NLP library spaCy.

The function further below defines entity pairs as entities/noun chunks with subject-object dependencies connected by a root verb. Other approximations can be used to produce different types of connections. This kind of connection can be referred to as a subject-predicate-object triple.

Entity extraction

You can extract a single-word entity from a sentence with the help of part-of-speech (POS) tags. The nouns and proper nouns will be the entities.

However, when an entity spans multiple words, POS tags alone aren't sufficient. We need to parse the dependency tree of the sentence. To build a knowledge graph, the most important things are the nodes and the edges between them.

These nodes are going to be entities that are present in the Wikipedia sentences. Edges are the relationships connecting these entities. We will extract these elements in an unsupervised manner, i.e. we'll use the grammar of the sentences.

The idea is to go through a sentence token by token and build up the subject and the object as they are encountered.
    def extract_entities(sents):
        # chunk one: state carried across tokens
        enti_one = ""
        enti_two = ""
        dep_prev_token = ""  # dependency tag of previous token in sentence
        txt_prev_token = ""  # previous token in sentence
        prefix = ""
        modifier = ""

        for tokn in nlp(sents):
            # chunk two
            # move to next token if token is punctuation
            if tokn.dep_ != "punct":
                # check if token is a compound word or not
                if tokn.dep_ == "compound":
                    prefix = tokn.text
                    # add the current word to it if the previous word is 'compound'
                    if dep_prev_token == "compound":
                        prefix = txt_prev_token + " " + tokn.text

                # verify if token is a modifier or not
                if tokn.dep_.endswith("mod"):
                    modifier = tokn.text
                    # add it to the current word if the previous word is 'compound'
                    if dep_prev_token == "compound":
                        modifier = txt_prev_token + " " + tokn.text

                # chunk 3: subject found (nsubj, nsubjpass, csubj, ...)
                if "subj" in tokn.dep_:
                    enti_one = modifier + " " + prefix + " " + tokn.text
                    prefix = ""
                    modifier = ""
                    dep_prev_token = ""
                    txt_prev_token = ""

                # chunk 4: object found (dobj, pobj, iobj, ...)
                if "obj" in tokn.dep_:
                    enti_two = modifier + " " + prefix + " " + tokn.text

                # chunk 5: update variables
                dep_prev_token = tokn.dep_
                txt_prev_token = tokn.text

        return [enti_one.strip(), enti_two.strip()]

    extract_entities("The AbC-19 rapid antibody test is an immunological test for COVID-19 exposure developed by the UK Rapid Test")

Output:

    ['AbC-19 rapid antibody test', 'COVID-19 UK Rapid Test']

Now let's use the function to extract entity pairs from the first 800 text entries.

    pairs_of_entities = []
    for i in tqdm(data['text'][:800]):
        pairs_of_entities.append(extract_entities(i))

Subject-object pairs from the text:

    pairs_of_entities[36:42]

Output:

    [['where aluminium powder', 'such explosives manufacturing'],
     ['310 people', 'Cancer Research UK'],
     ['Structural External links', '2 PDBe KB'],
     ['which', '1 Medical Subject Headings'],
     ['Structural External links', '2 PDBe KB'],
     ['users', 'permanently taste']]

Relations extraction

With entity extraction, half the job is done. To build a knowledge graph, we need to connect the nodes (entities). These edges are the relations between pairs of nodes. The function below is capable of capturing such predicates from these sentences. I used spaCy's rule-based matching. The pattern defined in the function tries to find the ROOT word, or main verb, of the sentence.
    def obtain_relation(sent):
        doc = nlp(sent)
        matcher = Matcher(nlp.vocab)
        # ROOT word, optionally followed by a preposition, an agent word, or an adjective
        pattern = [{'DEP': 'ROOT'},
                   {'DEP': 'prep', 'OP': "?"},
                   {'DEP': 'agent', 'OP': "?"},
                   {'POS': 'ADJ', 'OP': "?"}]
        # spaCy v3 signature; on spaCy v2 use matcher.add("matching_1", None, pattern)
        matcher.add("matching_1", [pattern])
        matches = matcher(doc)
        # use the last match as the relation span
        h = len(matches) - 1
        span = doc[matches[h][1]:matches[h][2]]
        return span.text

The pattern above tries to find the root word in a sentence. Once the root is recognized, the pattern checks whether it is followed by a preposition or an agent word. If so, that word is added to the root word.

    relations = [obtain_relation(j) for j in tqdm(data['text'][:800])]

Most frequent relations extracted:

    pd.Series(relations).value_counts()[:50]

Let's build a knowledge graph

Now we can finally create a knowledge graph from the extracted entities.

Let's draw the network using the networkx library. We'll create a directed multigraph. In other words, the relations between any connected node pair are not two-way; they only go from one node to another. (Node size could be scaled by degree centrality; the code below uses a fixed size.)

    # subject extraction
    source = [j[0] for j in pairs_of_entities]

    # object extraction
    target = [k[1] for k in pairs_of_entities]

    data_kgf = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

We use the networkx library to create a network from the dataframe. Here nodes represent entities, and edges represent the relationships between them.

    # Create a directed multigraph from the dataframe
    graph = ntx.from_pandas_edgelist(data_kgf, "source", "target",
                                     edge_attr=True, create_using=ntx.MultiDiGraph())

    # plotting the network
    plot.figure(figsize=(14, 14))
    posn = ntx.spring_layout(graph)
    ntx.draw(graph, with_labels=True, node_color='green', edge_cmap=plot.cm.Blues, pos=posn)
    plot.show()

[Figure: knowledge graph of all extracted relations]

The graph above is too dense to show which relations were captured. Let's filter on a single relation to get a more readable graph. Here I am choosing “Information from”:

    graph = ntx.from_pandas_edgelist(data_kgf[data_kgf['edge'] == "Information from"], "source", "target",
                                     edge_attr=True, create_using=ntx.MultiDiGraph())

    plot.figure(figsize=(14, 14))
    posn = ntx.spring_layout(graph, k=0.5)  # k regulates the distance between nodes
    ntx.draw(graph, with_labels=True, node_color='green', node_size=1400, edge_cmap=plot.cm.Blues, pos=posn)
    plot.show()

[Figure: knowledge graph filtered on the relation “Information from”]

One more graph was produced by filtering on the relation name “links”.

Logging metadata

I have logged the above networkx graphs to Neptune under the paths shown below. Log your own images to different paths depending on the output you obtain.

    run['graphs/all_in_graph'].upload('graph.png')
    run['graphs/filtered_relations'].upload('info.png')
    run['graphs/filtered_relations2'].upload('links.png')

Coreference resolution

To obtain more refined graphs, you can also use coreference resolution.

Coreference resolution is the NLP equivalent of the endophoric awareness used in information retrieval systems, conversational agents, and virtual assistants like Alexa. It's the task of clustering the mentions in a text that refer to the same underlying entity.

In the source's example, “I”, “my”, and “she” belong to the same cluster, and “Joe” and “he” belong to the same cluster.

Algorithms that resolve coreferences commonly look for the nearest preceding mention that's compatible with the referring expression. Instead of using rule-based dependency parse trees, neural networks can also be trained; these take word embeddings and the distance between mentions into account as features.

This significantly improves entity pair extraction by normalizing text, removing redundancies, and resolving pronouns to the entities they refer to.

If your use case is domain-specific, it would be worth your while to train a custom entity recognition model.

Knowledge graphs can be built automatically and explored to reveal new insights about the domain.

The notebook is available on Neptune and on GitHub.

Knowledge graphs at scale

To effectively use the entire corpus of 1749 pages for our topic, use the columns created in the scrape_wikipedia function to add properties to each node. Then you can track the page and category of each node. You can use multiprocessing and parallel processing to reduce execution time.

Some of the use cases of KGs are:

- Question answering
- Storing information
- Recommendation systems
- Supply chain management
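As a rough sketch of what tracking the page and category of each node could look like, the snippet below stores node properties in plain dictionaries. The `extracted` rows and the `page_categories` lookup are hypothetical stand-ins for the scraper's DataFrame columns and the extracted triples, and `nodes_in_category` is a made-up helper. With networkx, the same properties could be passed as keyword arguments to `add_node`.

```python
from collections import defaultdict

# Hypothetical extraction results: (subject, relation, object, source page).
extracted = [
    ("AbC-19 rapid antibody test", "developed by", "UK Rapid Test Consortium", "Page A"),
    ("test", "uses", "single drop", "Page A"),
    ("entity x", "relates to", "entity y", "Page B"),
]

# Hypothetical page -> categories lookup, as the 'categories' column would provide.
page_categories = {
    "Page A": ["Category 1"],
    "Page B": ["Category 2"],
}

# Each node remembers which pages and categories it was seen in.
nodes = defaultdict(lambda: {"pages": set(), "categories": set()})
edges = []

for subj, rel, obj, page in extracted:
    for entity in (subj, obj):
        nodes[entity]["pages"].add(page)
        nodes[entity]["categories"].update(page_categories.get(page, []))
    edges.append((subj, obj, {"edge": rel, "page": page}))

def nodes_in_category(category):
    """Return the names of all nodes seen on a page in the given category."""
    return sorted(name for name, props in nodes.items()
                  if category in props["categories"])

print(nodes_in_category("Category 1"))
```

Keeping the source page on every edge also makes it possible to trace an extracted relation back to where it was found.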
Challenges ahead

Entity disambiguation and managing identity

In its simplest form, the challenge is assigning a unique normalized identity and a type to an utterance or a mention of an entity.

Many entities extracted automatically have very similar surface forms, such as people with the same or similar names, or movies, songs, and books with the same or similar titles. Two products with similar names may refer to different listings. Without correct linking and disambiguation, entities will be associated with the wrong facts and lead to incorrect inferences downstream.

Type membership and resolution

Most knowledge-graph systems today allow each entity to have multiple types, with specific types for different circumstances. Cuba can be a country, or it can refer to the Cuban government. In some cases, knowledge-graph systems defer the type assignment to runtime. Each entity describes its attributes, and the application uses a specific type and collection of attributes depending on the user task.

Managing changing knowledge

An effective entity-linking system needs to grow organically based on its ever-changing input data. For example, companies may merge or split, and new scientific discoveries may break a single existing entity into multiple entities.

When a company acquires another company, does the acquiring company change its identity? Does identity follow the acquisition of the rights to a name? Similarly, in KGs constructed for the healthcare industry, patient data will change over time.

Knowledge extraction from multiple structured and unstructured sources

The extraction of structured knowledge (which includes entities, their types, attributes, and relationships) remains a challenge across the board. Growing graphs at scale requires a mix of manual approaches and unsupervised and semi-supervised knowledge extraction from unstructured data in open domains.

Managing operations at scale

Managing scale is the underlying challenge that directly affects several operations related to performance and workload. It also manifests itself indirectly, as it affects other operations, such as managing fast incremental updates to large-scale knowledge graphs.
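To illustrate the simplest form of the identity challenge, here is a toy sketch that hands out one ID per combination of normalized surface form and type. The `EntityRegistry` class and its normalization rule (lowercasing and collapsing whitespace) are assumptions made for illustration; real entity-linking systems rely on context and far richer signals.

```python
from itertools import count

class EntityRegistry:
    """Toy registry: one ID per (normalized surface form, type) pair."""

    def __init__(self):
        self._ids = {}
        self._next_number = count(1)

    @staticmethod
    def normalize(mention):
        # Lowercase and collapse whitespace; real systems do far more than this.
        return " ".join(mention.lower().split())

    def resolve(self, mention, entity_type):
        """Return the existing ID for this mention and type, or create a new one."""
        key = (self.normalize(mention), entity_type)
        if key not in self._ids:
            self._ids[key] = f"E{next(self._next_number)}"
        return self._ids[key]

registry = EntityRegistry()
cuba_country = registry.resolve("Cuba", "country")
cuba_again = registry.resolve("  cuba ", "country")
cuba_government = registry.resolve("Cuba", "government")

# The two country mentions share an ID; the government reading gets its own.
print(cuba_country, cuba_again, cuba_government)
```

Even this toy version shows why the type matters: the same surface form resolves to different identities depending on how the mention is typed.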
Note: for more details on how different tech giants implement industry-scale knowledge graphs in their products, and on the related challenges, see the article on industry-scale knowledge graphs listed in the references.

Natural Language Processing

Natural Language Processing (NLP) is a subfield of computer science concerned with enabling computers to process and understand human language. Technically, the main task of NLP is to program computers to analyze and process huge amounts of natural language data.

Language is studied in various academic disciplines. Each discipline comes with its own set of problems and a set of solutions to address them.

Ambiguity in language

In NLP, ambiguity means that a text can be understood in more than one way. Natural language is ambiguous. NLP deals with the following kinds of ambiguity:

- Lexical ambiguity is the ambiguity of a single word. For example, the word “well” can be an adverb, a noun, or a verb.
- Syntactic ambiguity is the presence of 2 or more possible meanings within a single sentence or sequence of words. For example, “the chicken is ready to eat” either means the chicken is cooked and can be eaten now, or the chicken is ready to be fed.
- Anaphoric ambiguity is about referring backward (or to an entity in another context) in a text. A phrase or word refers to something previously mentioned, but there's more than one possibility. For example, “Margaret invited Susan for a visit, and she gave her a good meal.” (she = Margaret; her = Susan). “Margaret invited Susan for a visit, but she told her she had to go to work.” (she = Susan; her = Margaret).
- Pragmatic ambiguity arises when the context doesn't pin down what a sentence means, so the same words allow multiple interpretations.

Text similarity metrics in NLP

Text similarity is used to determine how similar two text documents are in terms of their context or meaning. There are various similarity metrics, such as:

- Cosine similarity
- Euclidean distance
- Jaccard similarity

Each of these metrics measures the similarity between two texts in its own way.

Cosine similarity

Cosine similarity is a metric that measures the text similarity between 2 documents, irrespective of their size. Each word is represented in vector form, and text documents are represented in an n-dimensional vector space.
Cosine similarity measures the cosine of the angle between two n-dimensional vectors. For vectors built from word counts, which have no negative components, the cosine similarity of two documents ranges from 0 to 1. A score of 1 means the 2 vectors have the same orientation. A value closer to 0 indicates that the 2 documents have less similarity.

The mathematical equation of cosine similarity between two non-zero vectors A and B is:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Cosine similarity is often a better metric than Euclidean distance for text: two documents can be far apart by Euclidean distance (for example, because one is much longer) and still be close to each other in terms of their context.

Jaccard similarity

Jaccard similarity is also known as the Jaccard index and Intersection over Union.

Jaccard similarity is used to determine the similarity between two text documents, i.e. how many of all their words the two documents have in common.

It is defined as the size of the intersection of the two documents' word sets divided by the size of their union, that is, the number of common words over the total number of unique words.

The mathematical representation of the Jaccard similarity of two word sets A and B is:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard similarity score ranges from 0 to 1. If two documents are identical, the Jaccard similarity is 1. The score is 0 if there are no common words between the two documents.

Python code to find Jaccard similarity

    def jaccard_similarity(doc1, doc2):
        # list the unique words in each document
        words_doc1 = set(doc1.lower().split())
        words_doc2 = set(doc2.lower().split())

        # find the intersection of the word sets of doc1 & doc2
        intersection = words_doc1.intersection(words_doc2)

        # find the union of the word sets of doc1 & doc2
        union = words_doc1.union(words_doc2)

        # calculate the Jaccard similarity score:
        # size of the intersection set divided by size of the union set
        return float(len(intersection)) / len(union)

    docu_1 = "Work from home is the new normal in digital world"
    docu_2 = "Work from home is normal"

    jaccard_similarity(docu_1, docu_2)

Output: 0.5

The Jaccard similarity between docu_1 and docu_2 is 0.5.

The three methods above share the same assumption: documents (or sentences) are similar if they have words in common. This idea is very straightforward. It fits some basic cases, such as comparing the two sentences above.
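For comparison with the Jaccard function, cosine similarity over raw word counts can be sketched with the standard library alone. This is a minimal illustration under a simple bag-of-words assumption (whitespace tokenization, lowercase, no stop-word removal or weighting); the `cosine_similarity` helper is written for this example and is not taken from a library.

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine of the angle between the word-count vectors of two documents."""
    counts1 = Counter(doc1.lower().split())
    counts2 = Counter(doc2.lower().split())

    # Dot product over the words the two documents share.
    dot = sum(counts1[word] * counts2[word] for word in counts1.keys() & counts2.keys())

    # Euclidean norm of each count vector.
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))

    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm1 * norm2)

docu_1 = "Work from home is the new normal in digital world"
docu_2 = "Work from home is normal"

score = cosine_similarity(docu_1, docu_2)
print(round(score, 3))  # 0.707
```

The two documents share 5 words; their count vectors have norms of sqrt(10) and sqrt(5), so the score is 5 / sqrt(50), higher than the Jaccard score of 0.5 for the same pair.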
However, the scores can be relatively low when you compare sentences that use different words to describe the same news (try this with different sentences that convey the same meaning, using the Python function above).

Another limitation is that the above methods don't handle synonyms. For example, “buy” and “purchase” should have the same meaning (in some cases), but the above methods will treat the two words as different.

So what's the workaround? You can use word embeddings (Word2vec, GloVe, FastText).

For some of the basic concepts and use cases of NLP, here are some articles I've written on Medium, and one on Neptune's blog, for reference:

- Natural Language Processing using spaCy
- Topic Modeling using Gensim-LDA in Python
- Similarity Queries and Text Summarization in NLP
- Word Embeddings in NLP | Word2Vec | GloVe | fastText
- Building a Search Engine With Pre-Trained Transformers: A Step By Step Guide (this explains how to build vector-based search engines and how to fine-tune BERT for the downstream task)

Conclusion

I hope you've learned something new here, and that this article helped you understand web scraping, knowledge graphs, and a few useful NLP concepts.

Thanks for reading, and keep on learning!

References:

- https://studymachinelearning.com/jaccard-similarity-text-similarity-metric-in-nlp/
- https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext?mobile=false

Aravind CR
Machine Learning Engineer at OptiSol DataLabs
Data science professional with experience in predictive modeling, data processing, chatbots and data mining algorithms to solve challenging business problems. Passionate about solving problems using advanced Natural Language Processing and Machine Learning.


