An Introduction to Knowledge Graphs | SAIL Blog
Knowledge Graphs (KGs) have emerged as a compelling abstraction for organizing the world’s structured knowledge, and as a way to integrate information extracted from multiple data sources. Knowledge graphs have started to play a central role in representing the information extracted using natural language processing and computer vision. Domain knowledge expressed in KGs is being input into machine learning models to produce better predictions. Our goals in this blog post are to (a) explain the basic terminology, concepts, and usage of KGs, (b) highlight recent applications of KGs that have led to a surge in their popularity, and (c) situate KGs in the overall landscape of AI. This blog post is a good starting point before reading a more extensive survey or following research seminars on this topic.

Knowledge Graph Definition

A directed labeled graph is a 4-tuple G = (N, E, L, f), where N is a set of nodes, E ⊆ N × N is a set of edges, L is a set of labels, and f: E → L is an assignment function from edges to labels. An assignment of a label B to an edge e = (A, C) can be viewed as a triple (A, B, C) and visualized as shown in Figure 1.

A knowledge graph is a directed labeled graph in which we have associated domain-specific meanings with nodes and edges. Anything can act as a node, for example, people, company, computer, etc. An edge label captures the relationship of interest between the nodes, for example, a friendship relationship between two people, a customer relationship between a company and a person, or a network connection between two computers, etc.

The directed labeled graph representation is used in a variety of ways depending on the needs of an application. A directed labeled graph such as the one in which the nodes are people, and the edges capture the parent relationship, is also known as a data graph. A directed labeled graph in which the nodes are classes of objects (e.g., Book, Textbook, etc.), and the edges capture the subclass relationship, is also known as a taxonomy. In some data models, given a triple (A, B, C), we refer to A, B, C as the subject, the predicate, and the object of the triple respectively.
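The definition above can be sketched in a few lines of code. The following is a minimal Python sketch rather than a production data model; the class name `LabeledGraph`, its methods, and the example people are hypothetical. It stores the assignment function f as a dictionary from edges to labels, so, as in the formal definition, each ordered pair of nodes carries exactly one label.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Directed labeled graph G = (N, E, L, f); all names here are illustrative."""
    nodes: set = field(default_factory=set)           # N
    labels: set = field(default_factory=set)          # L
    edge_label: dict = field(default_factory=dict)    # f: E -> L, keyed by (source, target)

    def add_triple(self, subject, predicate, obj):
        """Record the triple (subject, predicate, object) as a labeled edge."""
        self.nodes.update((subject, obj))
        self.labels.add(predicate)
        self.edge_label[(subject, obj)] = predicate

    def triples(self):
        """Return every edge as a (subject, predicate, object) triple, sorted."""
        return sorted((s, p, o) for (s, o), p in self.edge_label.items())


# A tiny data graph with made-up people.
g = LabeledGraph()
g.add_triple("Alice", "friend", "Bob")
g.add_triple("Bob", "friend", "Alice")
g.add_triple("Carol", "parent", "Alice")

for subject, predicate, obj in g.triples():
    print(subject, predicate, obj)
```

Practical KG stores usually allow several differently labeled edges between the same pair of nodes, which amounts to storing the set of triples directly instead of a function on edges.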
A knowledge graph serves as a data structure in which an application stores information. The information could be added to the knowledge graph through a combination of human input, automated and semi-automated methods. Regardless of the method of knowledge entry, it is expected that the recorded information can be easily understood and verified by humans.

Many interesting computations over a graph can be reduced to navigating it. For example, in a friendship KG, to calculate the friends of friends of a person A, we can navigate the graph from A to all nodes B connected to it by a relation labeled as friend, and then recursively to all nodes C connected by the friend relation to each B.

Recent Applications of Knowledge Graphs

Use of directed labeled graphs as a data structure for storing information, and the use of graph algorithms to manipulate that information is not new. Within computer science, there have been many uses of a directed graph representation, for example, data flow graphs, binary decision diagrams, state charts, etc. We consider here two concrete applications that have led to a recent surge in the popularity of knowledge graphs: organizing information over the internet and data integration in enterprises. While discussing these applications, we also highlight what is new and different in the use of knowledge graphs.

Organizing Knowledge over the Internet

Consider the Google search for “Winterthur Zurich” which returns the result shown in the left panel of Figure 2 and a relevant portion from Wikipedia in the panel on the right. The portion of the Wikipedia page shown in the panel on the right is also known as an Infobox.

Figure 2: An example use of a knowledge graph in the results of a web search

As part of the search results, we see facts such as Winterthur is in the country Switzerland, its elevation is 430 meters, etc. This information is directly extracted from the Infoboxes from the Wikipedia page for Winterthur. Some of the data in the Wikipedia Infoboxes is populated by querying a KG called Wikidata. The data from a KG can enhance the web search in even deeper ways than illustrated in the above example, as we next discuss.
The Wikipedia page for Winterthur lists its twin towns: two are in Switzerland, one in the Czech Republic, and one in Austria. The city of Ontario in California, whose Wikipedia page is titled Ontario, California, lists Winterthur as its sister city, even though Winterthur's list does not include Ontario. Sister city and twin city relationships are identical as well as reciprocal. Thus, if a city A is a sister (twin) city of another city B, then B must be a sister (twin) city of A. As “Sister cities” and “Twin towns” are section headings in Wikipedia, with no definition or relationship specified between the two, it is difficult to detect this discrepancy. In contrast, in the Wikidata representation of Winterthur, there is a relationship called twinned administrative body that lists the city of Ontario. As this relationship is defined to be a symmetric relationship in the KG, the Wikidata page for the city of Ontario automatically includes Winterthur. Wikidata solves the problem of identifying equivalent relationships through the effort of its curators, and by using a KG as a storage and inference mechanism. To the degree the Wikidata KG is fully integrated into Wikipedia, missing-link discrepancies like the one in this example will naturally disappear. We can visualize the two-way relationship between Winterthur and Ontario in Figure 3. The KG in Figure 3 also shows other objects to which Winterthur and Ontario are connected.

Figure 3: A fragment of the Wikidata knowledge graph

Wikidata includes data from several independent providers such as the Library of Congress. By using the Wikidata identifier for Winterthur, the information released by the Library of Congress can be easily linked with other information about Winterthur present in Wikidata. Wikidata makes it easy to establish such links by publishing the definitions of relationships used in it in Schema.Org.
A well-documented list of relations in Schema.Org, also known as the relation vocabulary, gives us at least two advantages. First, it is easier to write queries that span multiple datasets because queries can be framed using relations that are common to those sources. Without such common relationships across multiple sources, we would need to determine semantic relationships between them and provide appropriate translations. One example of a query that goes across multiple sources is: display on a map the birth cities of people who died in Winterthur. Second, search engines can use such queries to retrieve information from the KG and display the query results as shown in Figure 2. Use of structured information returned in the search results is now a standard feature for the leading search engines.

A recent version of Wikidata had over 90 million objects, with over one billion relationships among those objects. Wikidata makes connections across more than 4872 different catalogs in 414 different languages published by independent data providers. As per a recent estimate, 31% of websites and over 12 million data providers are currently using the vocabulary of Schema.Org to publish annotations to their web pages.

What is particularly new and exciting about the Wikidata knowledge graph? First, it is a graph of unprecedented scale, and is one of the largest knowledge graphs available today. Second, even though Wikidata is manually curated, the cost of curation is shared by a community of contributors. Third, some of the data in Wikidata may come from automatically extracted information, but it must be easily understood and verified as per the Wikidata editorial policies. Fourth, there is an explicit effort to provide semantic definitions of different relation names through the vocabulary in Schema.Org. Finally, the primary driving use case for Wikidata is to improve web search. Even though Wikidata has several applications using it for analysis and visualization, its use over the web continues to be the most compelling and easily understood application.
Data Integration in Enterprises

Figure 4: 360-degree view of a customer is created by integrating external data with internal company information

Many financial institutions are interested in better managing their customer relationships through a 360-degree view, i.e., a view that integrates external information about a customer with internal information about the same customer. For example, one can integrate publicly available information from financial news, commercially sourced and curated data about supply chain relationships with internal customer information to create such a 360-degree view. To understand how such a view is useful, let us consider an example scenario. Financial news reports that “Acma Retail Inc” has filed for bankruptcy due to the pandemic, because of which many of its suppliers will face financial stress. Such stress can pass deep down into its supply chain and trigger financial difficulties for other clients. For example, if a company A that is a supplier for Acma is undergoing financial stress, a similar stress will be experienced by companies that are suppliers of A. Such supply chain relationships are curated as part of a commercially available dataset called Factset. In a 360-degree view, the data from Factset and the financial news are integrated with the internal customer databases. The resulting KG accurately tracks the Acma supply chain, identifies stressed suppliers with different revenue exposure, and identifies companies whose risk may be worth monitoring.
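Tracing stress through the supply chain is the same kind of graph navigation as the friends-of-friends computation described earlier. The sketch below illustrates it in Python with a breadth-first traversal; it is an assumption-laden illustration, not how any commercial product works. The `supplier_of` relation name, the function `upstream_suppliers`, and every company except Acma are made up for the example.

```python
from collections import deque

# Hypothetical triples; only "Acma" comes from the scenario in the text.
triples = [
    ("A", "supplier_of", "Acma"),
    ("B", "supplier_of", "A"),
    ("C", "supplier_of", "B"),
    ("D", "supplier_of", "Acma"),
    ("E", "supplier_of", "Other"),
]


def upstream_suppliers(triples, company, relation="supplier_of"):
    """Return {supplier: tier}; tier 1 supplies `company` directly, tier 2 supplies a tier-1 supplier, etc."""
    # Index the graph: for each company, who supplies it?
    suppliers_of = {}
    for subject, predicate, obj in triples:
        if predicate == relation:
            suppliers_of.setdefault(obj, []).append(subject)

    tiers = {}
    queue = deque([(company, 0)])
    while queue:
        node, depth = queue.popleft()
        for supplier in suppliers_of.get(node, []):
            # Skip companies already reached, which also protects against cycles.
            if supplier not in tiers and supplier != company:
                tiers[supplier] = depth + 1
                queue.append((supplier, depth + 1))
    return tiers


exposed = upstream_suppliers(triples, "Acma")
print(exposed)
```

In a real 360-degree view, each returned supplier would then be joined with internal customer records, for example to estimate revenue exposure by tier.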
To create the 360-degree view of a customer, the data integration process begins with business analysts sketching out a schema of the key entities, events and the relationships they are interested in tracking. The visual nature of KG schemas makes it easier for the business experts to engage and specify their requirements. The data from the individual sources is then loaded into a knowledge graph engine. The storage format of triples allows us to translate only those relationships that are of immediate relevance to the schema defined by the business domain experts. The rest of the data can still be loaded as triples but does not require us to incur the upfront cost of relating it to the defined schema. As KGs use a generic schema of triples, changing requirements during the analysis process are easier to incorporate. Finally, the storage format mirrors the schema that the domain experts define.

What is particularly new and exciting about the use of knowledge graphs for data integration? First, a generic schema of triples substantially reduces the cost of starting a data integration project. Second, it is much easier to adapt a triple-based schema in response to changes than the comparable effort required to adapt a traditional relational database. Third, and finally, modern KG engines are highly optimized for answering questions that require traversing the graph relationships in the data. For the example schema of Figure 5, a graph engine has built-in operations to identify the central suppliers in a supply chain network, closely related groups of customers or suppliers, and spheres of influence of different suppliers. All of these computations leverage domain-independent graph algorithms such as centrality detection and community detection. Because of the ease of creating and visualizing the schema, and the built-in analytics operations, KGs are becoming a popular solution for turning data into intelligence.
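To give a feel for what identifying central suppliers can mean, here is a minimal Python sketch of degree centrality, the simplest centrality measure. Graph engines ship their own optimized implementations and typically offer richer measures; this is not any engine's API. The supplier and customer names and the function `out_degree_centrality` are hypothetical.

```python
from collections import Counter

# Hypothetical (supplier, customer) edges in a supply chain network.
supplies = [
    ("S1", "C1"), ("S1", "C2"), ("S1", "C3"),
    ("S2", "C3"),
    ("S3", "C1"), ("S3", "C2"),
]


def out_degree_centrality(edges):
    """Fraction of the other nodes that each node points to directly."""
    nodes = {node for edge in edges for node in edge}
    out_degree = Counter(source for source, _ in edges)
    denominator = max(len(nodes) - 1, 1)  # avoid dividing by zero on tiny graphs
    return {node: out_degree[node] / denominator for node in nodes}


scores = out_degree_centrality(supplies)
most_central = max(scores, key=scores.get)
print(most_central, scores[most_central])
```

Degree centrality only looks at direct neighbors; measures that account for indirect influence would be needed to capture the spheres of influence mentioned above.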
Knowledge Graphs in Artificial Intelligence

AI agents maintain representations of the real world and use them for reasoning. Coming up with a good representation is a problem central to AI as it allows an agent to store information and derive new conclusions from it. We begin this section with a quick review of the previous work on knowledge representation in AI, situate KGs within that context, and then provide more details about how modern AI algorithms use KGs to store their output as well as consume them to incorporate domain knowledge.

Knowledge graphs, also known as semantic networks in the context of AI, have been used as a store of world knowledge for AI agents since the early days of the field, and have been applied in all areas of computer science. There are many other schemes that parallel semantic networks, such as conceptual graphs, description logics, and rule languages. In some cases, probabilistic graphical models can capture uncertain knowledge.

A widely known application of approaches that originated from semantic networks is in capturing ontologies. An ontology is a formal specification of the relationships that are used in a knowledge graph. For example, in Figure 3, the concepts such as City, Country, etc. and relationships such as part of, same as, etc., and their formal definitions constitute an ontology. Using this ontology, we can draw inferences such as Winterthur is located in Switzerland.

To make the internet more intelligent, the World Wide Web Consortium (W3C) standardized a family of knowledge representation languages that are now widely used for capturing knowledge on the internet. These languages include the Resource Description Framework (RDF), the Web Ontology Language (OWL), and the Semantic Web Rule Language (SWRL).

The prior work on knowledge representation in AI that we have just reviewed has been driven in a top-down manner, that is, we first develop a model of the world, and then use reasoning algorithms to draw conclusions from it. Currently, there is a surge of activity on bottom-up approaches to AI, that is, developing algorithms that can process the data from which algorithms can draw conclusions and insights. For the rest of the section, we will discuss the role KGs are playing both in storing the learned knowledge, and in providing a source of domain knowledge input to the AI algorithms.
Knowledge Graphs as the output of Machine Learning

Even though Wikidata has had success in engaging a community of volunteer curators, manual creation of knowledge graphs is, in general, expensive. Therefore, any automation we can achieve for creating a knowledge graph is highly desired. Until a few years ago, both natural language processing (NLP) and computer vision (CV) algorithms were struggling to do well on entity recognition from text and object detection from images. Because of recent progress, these algorithms are starting to move beyond the basic recognition tasks to extracting relationships among objects necessitating a representation in which the extracted relations could be stored for further processing and reasoning. We will now discuss how the automation possible through NLP and CV techniques is facilitating the creation of knowledge graphs.

Entity extraction and relation extraction from text are two fundamental tasks in NLP. Methods for performing entity and relation extraction include rule-based methods, and machine learning. The rule-based approaches leverage the syntactical structure of the sentence or specify how entities or relationships could be identified in the input text. The machine learning approaches leverage sequence labeling algorithms or language models for both entity and relation extraction.

The extracted information from multiple portions of the text needs to be correlated, and knowledge graphs provide a natural medium to accomplish such a goal. For example, from the sentence shown in Figure 6, we can extract the entities Albert Einstein, Germany, Theoretical Physicist, and Theory of Relativity; and the relations born in, occupation and developed. Once this snippet of knowledge is incorporated into a larger KG, we can use logical inference to get additional links (shown by dotted edges) such as a Theoretical Physicist is a kind of Physicist who practices Physics, and that Theory of Relativity is a branch of Physics.
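The step from extracted triples to inferred links can be sketched as follows. This is a minimal Python illustration under stated assumptions: the extracted triples mirror the entities and relations named above, the background triples stand in for the larger KG, and the two inference rules as well as the `field` relation are invented for the example and are not taken from the figure.

```python
# Triples an extractor might produce from the example sentence.
extracted = {
    ("Albert Einstein", "born in", "Germany"),
    ("Albert Einstein", "occupation", "Theoretical Physicist"),
    ("Albert Einstein", "developed", "Theory of Relativity"),
}

# Triples assumed to exist already in the larger KG.
background = {
    ("Theoretical Physicist", "kind of", "Physicist"),
    ("Physicist", "practices", "Physics"),
    ("Theory of Relativity", "branch of", "Physics"),
}


def infer(triples):
    """Forward-chain two illustrative rules until no new triple is produced."""
    kg = set(triples)
    while True:
        new = set()
        for x, p1, y in kg:
            for y2, p2, z in kg:
                if y != y2:
                    continue
                # Rule 1 (assumed): an occupation is inherited along "kind of" links.
                if p1 == "occupation" and p2 == "kind of":
                    new.add((x, "occupation", z))
                # Rule 2 (assumed): an occupation that practices a field places the person in that field.
                if p1 == "occupation" and p2 == "practices":
                    new.add((x, "field", z))
        if new <= kg:  # nothing new was derived, so a fixed point is reached
            return kg
        kg |= new


kg = infer(extracted | background)
for triple in sorted(kg - extracted - background):
    print(triple)
```

The nested loop is quadratic in the number of triples, which is fine for a toy example; real reasoners index triples by subject and predicate.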
A holy grail of computer vision is the complete understanding of an image, that is, detecting objects, describing their attributes, and recognizing their relationships. Understanding images would enable important applications such as image search, question answering, and robotic interactions. Much progress has been made in recent years towards this goal, including image classification and object detection. Computer vision algorithms make heavy use of machine learning methods such as classification, clustering, nearest neighbors, and the deep learning methods such as recurrent neural networks.

From the image shown in Figure 7, an image understanding system should produce a KG shown to the right. The nodes in the knowledge graph are the outputs of an object detector. Current research in computer vision focuses on developing techniques that can correctly infer the relationships between the objects, such as, man holding a bucket, and horse feeding from the bucket, etc. The KG shown to the right is an example of a knowledge graph which provides foundation for visual question answering.

Knowledge Graphs as input to Machine Learning

Machine learning algorithms can perform better if they can incorporate domain knowledge. KGs are a useful data structure for capturing domain knowledge, but machine learning algorithms require that any symbolic or discrete structure, such as a graph, should first be converted into a numerical form. We can convert symbolic inputs into a numerical form using a technique known as embeddings. To illustrate this, we will consider word embeddings and graph embeddings.

Word embeddings were originally developed for calculating similarity between words. To understand the word embeddings, we consider the following set of sentences.

I like knowledge graphs.
I like databases.
I enjoy running.
In the above set of sentences, we count how often a word appears next to another word and record the counts in a matrix. For example, the word I appears next to the word like twice, and next to the word enjoy once, and therefore, its counts for these two words are 2 and 1 respectively, and 0 for every other word. We can calculate the counts for the other words in a similar manner as shown in Table 1. Such a matrix is often referred to as word co-occurrence counts. The meaning of each word is captured by the vector in the row corresponding to that word. To calculate similarity between words, we calculate the similarity between the vectors corresponding to them. In practice, we are interested in text that may contain millions of words, and a more compact representation is desired. As the co-occurrence matrix is sparse, we can use techniques from linear algebra (e.g., singular value decomposition) to reduce its dimensions. The resulting vector corresponding to a word is known as its word embedding. Typical word embeddings in use today rely on vectors of length 200.

| counts    | I | like | enjoy | knowledge | graphs | databases | running | . |
|-----------|---|------|-------|-----------|--------|-----------|---------|---|
| I         | 0 | 2    | 1     | 0         | 0      | 0         | 0       | 0 |
| like      | 2 | 0    | 0     | 1         | 0      | 1         | 0       | 0 |
| enjoy     | 1 | 0    | 0     | 0         | 0      | 0         | 1       | 0 |
| knowledge | 0 | 1    | 0     | 0         | 1      | 0         | 0       | 0 |
| graphs    | 0 | 0    | 0     | 1         | 0      | 0         | 0       | 1 |
| databases | 0 | 1    | 0     | 0         | 0      | 0         | 0       | 1 |
| running   | 0 | 0    | 1     | 0         | 0      | 0         | 0       | 1 |
| .         | 0 | 0    | 0     | 0         | 1      | 1         | 1       | 0 |

Table 1: Matrix of co-occurrence counts

A sentence is a sequence of words, and word embeddings calculate co-occurrences of words in it. We can generalize this idea to node embeddings for a graph in the following manner: (a) traverse the graph using a random walk, giving us a path through the graph; (b) obtain a set of paths through repeated traversals of the graph; (c) calculate co-occurrences of nodes on these paths just like we calculated co-occurrences of words in a sentence; (d) each row in the matrix of co-occurrence counts gives us a vector for the node corresponding to it; (e) use suitable dimensionality reduction techniques to obtain a smaller vector, which is referred to as a node embedding.

We can encode the whole graph into a vector which is known as its graph embedding. There are many approaches to calculate graph embeddings, but perhaps the simplest approach is to add the vectors representing node embeddings for each of the nodes in the graph to obtain a vector representing the whole graph.
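The counting procedure described above, for sentences and for random walks alike, can be sketched in Python as follows. This is a simplified illustration that uses only the standard library and skips the dimensionality reduction step, so the vectors are raw co-occurrence rows rather than compact embeddings. The friendship graph, the function names, and the walk parameters are assumptions made for the example.

```python
import random
from collections import defaultdict


def cooccurrence(sequences):
    """Count how often two items appear next to each other in any sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            counts[left][right] += 1
            counts[right][left] += 1
    return counts


# Word co-occurrences for the three example sentences (the period is a token).
sentences = [
    ["I", "like", "knowledge", "graphs", "."],
    ["I", "like", "databases", "."],
    ["I", "enjoy", "running", "."],
]
word_counts = cooccurrence(sentences)
vocab = ["I", "like", "enjoy", "knowledge", "graphs", "databases", "running", "."]
row_I = [word_counts["I"][w] for w in vocab]  # row for the word "I"

# Hypothetical undirected friendship graph as an adjacency list.
graph = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Cat"],
    "Cat": ["Ann", "Bob", "Dan"],
    "Dan": ["Cat"],
}


def random_walks(graph, walks_per_node=10, walk_length=5, seed=0):
    """Steps (a) and (b): repeated random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks


# Steps (c) and (d): node co-occurrence counts and one row vector per node.
node_counts = cooccurrence(random_walks(graph))
nodes = sorted(graph)
node_vectors = {n: [node_counts[n][m] for m in nodes] for n in nodes}

# Simplest graph-level vector: the element-wise sum of the node vectors.
graph_vector = [sum(node_vectors[n][i] for n in nodes) for i in range(len(nodes))]
print(row_I)
print(graph_vector)
```

Reusing one `cooccurrence` function for both cases makes the analogy concrete: a random walk plays the role of a sentence, and a node plays the role of a word.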
We used the example of word embeddings as a precursor to explaining graph embeddings primarily for pedagogical purposes. Indeed, both have similar objectives: while word embeddings capture the meaning of words and help calculate similarity between them, node embeddings capture the meaning of nodes in a graph and help calculate similarity between them. There is also a great deal of similarity between the methods used for calculating them.

Word embeddings and graph embeddings are a way to give a symbolic input to a machine learning algorithm. A common application of word embeddings is to learn a language model that can predict what word is likely to appear next in a sequence of words. A more advanced application of word embeddings is to use them with a KG: for example, the embedding for a more frequent word could be reused for a less frequent word as long as the knowledge graph encodes that the less frequent word is its hyponym. A straightforward use for the graph embeddings calculated from a friendship graph is to recommend new friends. A more advanced use of graph embeddings involves link prediction; for example, in a company graph, we can use link prediction to identify potential new customers.

Summary

A directed labeled graph is a fundamental construct in discrete mathematics, and has applications in all areas of computer science. Most notable uses of directed labeled graphs in AI and databases have taken the form of data graphs, taxonomies and ontologies. Traditionally, such applications have been small and have been created by a top-down design and through manual knowledge engineering.

Distinguishing characteristics of the modern knowledge graphs from the classical knowledge graphs are: scale, bottom-up development and multiple modes of construction. The early semantic networks in AI never reached the size and scale of the knowledge graphs that we see today. Difficulty in coming up with a top-down schema design for data integration and the data-driven nature of machine learning have forced a bottom-up methodology for creating the knowledge graphs. Finally, for creating modern knowledge graphs we are supplementing manual knowledge engineering techniques with significant automation and crowdsourcing.
The confluence of the above trends establishes a new importance for the theory and algorithms for classical knowledge graphs. Even when we create a knowledge graph in a bottom-up manner, the design of its schema and its semantic definition are still important. While automation may speed up some steps for creating a knowledge graph, manual validation and human oversight are still essential. This synergy sets up an exciting uncharted frontier for jointly leveraging classical knowledge graph techniques and modern tools of machine learning, crowdsourcing, and scalable computing.