Knowledge graph completion with PyKEEN and Neo4j
Integrate PyKEEN library with Neo4j for multi-class link prediction using knowledge graph embedding models

Published in Towards Data Science. Author: Tomaz Bratanic.

A couple of weeks ago, I met Francois Vanderseypen, a Graph Data Science consultant. We decided to join forces and start a Graph Machine learning blog series. This blog post will present how to perform knowledge graph completion, which is simply a multi-class link prediction. Instead of just predicting a link, we are also trying to predict its type.

Figure: Knowledge graph completion example. Image by the author.

For knowledge graph completion, the underlying graph should contain multiple types of relationships. Otherwise, if you are dealing with only a single kind of relationship, you can use the standard link prediction techniques that do not consider the relationship type. The example visualization has only a single node type, but in practice, your input graph can consist of multiple node types as well.

We have to use the knowledge graph embedding models for a multi-class link prediction pipeline instead of plain node embedding models. What's the difference, you may ask. While node embedding models embed only nodes, the knowledge graph embedding models embed both nodes and relationships.

Figure: Embedding nodes and relationships via knowledge graph embedding models. Image by author.

The standard syntax to describe the pattern is that the starting node is called head (h), the end or target node is referred to as tail (t), and the relationship is r.

The intuition behind a knowledge graph embedding model such as TransE is that the embedding of the head plus the embedding of the relationship is close to the embedding of the tail if the relationship is present.

Figure: Image by the author.

The predictions are then quite simple. For example, if you want to predict new relationships for a specific node, you just sum the node and the relationship embeddings and check whether any node lies near the embedding sum.

For more detailed information about knowledge graph embedding models, I suggest you check out the following lecture by Jure Leskovec.

Agenda

If you read any of my previous blog posts, you might know that I like to use Neo4j, a native graph database, to store data. You will then use the Neo4j Python driver to fetch the data and transform it into a PyKEEN graph. PyKEEN is a Python library that features knowledge graph embedding models and simplifies multi-class link prediction task executions. Lastly, you will store the predictions back to Neo4j and evaluate the results.

I have prepared a Jupyter notebook that contains all the code in this post.

Prepare the data in Neo4j Desktop

We will be using a subset of the Hetionet dataset. If you want to learn more about the dataset, check out the original paper.

To follow along with this tutorial, I recommend you download the Neo4j Desktop application (neo4j.com).

Once you have installed Neo4j Desktop, you can download the database dump and use it to restore a database instance.

Figure: Restore a database dump in Neo4j Desktop. Image by author.

If you need a bit more help with restoring the dump file, I wrote a blog post about it about a year ago.

If you have successfully restored the database dump, you can open the Neo4j Browser and execute the following command.

    CALL db.schema.visualization()

The procedure should visualize the following graph schema.

Figure: Graph model. Image by author.

Our subset of the Hetionet graph contains genes, compounds, and diseases. There are many relationships between them, and you would probably need to be in the biomedical domain to understand them, so I won't go into details.

In our case, the most important relationship is the treats relationship between compounds and diseases. This blog post will use the knowledge graph embedding models to predict new treats relationships. You could think of this scenario as a drug repurposing task.

PyKEEN

PyKEEN is an incredible, simple-to-use library that can be used for knowledge graph completion tasks. Currently, it features 35 knowledge graph embedding models and even supports out-of-the-box hyper-parameter optimizations. I like it due to its high-level interface, which makes it very easy to construct a PyKEEN graph and train an embedding model. Check out its GitHub repository (github.com, pykeen/pykeen) for more information.

Transform a Neo4j graph to a PyKEEN graph

Now we will move on to the practical part of this post. First, we will transform the Neo4j graph to the PyKEEN graph and split the train-test data. To begin, we have to define the connection to the Neo4j database.

The run_query function executes a Cypher query and returns the output in the form of a Pandas dataframe. The PyKEEN library has a from_labeled_triples method that takes a list of triples as an input and constructs a graph from it.

This example has a generic Cypher query that can be used to fetch any Neo4j dataset and construct a PyKEEN graph from it. Notice that we use the internal Neo4j ids of nodes to build the triples dataframe. For some reason, the PyKEEN library expects the triple elements to be all strings, so we simply cast the internal ids to string. Learn more about how to construct the triples and the available parameters in the official documentation.

Now that we have our PyKEEN graph, we can use the split method to perform the train-test data split.

It couldn't get any easier than this. I must congratulate the PyKEEN authors for developing such a straightforward interface.

Train a knowledge graph embedding model

Now that we have the train-test data available, we can go ahead and train a knowledge graph embedding model. We will use the RotatE model in this example. I am not that familiar with all the variations of the embedding models, but if you want to learn more, I would suggest the lecture by Jure Leskovec I linked above.

We won't perform any hyper-parameter optimization to keep the tutorial simple. I've chosen to use 20 epochs and defined the dimension size to be 512.

P.S. I've later learned that 20 epochs probably isn't enough to get meaningful training on a large, complex graph, especially with such a high dimensionality.

Multi-class link prediction

The PyKEEN library supports multiple methods for multi-class link prediction. You could find the top K predictions in the network, or you can be more specific and define a particular head node and relationship type and evaluate if there are any new connections predicted.

In this example, you will predict new treats relationships for the L-Asparagine compound. Because we used the internal node ids for mapping, we first have to retrieve the node id of L-Asparagine from Neo4j and input it into the prediction method.

Store predictions to Neo4j

For easier evaluation of the results, we will store the top five predictions back to Neo4j.

You can now open the Neo4j Browser and run the following Cypher statement to inspect the results.

    MATCH p=(:Compound)-[:PREDICTED_TREATS]->(d:Disease)
    RETURN p

Results

Figure: Predicted treats relationship between L-Asparagine and top five diseases. Image by the author.

As I am not a medical doctor, I can't say if the predictions make sense or not. In the biomedical domain, link prediction is part of the scientific process of generating hypotheses and not blindly believing the results.

Explaining predictions

As far as I know, the knowledge graph embedding model is not that useful for explaining predictions. On the other hand, you could use the existing connections in the graph to present the information to a medical doctor and let them decide if the predictions make sense or not.

For example, you could investigate direct and indirect paths between L-Asparagine and colon cancer with the following Cypher query.

    MATCH (c:Compound {name: "L-Asparagine"}), (d:Disease {name: "colon cancer"})
    WITH c, d
    MATCH p=AllShortestPaths((c)-[r:binds|regulates|interacts|upregulates|downregulates|associates*1..4]-(d))
    RETURN p LIMIT 25

Results

Figure: Indirect paths between L-Asparagine and colon cancer. Image by the author.

On the left side, we have the colon cancer node, and on the right side there is the L-Asparagine node. In the middle of the visualization there are genes that connect the two nodes.

Out of curiosity, I've googled L-Asparagine in combination with colon cancer and came across this article from 2019: "SOX12 promotes colorectal cancer cell proliferation and metastasis by regulating asparagine…" (www.ncbi.nlm.nih.gov).

While my layman's eyes don't really comprehend if asparagine should be increased or decreased to help with the disease, it at least looks like there is a relation between the two.

Conclusion

Most of the time, you deal with graphs with multiple relationship types. Therefore, knowledge graph embedding models are handy for multi-class link prediction tasks, where you want to predict a new link and its type. For example, there is a big difference if the predicted link type is treats or causes.

The transformation from Neo4j to a PyKEEN graph is generic and will work on any dataset. So I encourage you to try it out and give me some feedback on which use-cases you found interesting.

As always, the code is available on GitHub.

References

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems. Curran Associates, Inc.

Himmelstein, Daniel Scott et al. "Systematic integration of biomedical knowledge prioritizes drugs for repurposing." eLife vol. 6 e26726. 22 Sep. 2017, doi:10.7554/eLife.26726

Ali, M., Berrendorf, M., Hoyt, C., Vermue, L., Galkin, M., Sharifzadeh, S., Fischer, A., Tresp, V., & Lehmann, J. (2020). Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework. arXiv preprint arXiv:2006.13365.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, & Jian Tang. (2019). RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.

Du, Feng et al. "SOX12 promotes colorectal cancer cell proliferation and metastasis by regulating asparagine synthesis." Cell death & disease vol. 10, 3 239. 11 Mar. 2019, doi:10.1038/s41419-019-1481-9

Thanks to Ludovic Benistant
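The TransE intuition described in the article (head embedding plus relationship embedding lands near the tail embedding) can be illustrated with a few lines of plain Python. This is only a toy sketch: the function names, the 2-dimensional vectors, and the disease names are invented for illustration and are not taken from PyKEEN or from the Hetionet data.

```python
import math


def transe_score(head, relation, tail):
    """Negative Euclidean distance between (head + relation) and tail.

    A score closer to zero means the triple is more plausible under TransE.
    """
    return -math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


def rank_tails(head, relation, candidates, k=5):
    """Return the k (name, score) pairs whose embeddings are nearest to head + relation."""
    scored = [
        (name, transe_score(head, relation, embedding))
        for name, embedding in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented purely for illustration.
compound = [1.0, 0.0]
treats = [0.0, 1.0]
diseases = {
    "disease_a": [1.0, 1.0],  # exactly compound + treats
    "disease_b": [3.0, 3.0],  # far away from the sum
    "disease_c": [1.0, 0.5],  # fairly close to the sum
}

top = rank_tails(compound, treats, diseases, k=2)
```

Ranking candidate tails this way mirrors the "sum the node and relationship embeddings and look for nearby nodes" idea; a trained model would supply learned, high-dimensional embeddings instead of these hand-picked vectors.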
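The article trains a RotatE model without going into how it scores a triple. As a rough sketch based on the cited RotatE paper, each relationship acts as an element-wise rotation in the complex plane, and a triple is plausible when the rotated head lands near the tail. The function name, the choice of summing the element-wise moduli as the distance, and the toy vectors below are assumptions for illustration, not PyKEEN's implementation.

```python
import cmath
import math


def rotate_score(head, phases, tail):
    """Negative distance between the rotated head and the tail.

    head, tail: lists of complex numbers (the entity embeddings).
    phases: list of rotation angles in radians, one per dimension, so that
            each relationship element e^(i*theta) has modulus 1.
    """
    return -sum(
        abs(h * cmath.exp(1j * theta) - t)
        for h, theta, t in zip(head, phases, tail)
    )


# Toy 2-dimensional complex embeddings, invented for illustration.
head = [1 + 0j, 0 + 1j]
phases = [math.pi / 2, math.pi]   # rotate by 90 and 180 degrees
matching_tail = [0 + 1j, 0 - 1j]  # exactly the rotated head
other_tail = [1 + 0j, 0 + 1j]     # the unrotated head

good = rotate_score(head, phases, matching_tail)
bad = rotate_score(head, phases, other_tail)
```

Here `good` is (numerically) zero while `bad` is clearly negative, which is the ordering a trained model would aim for between true and false triples.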
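The triple-construction step in the article (internal Neo4j ids cast to strings, then passed to from_labeled_triples) might look roughly like the sketch below. The helper names `rows_to_triples` and `build_triples_factory` and the sample rows are hypothetical; the only PyKEEN call used is `TriplesFactory.from_labeled_triples`, and it is imported lazily so the pure-Python part runs without PyKEEN installed.

```python
def rows_to_triples(rows):
    """Convert (source_id, relationship_type, target_id) rows to string triples.

    PyKEEN expects every element of a labeled triple to be a string,
    so the internal node ids are cast with str().
    """
    return [[str(source), str(rel_type), str(target)] for source, rel_type, target in rows]


def build_triples_factory(triples):
    """Build a PyKEEN TriplesFactory from string triples.

    Requires numpy and pykeen; imported here so that the rest of the
    sketch stays runnable without them.
    """
    import numpy as np
    from pykeen.triples import TriplesFactory

    return TriplesFactory.from_labeled_triples(np.array(triples, dtype=str))


# Hypothetical rows, shaped like the output of a generic query such as
#   MATCH (s)-[r]->(t) RETURN id(s), type(r), id(t)
rows = [(0, "treats", 17), (3, "binds", 42)]
triples = rows_to_triples(rows)
```

With PyKEEN installed, `tf = build_triples_factory(triples)` followed by something like `tf.split([0.8, 0.1, 0.1])` would produce the train-test split the article mentions; the ratios shown are an assumption, not necessarily the ones used in the original notebook.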
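Storing the top five predictions back to Neo4j, as the article describes, could be sketched as follows. The Cypher statement, the parameter names, and the helper functions are assumptions made for illustration; only the PREDICTED_TREATS relationship type and the use of internal node ids come from the article. The sketch also assumes that a higher score means a more plausible prediction.

```python
# Hypothetical Cypher statement: match nodes by internal id and
# create one PREDICTED_TREATS relationship per prediction.
STORE_QUERY = """
MATCH (c) WHERE id(c) = toInteger($compound_id)
UNWIND $predictions AS prediction
MATCH (d) WHERE id(d) = toInteger(prediction.tail_id)
MERGE (c)-[r:PREDICTED_TREATS]->(d)
SET r.score = prediction.score
"""


def top_k_payload(compound_id, scored_tails, k=5):
    """Build query parameters from (tail_id, score) pairs, keeping the k best."""
    best = sorted(scored_tails, key=lambda item: item[1], reverse=True)[:k]
    return {
        "compound_id": str(compound_id),
        "predictions": [
            {"tail_id": str(tail_id), "score": float(score)}
            for tail_id, score in best
        ],
    }


def store_predictions(session, compound_id, scored_tails, k=5):
    """Run the store query on any object with a run(query, parameters) method."""
    params = top_k_payload(compound_id, scored_tails, k)
    session.run(STORE_QUERY, params)
    return params


# Made-up (tail_id, score) pairs standing in for model output.
scored = [("12", 0.1), ("7", 0.9), ("3", 0.5)]
payload = top_k_payload(101, scored, k=2)
```

In practice `session` would come from the Neo4j Python driver, for example `with driver.session() as session: store_predictions(session, compound_id, scored)`, where `driver` is created with `neo4j.GraphDatabase.driver(...)` and your own connection details.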
Related articles
- 1. Build knowledge graph using python - Kaggle
A Knowledge Graph is a set of data points connected by relations that describe a domain, for inst...
- 2. Building Knowledge Graph From Text - Analytics Vidhya
Knowledge Graph – A Powerful Data Science Technique to Mine Information from Text (with Python co...
- 3. GraphGen4Code | A Toolkit for Generating Code Knowledge ...
Knowledge graphs have been proven extremely useful in powering diverse ... applying it to 1.3 mil...
- 4. Knowledge graph completion with PyKEEN and Neo4j
PyKEEN is a Python library that features knowledge graph embedding models and simplifies multi-cl...
- 5. KGCNs: Machine Learning over Knowledge Graphs with ...
... model: the Knowledge Graph Convolutional Network (KGCN), available free to use from the GitHu...