Tuesday, March 5, 2013
Machine learning methods of parsing strings
Essentially, the data is dirty: it comes from many countries in many languages, is written in different ways, contains misspellings, is missing pieces, has extra junk, and so on.
Right now our approach is to use rules combined with fuzzy gazetteer matching, but we'd like to explore machine learning techniques. We have labeled training data for supervised learning. The question is, what kind of machine learning problem is this? It doesn't seem to be clustering, or classification, or regression.
The closest I can come up with would be classifying each token, but then you really want to classify them all simultaneously, satisfying constraints like "there should be at most one country". And there are many ways to tokenize a string, so you would want to try each one and pick the best. I know there's a thing called statistical parsing, but I don't know anything about it.
So: what machine learning techniques could I explore for parsing addresses?
I am not enough of an expert on your high-level problem to post an answer, but I think the first step in machine learning is building informative features, then choosing a method that suits their structure. You already have a lot of structure: alnum vs non-alnum chars, numeric vs alpha tokens, token counts inside splits, numeric token lengths. Split on a delimiter (a comma, for example) and count the number of tokens in each split (street address vs city/state vs geo-specific info); calc strlen of the numeric tokens (street address vs zip code). These give you features you can cluster on. Muratoa, Aug 28 '12 at 15:05
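The features suggested in the comment above could be computed roughly as follows. This is only a minimal sketch: the function name `address_features`, the choice of a comma as the delimiter, and the sample address are all assumptions made for illustration, not something given in the thread.

```python
def address_features(address, delimiter=","):
    """Compute simple structural features for one address string.

    The delimiter is an assumption; real data may need a different split rule.
    """
    segments = [s.strip() for s in address.split(delimiter)]
    tokens = address.replace(delimiter, " ").split()
    numeric_tokens = [t for t in tokens if t.isdigit()]
    return {
        # How many delimiter-separated pieces the string has.
        "segment_count": len(segments),
        # Token count in each piece (street line vs city vs postal code).
        "tokens_per_segment": [len(s.split()) for s in segments],
        # Length of each all-digit token (house number vs zip code).
        "numeric_token_lengths": [len(t) for t in numeric_tokens],
        # Number of purely alphabetic tokens.
        "alpha_token_count": sum(t.isalpha() for t in tokens),
        # Characters that are neither alphanumeric nor whitespace.
        "non_alnum_char_count": sum(
            not (c.isalnum() or c.isspace()) for c in address
        ),
    }


# Made-up example address, used only to show the shape of the output.
features = address_features("10 Main Street, Springfield, 12345")
```

For the sample string, `numeric_token_lengths` comes out as `[2, 5]`, which is the kind of signal the comment points to for telling a house number from a zip code.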
I am very interested in this type of problem as well, namely parsing a mailing address into its component parts. We want to do this on a mobile device with no presumptions about connectivity to a reverse geocoding service such as Google's. It is OK to assume that we have an onboard source of linked data relating city, state, country and zip. Any help, whether advice or a willingness to engage with a crazy startup team on this problem, is heartily and openly welcome. User17509, Dec 5 '12 at 13:45
This can be treated as a sequence labeling problem, in which you have a sequence of tokens and want to assign a class to each one. You can use hidden Markov models (HMM) or conditional random fields (CRF) to solve the problem. There are good implementations of HMM and CRF in an open-source package called Mallet.
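To make the sequence labeling idea concrete, here is a small pure-Python sketch of an HMM tagger: it estimates probabilities by counting over labeled sequences and decodes with the Viterbi algorithm. It does not use Mallet or reproduce its API. The label set, the coarse token "shapes" used as observations, the add-one smoothing, and the two training addresses are all assumptions invented for illustration.

```python
import math
from collections import Counter, defaultdict

LABELS = ["NUMBER", "STREET", "CITY", "ZIP"]    # hypothetical label set
SHAPES = ["short_num", "long_num", "word"]      # hypothetical observations


def shape(token):
    """Collapse a token to a coarse shape so unseen words can still be scored."""
    if token.isdigit():
        return "long_num" if len(token) >= 5 else "short_num"
    return "word"


def train(sequences):
    """Estimate HMM log-probabilities by counting, with add-one smoothing."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for token, label in seq:
            emit[label][shape(token)] += 1
        for (_, prev), (_, nxt) in zip(seq, seq[1:]):
            trans[prev][nxt] += 1

    def log_probs(counts, outcomes):
        total = sum(counts.values()) + len(outcomes)
        return {o: math.log((counts[o] + 1) / total) for o in outcomes}

    return (
        log_probs(start, LABELS),
        {l: log_probs(trans[l], LABELS) for l in LABELS},
        {l: log_probs(emit[l], SHAPES) for l in LABELS},
    )


def viterbi(tokens, model):
    """Return the most probable label sequence for a non-empty token list."""
    start, trans, emit = model
    shapes = [shape(t) for t in tokens]
    # score[l] is the best log-probability of any label path ending in l.
    score = {l: start[l] + emit[l][shapes[0]] for l in LABELS}
    backpointers = []
    for s in shapes[1:]:
        new_score, pointers = {}, {}
        for l in LABELS:
            best_prev = max(LABELS, key=lambda p: score[p] + trans[p][l])
            new_score[l] = score[best_prev] + trans[best_prev][l] + emit[l][s]
            pointers[l] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label to the start.
    label = max(LABELS, key=lambda l: score[l])
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Two made-up labeled addresses standing in for real training data.
training_data = [
    [("10", "NUMBER"), ("Main", "STREET"), ("Street", "STREET"),
     ("Springfield", "CITY"), ("12345", "ZIP")],
    [("7", "NUMBER"), ("Oak", "STREET"), ("Avenue", "STREET"),
     ("Shelbyville", "CITY"), ("54321", "ZIP")],
]
model = train(training_data)
predicted = viterbi(["42", "Elm", "Road", "Ogdenville", "67890"], model)
```

Because all labels are decoded jointly, the transition probabilities are what separate the street words from the city word here, even though every one of them has the same "word" shape. A CRF follows the same decoding idea but lets each position use many overlapping features, which usually suits messy address data better.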