Computer lessons

Basic use of SimpleXML. Parsing XML Example #6 Comparing elements and attributes with text

Stage 1. Passing testing (interaction with the GIS GMP test environment)
# GIS GMP test service address:
gisgmp.wsdlLocation=http://213.59.255.182:7777/gateway/services/SID0003663?wsdl
gisgmp.wsdlLocation.endPoint=http://213.59.255.182:7777/gateway/services/SID0003663
This address is entered in the SP settings. You also need to add it to the logging settings file with the level TRACE. After entering these values, start the SP and the ACC client (or restart them if they are already running). Next, from the ROR or the Accounting Office/AU Application for the payment of funds, perform the action “Create Payment Information”. If the system checks pass, the Payment Information is created. It will later need to be uploaded.
After uploading, check the status using the “Request processing status” action. The ED Payment Information then switches to the status “Accepted by GIS GMP” -…

Given: MSG (messages) table with many entries.
CREATE TABLE msg (id INTEGER NOT NULL PRIMARY KEY, description CHAR(50) NOT NULL, date_create DATE);
Task:
The table must be cleared of data.
Solution: There are several ways to solve this problem. Below is a description and an example of each of them.
The easiest way (first option) is to execute the record deletion statement. When you execute it, you see the result (how many records were deleted). This is handy when you need to know for sure whether the correct data has been deleted, BUT it has disadvantages compared to the other options.

DELETE FROM msg; -- Deletes all rows in the table
-- Deletes all rows with creation date "2019.02.01"
DELETE FROM msg WHERE date_create = '2019.02.01';
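The point about seeing how many records were deleted can be tried out with a small sketch. It is written in Python with the built-in sqlite3 module purely for illustration; SQLite is an assumption here (the article itself talks about Firebird), and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE msg (id INTEGER NOT NULL PRIMARY KEY, "
    "description CHAR(50) NOT NULL, date_create DATE)"
)
# Hypothetical sample rows, only there so that something can be deleted.
conn.executemany(
    "INSERT INTO msg VALUES (?, ?, ?)",
    [(1, "first", "2019-02-01"), (2, "second", "2019-02-01"), (3, "third", "2019-03-15")],
)

# Delete only the rows created on one date; rowcount reports how many were removed.
deleted_by_date = conn.execute(
    "DELETE FROM msg WHERE date_create = ?", ("2019-02-01",)
).rowcount

# Delete everything that is left, then confirm the table is empty.
conn.execute("DELETE FROM msg")
remaining = conn.execute("SELECT COUNT(*) FROM msg").fetchone()[0]

print("deleted by date:", deleted_by_date, "remaining:", remaining)
```

The count comes back from the DELETE itself, which is what makes this option convenient for checking that the right rows were removed.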

Second option. Using a DDL statement to clear all rows in a table.
TRUNCATE TABLE msg;
There are several features of using this operator:
It is not available in Firebird, so we use the first and third options. After completion…

Current addresses for requests to SMEV 3.0
We remind you that, in accordance with previously published information on the SMEV 3.0 Technology Portal, it is necessary to use the current addresses for the Unified Electronic Service:
the address of the unified electronic service of the SMEV 3.0 development environment, corresponding to scheme 1.1 - http://smev3-d.test.gosuslugi.ru:7500/smev/v1.1/ws?wsdl, and the service will also be available at

Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. XML is a popular format for exchanging data on the Internet. Sites that frequently update their content, such as news sites or blogs, often provide an XML feed so that external programs are aware of content changes. Sending and parsing XML data is a common task for network-connected applications. This lesson explains how to parse XML documents and use their data.

Choosing a Parser

Analyzing the Feed

The first step in parsing a feed is to decide which data fields you are interested in. The parser extracts the given fields and ignores everything else.

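Here is a snippet of the feed that will be explored in the example application. Every post on StackOverflow.com appears in the feed as an entry tag, which contains several subtags:

<feed ...>
<title>newest questions tagged android - Stack Overflow</title>
...
    <entry>
    ...
    </entry>
    <entry>
        <id>http://stackoverflow.com/q/9439999</id>
        <re:rank ...>0</re:rank>
        <title>Where is my data file?</title>
        <author>
            <name>cliff2310</name>
            <uri>http://stackoverflow.com/users/1128925</uri>
        </author>
        <link rel="alternate" href="..." />
        <published>2012-02-25T00:30:54Z</published>
        <updated>2012-02-25T00:30:54Z</updated>
        <summary>
            <p>I have an Application that requires a data file...</p>
        </summary>
    </entry>
    <entry>
    ...
    </entry>
...
</feed>

The sample application retrieves data from the entry tag and its subtags title, link, and summary.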

Creating a parser instance

The next step is to instantiate the parser and start the parsing process. This snippet initializes the parser to not handle namespaces and to use the provided InputStream as input. The parsing process starts with a call to nextTag() and calls the readFeed() method, which retrieves and processes the data that the application is interested in:

public class StackOverflowXmlParser {
    // We don't use namespaces
    private static final String ns = null;

    public List<Entry> parse(InputStream in) throws XmlPullParserException, IOException {
        try {
            XmlPullParser parser = Xml.newPullParser();
            parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
            parser.setInput(in, null);
            parser.nextTag();
            return readFeed(parser);
        } finally {
            in.close();
        }
    }
    ...
}

Reading the feed

The readFeed() method does the actual work of processing the feed. Elements marked with the "entry" tag are the starting point for recursive processing of the feed. If the next tag is not an entry tag, it is skipped. After the entire feed has been recursively processed, readFeed() returns a List containing the entries (including nested data items) that were retrieved from the feed. This List is then returned by the parser.

private List<Entry> readFeed(XmlPullParser parser) throws XmlPullParserException, IOException {
    List<Entry> entries = new ArrayList<Entry>();

    parser.require(XmlPullParser.START_TAG, ns, "feed");
    while (parser.next() != XmlPullParser.END_TAG) {
        if (parser.getEventType() != XmlPullParser.START_TAG) {
            continue;
        }
        String name = parser.getName();
        // Starts by looking for the entry tag
        if (name.equals("entry")) {
            entries.add(readEntry(parser));
        } else {
            skip(parser);
        }
    }
    return entries;
}

XML parsing

The steps to parse the XML feed are as follows:

This snippet shows how the parser parses entry, title, link, and summary.

public static class Entry {
    public final String title;
    public final String link;
    public final String summary;

    private Entry(String title, String summary, String link) {
        this.title = title;
        this.summary = summary;
        this.link = link;
    }
}

// Parses the contents of an entry. If it encounters a title, summary, or link tag, hands them off
// to their respective "read" methods for processing. Otherwise, skips the tag.
private Entry readEntry(XmlPullParser parser) throws XmlPullParserException, IOException {
    parser.require(XmlPullParser.START_TAG, ns, "entry");
    String title = null;
    String summary = null;
    String link = null;
    while (parser.next() != XmlPullParser.END_TAG) {
        if (parser.getEventType() != XmlPullParser.START_TAG) {
            continue;
        }
        String name = parser.getName();
        if (name.equals("title")) {
            title = readTitle(parser);
        } else if (name.equals("summary")) {
            summary = readSummary(parser);
        } else if (name.equals("link")) {
            link = readLink(parser);
        } else {
            skip(parser);
        }
    }
    return new Entry(title, summary, link);
}

// Processes title tags in the feed.
private String readTitle(XmlPullParser parser) throws IOException, XmlPullParserException {
    parser.require(XmlPullParser.START_TAG, ns, "title");
    String title = readText(parser);
    parser.require(XmlPullParser.END_TAG, ns, "title");
    return title;
}

// Processes link tags in the feed.
private String readLink(XmlPullParser parser) throws IOException, XmlPullParserException {
    String link = "";
    parser.require(XmlPullParser.START_TAG, ns, "link");
    String tag = parser.getName();
    String relType = parser.getAttributeValue(null, "rel");
    if (tag.equals("link")) {
        // Comparing from the constant side avoids a NullPointerException when rel is absent.
        if ("alternate".equals(relType)) {
            link = parser.getAttributeValue(null, "href");
            parser.nextTag();
        }
    }
    parser.require(XmlPullParser.END_TAG, ns, "link");
    return link;
}

// Processes summary tags in the feed.
private String readSummary(XmlPullParser parser) throws IOException, XmlPullParserException {
    parser.require(XmlPullParser.START_TAG, ns, "summary");
    String summary = readText(parser);
    parser.require(XmlPullParser.END_TAG, ns, "summary");
    return summary;
}

// For the tags title and summary, extracts their text values.
private String readText(XmlPullParser parser) throws IOException, XmlPullParserException {
    String result = "";
    if (parser.next() == XmlPullParser.TEXT) {
        result = parser.getText();
        parser.nextTag();
    }
    return result;
}
...
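The same dispatch idea (pick out title, summary and link, ignore every other tag) can be sketched outside Android. The sketch below is in Python and uses the tree-based xml.etree.ElementTree rather than a pull parser, so it is only an analogy to the code above; the sample feed, its href value and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    title: Optional[str]
    summary: Optional[str]
    link: Optional[str]


# Hypothetical, namespace-free sample modelled on the feed snippet.
SAMPLE_FEED = """<feed>
  <title>newest questions tagged android</title>
  <entry>
    <id>http://stackoverflow.com/q/9439999</id>
    <title>Where is my data file?</title>
    <author><name>cliff2310</name></author>
    <link rel="alternate" href="http://example.com/question" />
    <summary>I have an Application that requires a data file...</summary>
  </entry>
</feed>"""


def read_entry(entry_el: ET.Element) -> Entry:
    """Keep title, summary and the alternate link; ignore all other child tags."""
    title = summary = link = None
    for child in entry_el:
        if child.tag == "title":
            title = child.text
        elif child.tag == "summary":
            summary = child.text
        elif child.tag == "link" and child.get("rel") == "alternate":
            link = child.get("href")
        # id, author and anything else are simply not looked at
    return Entry(title, summary, link)


def read_feed(xml_text: str) -> List[Entry]:
    root = ET.fromstring(xml_text)
    if root.tag != "feed":
        raise ValueError("expected a <feed> root element")
    return [read_entry(el) for el in root if el.tag == "entry"]


entries = read_feed(SAMPLE_FEED)
for entry in entries:
    print(entry.title, entry.link)
```

Because the whole tree is already in memory here, no explicit skip step is needed; with a pull parser, unwanted tags have to be consumed explicitly.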

Skipping items you don't need

In one of the XML parsing steps described above, the parser skips tags that we are not interested in. Below is the parser code for the skip() method:

private void skip(XmlPullParser parser) throws XmlPullParserException, IOException {
    if (parser.getEventType() != XmlPullParser.START_TAG) {
        throw new IllegalStateException();
    }
    int depth = 1;
    while (depth != 0) {
        switch (parser.next()) {
        case XmlPullParser.END_TAG:
            depth--;
            break;
        case XmlPullParser.START_TAG:
            depth++;
            break;
        }
    }
}

Here's how it works:

  • The method throws an exception if the current event is not a START_TAG.
  • It consumes the START_TAG and all events up to and including the matching END_TAG.
  • To make sure it stops at the correct END_TAG and not at the first tag it encounters after the original START_TAG, it keeps track of the nesting depth.

Thus, if the current element has nested elements, the value of depth will not be 0 until the parser has consumed all events between the original START_TAG and its matching END_TAG. For example, consider how the parser skips the <author> element, which has 2 nested elements, <name> and <uri>:

  • On the first pass through the while loop, the next tag the parser encounters after <author> is the START_TAG for <name>. The depth value is increased to 2.
  • On the second pass through the while loop, the next tag the parser encounters is the END_TAG </name>. The depth value is reduced to 1.
  • On the third pass through the while loop, the next tag the parser encounters is the START_TAG <uri>. The depth value is increased to 2.
  • On the fourth pass through the while loop, the next tag the parser encounters is the END_TAG </uri>. The depth value is reduced to 1.
  • On the fifth and final pass through the while loop, the next tag the parser encounters is the END_TAG </author>. The depth value is reduced to 0, indicating that the element was successfully skipped.
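The depth counting described above can be traced with a minimal sketch. It is written in Python with xml.etree.ElementTree.XMLPullParser, which is only a stand-in for Android's XmlPullParser; the sample markup and the step counter are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET


def skip(events):
    """Consume events until the element whose start was just read is closed.

    Returns the number of events consumed, purely so the walk can be checked.
    """
    depth = 1
    steps = 0
    while depth != 0:
        event, _elem = next(events)
        if event == "start":
            depth += 1
        elif event == "end":
            depth -= 1
        steps += 1
    return steps


parser = ET.XMLPullParser(events=("start", "end"))
parser.feed(
    "<entry><author><name>cliff2310</name><uri>u</uri></author>"
    "<title>t</title></entry>"
)
parser.close()
events = iter(parser.read_events())

seen = []            # start tags that were not skipped
skipped_steps = None
for event, elem in events:
    if event == "start" and elem.tag == "author":
        skipped_steps = skip(events)   # shares the same iterator as the loop
    elif event == "start":
        seen.append(elem.tag)

print(seen, skipped_steps)
```

Skipping <author> takes five events (start and end of name, start and end of uri, end of author), which matches the five passes listed above.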

XML Data Processing

The sample application receives and parses an XML feed in an AsyncTask. Processing occurs outside of the main UI thread. When processing is complete, the application updates the user interface in the main activity (NetworkActivity).

In the snippet below, the loadPage() method does the following:

  • Initializes a string variable with a URL pointing to an XML feed.
  • If the user settings and network connection allow, calls new DownloadXmlTask().execute(url). This creates a new DownloadXmlTask object (an AsyncTask subclass) and runs its execute() method, which downloads and parses the feed and returns a string result that will be displayed in the UI.

public class NetworkActivity extends Activity {
    public static final String WIFI = "Wi-Fi";
    public static final String ANY = "Any";
    private static final String URL = "http://stackoverflow.com/feeds/tag?tagnames=android&sort=newest";

    // Whether there is a Wi-Fi connection.
    private static boolean wifiConnected = false;
    // Whether there is a mobile connection.
    private static boolean mobileConnected = false;
    // Whether the display should be refreshed.
    public static boolean refreshDisplay = true;
    public static String sPref = null;

    ...

    // Uses AsyncTask to download the XML feed from stackoverflow.com.
    public void loadPage() {
        if ((sPref.equals(ANY)) && (wifiConnected || mobileConnected)) {
            new DownloadXmlTask().execute(URL);
        } else if ((sPref.equals(WIFI)) && (wifiConnected)) {
            new DownloadXmlTask().execute(URL);
        } else {
            // show error
        }
    }

The AsyncTask subclass DownloadXmlTask implements the following AsyncTask methods:

  • doInBackground() executes the loadXmlFromNetwork() method. It passes the feed URL as a parameter. The loadXmlFromNetwork() method fetches and processes the feed. When it finishes, it passes back the resulting string.
  • onPostExecute() takes the returned string and displays it in the UI.

// Implementation of AsyncTask used to download XML feed from stackoverflow.com.
private class DownloadXmlTask extends AsyncTask<String, Void, String> {
    @Override
    protected String doInBackground(String... urls) {
        try {
            return loadXmlFromNetwork(urls[0]);
        } catch (IOException e) {
            return getResources().getString(R.string.connection_error);
        } catch (XmlPullParserException e) {
            return getResources().getString(R.string.xml_error);
        }
    }

    @Override
    protected void onPostExecute(String result) {
        setContentView(R.layout.main);
        // Displays the HTML string in the UI via a WebView
        WebView myWebView = (WebView) findViewById(R.id.webview);
        myWebView.loadData(result, "text/html", null);
    }
}

Below is the loadXmlFromNetwork() method which is called from DownloadXmlTask. It does the following:

  1. Creates an instance of StackOverflowXmlParser. It also creates variables for a List of Entry objects, and for title, url, and summary, to store the values extracted from the XML feed for these fields.
  2. Calls downloadUrl(), which downloads the feed and returns it as an InputStream.
  3. Uses StackOverflowXmlParser to parse the InputStream. StackOverflowXmlParser populates the List of entries with data from the feed.
  4. Processes the entries List and combines the feed data with HTML markup.
  5. Returns an HTML string that is displayed in the main activity UI by the AsyncTask method onPostExecute().
// Downloads XML from stackoverflow.com, parses it, and combines it with
// HTML markup. Returns HTML string.
private String loadXmlFromNetwork(String urlString) throws XmlPullParserException, IOException {
    InputStream stream = null;
    // Instantiate the parser
    StackOverflowXmlParser stackOverflowXmlParser = new StackOverflowXmlParser();
    List<Entry> entries = null;
    String title = null;
    String url = null;
    String summary = null;
    Calendar rightNow = Calendar.getInstance();
    DateFormat formatter = new SimpleDateFormat("MMM dd h:mmaa");

    // Checks whether the user set the preference to include summary text
    SharedPreferences sharedPrefs = PreferenceManager.getDefaultSharedPreferences(this);
    boolean pref = sharedPrefs.getBoolean("summaryPref", false);

    StringBuilder htmlString = new StringBuilder();
    htmlString.append("<h3>" + getResources().getString(R.string.page_title) + "</h3>");
    htmlString.append("<em>" + getResources().getString(R.string.updated) + " " +
            formatter.format(rightNow.getTime()) + "</em>");

    try {
        stream = downloadUrl(urlString);
        entries = stackOverflowXmlParser.parse(stream);
    // Makes sure that the InputStream is closed after the app is
    // finished using it.
    } finally {
        if (stream != null) {
            stream.close();
        }
    }

    // StackOverflowXmlParser returns a List (called "entries") of Entry objects.
    // Each Entry object represents a single post in the XML feed.
    // This section processes the entries list to combine each entry with HTML markup.
    // Each entry is displayed in the UI as a link that optionally includes
    // a text summary.
    for (Entry entry : entries) {
        htmlString.append("<p><a href='");
        htmlString.append(entry.link);
        htmlString.append("'>" + entry.title + "</a></p>");
        // If the user set the preference to include summary text,
        // adds it to the display.
        if (pref) {
            htmlString.append(entry.summary);
        }
    }
    return htmlString.toString();
}

// Given a string representation of a URL, sets up a connection and gets
// an input stream.
private InputStream downloadUrl(String urlString) throws IOException {
    URL url = new URL(urlString);
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setReadTimeout(10000 /* milliseconds */);
    conn.setConnectTimeout(15000 /* milliseconds */);
    conn.setRequestMethod("GET");
    conn.setDoInput(true);
    // Starts the query
    conn.connect();
    return conn.getInputStream();
}
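The step that combines parsed entries with HTML markup can be looked at on its own. The following is a minimal Python sketch of that step only, with no networking; the function name build_html, the tuple layout of an entry and the escaping of titles and links are assumptions for illustration, not part of the Android sample.

```python
from html import escape


def build_html(page_title, entries, include_summary):
    """Combine (title, link, summary) tuples with HTML markup.

    Titles and links are escaped; the summary is appended as-is because,
    in this sketch, it is assumed to already contain HTML.
    """
    parts = ["<h3>" + escape(page_title) + "</h3>"]
    for title, link, summary in entries:
        parts.append(
            "<p><a href='" + escape(link, quote=True) + "'>" + escape(title) + "</a></p>"
        )
        if include_summary:
            parts.append(summary)
    return "".join(parts)


sample_entries = [("Where is my data file?", "http://example.com/q", "<p>I have an Application...</p>")]
without_summary = build_html("Android questions", sample_entries, False)
with_summary = build_html("Android questions", sample_entries, True)
print(with_summary)
```

Keeping the markup step separate from downloading and parsing makes it easy to check that the summary only appears when the preference is switched on.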

XML parsing essentially means walking through an XML document and returning the corresponding data. While an increasing number of web services return data in JSON format, most still use XML, so it's important to master XML parsing if you want to use the full range of available APIs.

With the SimpleXML extension, which was added back in PHP 5.0, working with XML in PHP is simple. In this article I will show you how to do it.

Basics of use

Let's start with the following example languages.xml:


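<?xml version="1.0" encoding="utf-8"?>
<languages>
 <lang name="C">
  <appeared>1972</appeared>
  <creator>Dennis Ritchie</creator>
 </lang>
 <lang name="PHP">
  <appeared>1995</appeared>
  <creator>Rasmus Lerdorf</creator>
 </lang>
 <lang name="Java">
  <appeared>1995</appeared>
  <creator>James Gosling</creator>
 </lang>
</languages>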

This XML document contains a list of programming languages with some information about each language: the year it was introduced and the name of its creator.

The first step is to load the XML using either simplexml_load_file() or simplexml_load_string(). As the names suggest, the first loads XML from a file and the second loads XML from a string.

$languages = simplexml_load_file("languages.xml");

Both functions read the entire DOM tree into memory and return a SimpleXMLElement object. In the above example, the object is stored in the $languages variable. You can use var_dump() or print_r() to get details about the returned object if you want.

SimpleXMLElement Object
(
    [lang] => Array
        (
            [0] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [name] => C
                        )
                    [appeared] => 1972
                    [creator] => Dennis Ritchie
                )
            [1] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [name] => PHP
                        )
                    [appeared] => 1995
                    [creator] => Rasmus Lerdorf
                )
            [2] => SimpleXMLElement Object
                (
                    [@attributes] => Array
                        (
                            [name] => Java
                        )
                    [appeared] => 1995
                    [creator] => James Gosling
                )
        )
)

This XML contains a root element languages, inside which there are three lang elements. Each element of the lang array corresponds to a lang element in the XML document.

You can access the properties of an object using the -> operator. For example, $languages->lang returns a SimpleXMLElement object that corresponds to the first lang element. This object contains two properties: appeared and creator.

$languages->lang[0]->appeared;
$languages->lang[0]->creator;

Displaying a list of languages and showing their properties can be done very easily using a standard loop such as foreach.

foreach ($languages->lang as $lang) {
    printf(
        "<p>%s appeared in %d and was created by %s.</p>",
        $lang["name"],
        $lang->appeared,
        $lang->creator
    );
}

Notice how I accessed the name attribute of the lang element to get the language name. This way you can access any attribute of an element represented as a SimpleXMLElement object.
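For comparison, the same walk over the language list (child elements for appeared and creator, an attribute for name) can be sketched in Python with xml.etree.ElementTree. This is an illustration in a different language, not a SimpleXML feature; the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# The languages document from the example, inlined so the sketch is self-contained.
LANGUAGES_XML = """<languages>
 <lang name="C"><appeared>1972</appeared><creator>Dennis Ritchie</creator></lang>
 <lang name="PHP"><appeared>1995</appeared><creator>Rasmus Lerdorf</creator></lang>
 <lang name="Java"><appeared>1995</appeared><creator>James Gosling</creator></lang>
</languages>"""

root = ET.fromstring(LANGUAGES_XML)

lines = [
    "%s appeared in %d and was created by %s." % (
        lang.get("name"),                 # attribute access
        int(lang.findtext("appeared")),   # text of a child element
        lang.findtext("creator"),
    )
    for lang in root.findall("lang")
]

for line in lines:
    print(line)
```

As with SimpleXML, attributes and child elements are reached in different ways: get() for the former, find()/findtext() for the latter.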

Working with Namespaces

While working with XML of various web services, you will come across element namespaces more than once. Let's change our languages.xml to show an example of using a namespace:



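<?xml version="1.0" encoding="utf-8"?>
<languages
 xmlns:dc="http://purl.org/dc/elements/1.1/">
 <lang name="C">
  <appeared>1972</appeared>
  <dc:creator>Dennis Ritchie</dc:creator>
 </lang>
 <lang name="PHP">
  <appeared>1995</appeared>
  <dc:creator>Rasmus Lerdorf</dc:creator>
 </lang>
 <lang name="Java">
  <appeared>1995</appeared>
  <dc:creator>James Gosling</dc:creator>
 </lang>
</languages>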

Now the creator element is in the dc namespace, which points to http://purl.org/dc/elements/1.1/. If you try to print the language creators using our previous code, it will not work. To read namespaced elements you need to use one of the following approaches.

The first approach is to use the namespace URI directly in the code when accessing the namespaced element. The following example shows how this is done:

$dc = $languages->lang[1]->children("http://purl.org/dc/elements/1.1/");
echo $dc->creator;

The children() method takes a namespace and returns the child elements that belong to it. It takes two arguments: the first is the XML namespace, and the second is an optional argument that defaults to false. If the second argument is true, the first argument is treated as a prefix. If it is false, the first argument is treated as a namespace URI.

The second approach is to read the namespace URIs from the document and use them when accessing the namespaced element. This is actually a better way to access elements because you don't have to hardcode the URI.

$namespaces = $languages->getNamespaces(true);
$dc = $languages->lang[1]->children($namespaces["dc"]);

echo $dc->creator;

The getNamespaces() method returns an array of prefixes and their associated URIs. It accepts an optional parameter that defaults to false. If you set it to true, the method returns the namespaces used in the parent and child nodes. Otherwise, it returns only the namespaces used in the root node.

Now you can iterate through the list of languages like this:

$languages = simplexml_load_file("languages.xml");
$ns = $languages->getNamespaces(true);

foreach ($languages->lang as $lang) {
    $dc = $lang->children($ns["dc"]);
    printf(
        "<p>%s appeared in %d and was created by %s.</p>",
        $lang["name"],
        $lang->appeared,
        $dc->creator
    );
}
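The second approach (read the prefix-to-URI map from the document, then use it for lookups) can be sketched in Python as well. The sketch uses xml.etree.ElementTree; collecting the map through iterparse with the start-ns event is one possible way to do it and is an assumption of this example, not something SimpleXML does internally.

```python
import io
import xml.etree.ElementTree as ET

LANGUAGES_XML = b"""<languages xmlns:dc="http://purl.org/dc/elements/1.1/">
 <lang name="C"><appeared>1972</appeared><dc:creator>Dennis Ritchie</dc:creator></lang>
 <lang name="PHP"><appeared>1995</appeared><dc:creator>Rasmus Lerdorf</dc:creator></lang>
 <lang name="Java"><appeared>1995</appeared><dc:creator>James Gosling</dc:creator></lang>
</languages>"""

# Collect prefix -> URI pairs declared anywhere in the document.
namespaces = {
    prefix: uri
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(LANGUAGES_XML), events=("start-ns",))
}

root = ET.fromstring(LANGUAGES_XML)

# A lookup without the namespace finds nothing, just like the earlier PHP code.
unqualified = root.find("lang").findtext("creator")

# With the prefix map, the namespaced element is found.
creators = [lang.findtext("dc:creator", namespaces=namespaces) for lang in root.findall("lang")]

print(namespaces, unqualified, creators)
```

The URI never appears in the lookup code itself, so a feed that keeps the dc prefix but changes nothing else continues to work.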

Practical example - Parsing a video channel from YouTube

Let's look at an example that gets an RSS feed from a YouTube channel and displays links to all the videos from it. To do this, request the following address:

http://gdata.youtube.com/feeds/api/users/xxx/uploads

The URL returns a list of the latest videos from a given channel in XML format. We will parse the XML and get the following information for each video:

  • Link to the video
  • Thumbnail
  • Title

We'll start by fetching and loading the XML:

$channel = "Channel_name";
$url = "http://gdata.youtube.com/feeds/api/users/" . $channel . "/uploads";
$xml = file_get_contents($url);

$feed = simplexml_load_string($xml);
$ns = $feed->getNamespaces(true);

If you look at the XML feed, you can see that it contains several entry elements, each of which stores detailed information about a specific video from the channel. But we only use the thumbnail image, the video URL and the title. These three elements are descendants of the group element, which, in turn, is a child of entry:

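<entry>
 …
 <media:group>
  …
  <media:player url="…"/>
  <media:thumbnail url="…"/>
  <media:title>Title…</media:title>
  …
 </media:group>
 …
</entry>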

We'll simply go through all the entry elements and extract the necessary information from each of them. Note that player, thumbnail and title are in the media namespace. Thus, we must proceed as in the previous example: we get the namespaces from the document and use the namespace when accessing the elements.

foreach ($feed->entry as $entry) {
    $group = $entry->children($ns["media"]);
    $group = $group->group;
    $thumbnail_attrs = $group->thumbnail[1]->attributes();
    $image = $thumbnail_attrs["url"];
    $player = $group->player->attributes();
    $link = $player["url"];
    $title = $group->title;
    printf('<p><a href="%s"><img src="%s" alt="%s"></a></p>',
        $link, $image, $title);
}

Conclusion

Now that you know how to use SimpleXML for parsing XML data, you can improve your skills by parsing different XML feeds from different APIs. But keep in mind that SimpleXML reads the entire DOM into memory, so if you're parsing a large data set, you may run out of memory. To learn more about SimpleXML, read the documentation.



In this article I will show an example of how to parse a large XML file. If your server (hosting) does not prohibit increasing the running time of the script, then you can parse an XML file of a gigabyte or more; I personally have only parsed files from Ozon weighing 450 megabytes.

When parsing large XML files, two problems arise:
1. Not enough memory.
2. Not enough time allocated for the script to run.

The second problem, time, can be solved if the server does not prohibit it.
The memory problem is harder to solve. Even on your own server, pushing 500-megabyte files through memory is not easy, and on shared hosting or a VDS it is usually simply not possible to increase the memory limit.

PHP has several built-in XML processing options: SimpleXML, DOM, SAX.
All of these options are described in detail in many articles with examples, but all the examples demonstrate working with a complete XML document.

Here is one example, getting an object from an XML file:

$xml = simplexml_load_file("1.xml");

Now you can process this object, BUT...
As you can see, the entire XML file is read into memory, and then everything is parsed into an object.
That is, all the data goes into memory, and if there is not enough allocated memory, the script stops.

This option is not suitable for processing large files; you need to read the file piece by piece and process the data as it arrives.
In this case, the validity check is also carried out as the data is processed, so you need to be able to roll back (for example, delete all data already entered into the database if the XML file turns out to be invalid), or make two passes through the file: first read it to check validity, then read it again to process the data.
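The two-pass idea can be sketched as a first pass that only checks that the document parses, reading it in chunks so that memory use stays flat. The sketch is in Python with the built-in xml.parsers.expat module; the function name is_well_formed and the chunk size are assumptions, and the check covers well-formedness only, not validation against a schema.

```python
import io
import xml.parsers.expat


def is_well_formed(stream, chunk_size=64 * 1024):
    """First pass: feed the stream to the parser chunk by chunk, keep nothing.

    Returns True if the whole document parses, False on the first error.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                parser.Parse(b"", True)   # tell the parser the input is complete
                return True
            parser.Parse(chunk, False)
    except xml.parsers.expat.ExpatError:
        return False


good = is_well_formed(io.BytesIO(b"<recipe><title>Simple bread</title></recipe>"))
bad = is_well_formed(io.BytesIO(b"<recipe><title>Simple bread</recipe>"))
print(good, bad)
```

Only when the first pass succeeds would the second pass, the one that writes to the database, be started, which avoids the need for a rollback.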

Here is a theoretical example of parsing a large XML file.
This script reads one character at a time from a file, collects the data into blocks and sends them to the XML parser.
This approach completely solves the memory problem and does not cause a load, but it aggravates the time problem. Read below for how to try to solve the time problem.

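<?php
function webi_xml($file)
{

    ####################################################
    ### data function
    function data($parser, $data)
    {
        print $data;
    }
    ############################################

    ####################################################
    ### opening tag function
    function startElement($parser, $name, $attrs)
    {
        print $name;
        print_r($attrs);
    }
    ############################################

    ####################################################
    ## closing tag function
    function endElement($parser, $name)
    {
        print $name;
    }
    ############################################

    $xml_parser = xml_parser_create();
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

    // indicate which functions will work when opening and closing tags
    xml_set_element_handler($xml_parser, "startElement", "endElement");

    // specify a function for working with data
    xml_set_character_data_handler($xml_parser, "data");

    // open the file
    $fp = fopen($file, "r");

    $perviy_vxod = 1; // flag to check the first entry into the file
    $data = ""; // data collected from the file in parts for the xml parser

    // loop until the end of the file is found
    while (!feof($fp) and $fp)
    {
        $simvol = fgetc($fp); // read one character from the file
        $data .= $simvol; // add this character to the data to be sent

        // keep collecting until the end of a tag is found
        if ($simvol != ">") { continue; }

        // on the first entry, drop everything that comes before the <?xml tag
        if ($perviy_vxod) { $data = strstr($data, "<?xml"); $perviy_vxod = 0; }

        // now pass the data to the xml parser
        if (!xml_parse($xml_parser, $data, feof($fp))) {
            // as soon as an error is encountered, parsing stops
            echo "<br>XML Error: " . xml_error_string(xml_get_error_code($xml_parser));
            echo " at line " . xml_get_current_line_number($xml_parser);
            break;
        }

        // after parsing, discard the collected data for the next step of the loop
        $data = "";
    }
    fclose($fp);
    xml_parser_free($xml_parser);
}

webi_xml("1.xml");

?>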

In this example, I put everything into one function, webi_xml(), and at the very bottom you can see its call.
The script itself consists of three main functions:
1. A function that catches an opening tag, startElement()
2. A function that catches a closing tag, endElement()
3. And the function that receives the data, data().

Let's assume that the contents of the file 1.xml is a recipe

<?xml version="1.0"?>
<recipe>
<title>Simple bread</title>
<ingredient amount="3" unit="cup">Flour</ingredient>
<ingredient amount="0.25" unit="gram">Yeast</ingredient>
<ingredient amount="1.5" unit="cup">Warm water</ingredient>
<ingredient amount="1" unit="teaspoon">Salt</ingredient>
<instructions>
<step>Mix all ingredients and knead thoroughly.</step>
<step>Cover with a cloth and leave for one hour in a warm room.</step>
<step>Knead again, place on a baking sheet and put in the oven.</step>
<step>Visit the site</step>
</instructions>
</recipe>


Everything starts with a call to the general function webi_xml("1.xml");
Inside this function the parser is created, and all tag names are converted to upper case so that all tags have the same case.

$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

Now we indicate which functions will work to catch the opening of a tag, closing and processing data

xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "data");

Next, the specified file is opened and read one character at a time; each character is appended to a string variable until the character > is found.
If this is the very first read from the file, everything unnecessary at the beginning of the file is removed along the way, that is, everything that comes before <?xml, the tag with which XML should begin.
The first time, the string variable will contain the XML declaration, the line that begins with <?xml.

It is sent to the parser:
xml_parse($xml_parser, $data, feof($fp));
After the data is processed, the string variable is reset and the collection of data into the string begins again. The second time, the string will contain the opening tag of the root element, the third time
<title>
and the fourth time
Simple bread</title>

Please note that the string variable always ends on a completed tag (the character >), and it is not necessary to send the parser an opening and a closing tag together with the data, for example
<title>Simple bread</title>
What matters in this scheme is that the handler receives whole, unbroken tags. It can receive an opening tag in one step and the closing tag in the next, or 1000 lines of the file at once; the main thing is that no tag is torn, for example

<tit
le>Simple bread
Data should not be sent to the handler this way, because the tag is torn.
You can come up with your own method of sending data to the handler, for example, collect 1 megabyte of data and send it to the handler to increase speed. Just make sure that the tags are always complete; the data itself may be torn:
<title>Simple
bread</title>

Thus you can send a large file to the parser in whatever parts you like.
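The chunking scheme described above (collect characters until a >, hand the piece to the parser, start again) can be sketched in Python with the built-in xml.parsers.expat module, which exposes a similar incremental Parse(data, is_final) call. The function name parse_in_tag_chunks, the recorded event tuples and the upper-casing of names (imitating case folding) are assumptions for illustration; the removal of garbage before the XML declaration is left out.

```python
import io
import xml.parsers.expat


def parse_in_tag_chunks(stream):
    """Read one character at a time and send the text to the parser
    each time a '>' completes a tag. Returns the chunks and the events."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name.upper(), attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name.upper()))
    parser.CharacterDataHandler = lambda data: events.append(("data", data))

    chunks = []
    buf = ""
    while True:
        ch = stream.read(1)
        if ch == "":            # end of file
            break
        buf += ch
        if ch != ">":           # keep collecting until a tag is complete
            continue
        chunks.append(buf)
        parser.Parse(buf, False)
        buf = ""
    parser.Parse(buf, True)     # whatever is left, marked as the final piece
    return chunks, events


chunks, events = parse_in_tag_chunks(
    io.StringIO("<recipe><title>Simple bread</title></recipe>")
)
print(chunks)
```

The list of chunks shows the same sequence the text walks through: an opening tag on its own, then the next opening tag, then the data together with its closing tag.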

Now let's look at how this data is processed and how to obtain it.

Let's start with the opening tag function startElement($parser, $name, $attrs).
Let's assume that processing has reached the line
<ingredient amount="3" unit="cup">Flour</ingredient>
Then inside the function the variable $name will be equal to INGREDIENT, that is, the name of the opened tag folded to upper case (the closing tag has not been reached yet).
The $attrs array with the attributes of this tag will also be available; it will contain AMOUNT = "3" and UNIT = "cup" (attribute names are folded to upper case as well).

After this, the data of the opened tag is processed by the function data($parser, $data).
The $data variable will contain everything between the opening and closing tags, in our case the text Flour.

Processing of our line ends with the function endElement($parser, $name).
Here $name is the name of the closed tag, in our case again INGREDIENT.

After that, everything starts over for the next tag.

The above example only demonstrates the principle of XML processing; for real use it needs to be modified.
Typically, you have to parse large XML files in order to enter data into a database, and to process the data properly you need to know which open tag the data belongs to, what the nesting level of the tag is and which tags are open above it in the hierarchy. With this information, you can process the file correctly without any problems.
To do this, you need to introduce several global variables that will collect information about open tags, nesting and data.
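The bookkeeping described here (which tags are open, how deep we are, which data belongs to which tag) can be sketched compactly. The sketch below is in Python with xml.parsers.expat and keeps the state in local lists instead of global variables; that substitution, the function name collect and the record layout are assumptions made for illustration.

```python
import xml.parsers.expat


def collect(xml_text):
    """Return one (depth, path, attrs, text) record per closed tag."""
    open_tags = []     # chain of currently open tags, outermost first
    attrs_stack = []   # attributes of each open tag, same indexing
    text_parts = []    # text pieces collected for each open tag
    records = []

    def start(name, attrs):
        open_tags.append(name)
        attrs_stack.append(attrs)
        text_parts.append([])

    def data(chunk):
        # text always belongs to the innermost open tag
        text_parts[-1].append(chunk)

    def end(name):
        text = "".join(text_parts.pop()).strip()
        attrs = attrs_stack.pop()
        depth = len(open_tags)
        path = "/".join(open_tags)
        open_tags.pop()
        # here the record could be written to a database or a file
        records.append((depth, path, attrs, text))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = data
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return records


records = collect(
    '<recipe><title>Simple bread</title>'
    '<ingredient amount="3" unit="cup">Flour</ingredient></recipe>'
)
for record in records:
    print(record)
```

Each record is released as soon as its tag closes, so memory use depends on the nesting depth rather than on the size of the file.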
Here's an example you can use

Function webi_xml ($file)
{
global $webi_depth ; // counter to track nesting depth
$webi_depth = 0 ;
global $webi_tag_open ; // will contain an array of currently open tags
$webi_tag_open = array();
global $webi_data_temp ; // this array will contain the data of one tag

####################################################
### data function
function data ($parser, $data)
{
global $webi_depth ;
global $webi_tag_open ;
global $webi_data_temp ;
// add data to the array indicating nesting and currently open tag
$webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "data" ].= $data ;
}
############################################

####################################################
### opening tag function
function startElement ($parser, $name, $attrs)
{
global $webi_depth ;
global $webi_tag_open ;
global $webi_data_temp ;

// if the nesting level is no longer zero, then one tag is already open
// and the data from it is already in the array, you can process it
if ($webi_depth)
{




" ;

print "
" ;
print_r($webi_tag_open); // array of open tags
print "


" ;

// after processing the data, delete it to free up memory
unset($GLOBALS [ "webi_data_temp" ][ $webi_depth ]);
}

// now the next tag is opened and further processing will occur in the next step
$webi_depth++; // increase nesting

$webi_tag_open [ $webi_depth ]= $name ; // add an open tag to the information array
$webi_data_temp [ $webi_depth ][ $name ][ "attrs" ]= $attrs ; // now add tag attributes

}
###############################################

#################################################
## closing tag function
function endElement ($parser, $name) (
global $webi_depth ;
global $webi_tag_open ;
global $webi_data_temp ;

// data processing begins here, for example adding to the database, saving to a file, etc.
// $webi_tag_open contains a chain of open tags by nesting level
// for example $webi_tag_open[$webi_depth] contains the name of the open tag whose information is currently being processed
// $webi_depth tag nesting level
// $webi_data_temp[$webi_depth][$webi_tag_open[$webi_depth]]["attrs"] array of tag attributes
// $webi_data_temp[$webi_depth][$webi_tag_open[$webi_depth]]["data"] tag data

Print "data" . $webi_tag_open [ $webi_depth ]. "--" .($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "data" ]). "
" ;
print_r ($webi_data_temp [ $webi_depth ][ $webi_tag_open [ $webi_depth ]][ "attrs" ]);
print "
" ;
print_r($webi_tag_open);
print "


" ;

Unset($GLOBALS [ "webi_data_temp" ]); // after processing the data, we delete the entire array with the data, since the tag was closed
unset($GLOBALS [ "webi_tag_open" ][ $webi_depth ]); // delete information about this open tag... since it closed

$webi_depth --; // reduce nesting
}
############################################

$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);

// indicate which functions will work when opening and closing tags
xml_set_element_handler($xml_parser, "startElement", "endElement");

// specify a function for working with data
xml_set_character_data_handler($xml_parser, "data");

// open the file
$fp = fopen($file, "r");

$perviy_vxod = 1 ; // flag to check the first entry into the file
$data = "" ; // here we collect data from the file in parts and send it to the xml parser

// loop until the end of the file is found
while ($fp and !feof($fp))
{
$simvol = fgetc($fp); // read one character from the file
$data .= $simvol; // add this character to the data to be sent

// if the character does not end a tag, go back to the beginning of the loop and add another character to the data, and so on until the end of a tag is found
if ($simvol != ">") { continue; }
// the end of a tag was found, so now we send the collected data for processing

// if this is the first entry into the file, delete everything that comes before the <?xml tag,
// since sometimes there is garbage before the beginning of the XML (clumsy editors, or the file was received by a script from another server)
if ($perviy_vxod) { $data = strstr($data, "<?xml"); $perviy_vxod = 0; }

// now throw the data into the xml parser
if (!xml_parse($xml_parser, $data, feof($fp))) {

// here you can process and receive validity errors...
// as soon as an error is encountered, parsing stops
echo "\nXML Error: " . xml_error_string(xml_get_error_code($xml_parser));
echo " at line " . xml_get_current_line_number($xml_parser);
break;
}

// after parsing, discard the collected data for the next step of the cycle.
$data = "" ;
}
fclose($fp);
xml_parser_free($xml_parser);
// removing global variables
unset($GLOBALS["webi_depth"]);
unset($GLOBALS["webi_tag_open"]);
unset($GLOBALS["webi_data_temp"]);
}

webi_xml("1.xml");

?>

The entire example is commented, so now test it and experiment.
Please note that in the data-handling function, data is not simply assigned to the array element but appended to it with ".=". The text of one tag may not arrive all at once, and with a plain assignment you would from time to time end up with only a fragment of the data.
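The point about ".=" can be shown with a minimal sketch. It assumes a standalone parser and a hypothetical $text variable (neither is part of the script above), and it feeds one document to xml_parse in two pieces, so the text of a single tag may reach the handler in more than one call:

```php
<?php
// Hypothetical accumulator; the script above uses $webi_data_temp instead.
$text = '';

$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($parser, $chunk) use (&$text) {
        // Append rather than assign: the parser may deliver the text
        // of one tag across several calls to this handler.
        $text .= $chunk;
    }
);

// Feed the document in two pieces, splitting the text in the middle.
xml_parse($parser, '<msg>Hello, ', false);
xml_parse($parser, 'world</msg>', true);

echo $text, "\n";
```

With a plain assignment in the handler, $text would hold only the piece delivered by the last handler call.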

That solves the memory problem: the script can now process a file of any size. Large files take longer to parse, though, so you may also need to raise the script's execution time limit, which can be done in several ways.
Call this function at the beginning of the script:
set_time_limit(6000);
or
ini_set ("max_execution_time" , "6000" );

Or add this line to the .htaccess file:
php_value max_execution_time 6000

Each of these examples raises the script's time limit to 6000 seconds.
The limit can be raised this way only when safe mode is turned off (safe mode was removed in PHP 5.4, so this restriction applies only to older installations).

If you can edit php.ini, you can raise the limit there:
max_execution_time = 6000

Some hosting providers forbid this. On Masterhost, for example, at the time of writing, raising the script time limit is prohibited even though safe mode is turned off. If you are a pro, you can make your own PHP build on Masterhost, but that is not the subject of this article.
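Since a host may refuse the change, it can help to check whether the new limit actually took effect. The following is only a sketch: the variable names are illustrative, and it assumes set_time_limit has not been disabled outright on the server.

```php
<?php
// Ask for a 6000-second limit; set_time_limit() returns false
// when the environment does not allow the change.
$changed = set_time_limit(6000);

// Read back the value that is actually in effect.
$effective = ini_get('max_execution_time');

echo $changed
    ? "Limit raised, now {$effective} seconds\n"
    : "Limit not changed, still {$effective} seconds\n";
```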

Now we will study working with XML. XML is a format for exchanging data between sites. It is very similar to HTML, but XML allows its own tags and attributes.

Why is XML needed for parsing? Sometimes the site you need to parse has an API that gives you what you want without much effort. So here is a piece of advice: before parsing a site, check whether it has an API.

What is an API? It is a set of functions with which you can send a request to the site and receive the desired response. This response most often comes in XML format. So let's start studying it.

Working with XML in PHP

Let's say you have XML. It can be in a string, or stored in a file, or served upon request to a specific URL.

Let the XML be stored in a string. In this case, you need to create an object from this string using new SimpleXMLElement:

$str = "<worker>
    <name>Kolya</name>
    <age>25</age>
    <salary>1000</salary>
</worker>";
$xml = new SimpleXMLElement($str);

Now the variable $xml holds an object with the parsed XML. By accessing the properties of this object, you can read the contents of the XML tags. We'll look at exactly how below.
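If the string is not well-formed XML, the SimpleXMLElement constructor throws an exception. A minimal sketch of handling that case follows; the broken string and the $failed flag are made up for illustration.

```php
<?php
// Collect libxml warnings internally so the failure is reported
// only through the exception.
libxml_use_internal_errors(true);

$broken = '<worker><name>Kolya</worker>'; // <name> is never closed
$failed = false;

try {
    $xml = new SimpleXMLElement($broken);
    echo $xml->name, "\n";
} catch (Exception $e) {
    // The constructor throws when the string cannot be parsed as XML.
    $failed = true;
    echo "Could not parse XML: ", $e->getMessage(), "\n";
}
```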

If the XML is stored in a file or returned when you access a URL (which is most often the case), use the function simplexml_load_file, which produces the same kind of $xml object:

<worker>
    <name>Kolya</name>
    <age>25</age>
    <salary>1000</salary>
</worker>

$xml = simplexml_load_file($pathOrUrl); // $pathOrUrl: path to the file or a URL
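Unlike the constructor, simplexml_load_file returns false on failure, so the result is worth checking before use. The sketch below writes a temporary file only to stay self-contained; the file contents and variable names are assumptions for illustration.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical file, created only to make the sketch self-contained.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<worker><name>Kolya</name><age>25</age></worker>');

$xml = simplexml_load_file($path);

if ($xml === false) {
    // On failure the function returns false instead of an object.
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
} else {
    echo $xml->name, "\n";
}

unlink($path);
```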

Working methods

In the examples below, our XML is stored in a file or URL.

Let the following XML be given:

<worker>
    <name>Kolya</name>
    <age>25</age>
    <salary>1000</salary>
</worker>

Let's get the employee's name, age and salary:

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
echo $xml->name; // will display "Kolya"
echo $xml->age; // will display 25
echo $xml->salary; // will display 1000

As you can see, the $xml object has properties corresponding to the tags.

You may have noticed that the <worker> tag does not appear anywhere in these calls. This is because it is the root tag. You can rename it, for example, to <root>, and nothing will change:

<root>
    <name>Kolya</name>
    <age>25</age>
    <salary>1000</salary>
</root>

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
echo $xml->name; // will display "Kolya"
echo $xml->age; // will display 25
echo $xml->salary; // will display 1000

There can only be one root tag in XML, just like the <html> tag in regular HTML.
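Although the root tag never appears in the property chain, its name can still be read with the getName method. A small sketch, with a made-up root tag <staff>:

```php
<?php
// <staff> is a hypothetical root tag chosen for this sketch.
$xml = new SimpleXMLElement('<staff><name>Kolya</name></staff>');

// The object itself represents the root element.
$rootName = $xml->getName();

echo $rootName, "\n";   // staff
echo $xml->name, "\n";  // Kolya
```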

Let's modify our XML a little:

<root>
    <worker>
        <name>Kolya</name>
        <age>25</age>
        <salary>1000</salary>
    </worker>
</root>

In this case, we will get a chain of calls:

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
echo $xml->worker->name; // will display "Kolya"
echo $xml->worker->age; // will display 25
echo $xml->worker->salary; // will display 1000

Working with attributes

Let some data be stored in attributes:

<root>
    <worker name="Kolya" age="25" salary="1000">Number 1</worker>
</root>

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
echo $xml->worker["name"]; // will display "Kolya"
echo $xml->worker["age"]; // will display 25
echo $xml->worker["salary"]; // will display 1000
echo $xml->worker; // will display "Number 1"
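Attributes and tag contents come back as SimpleXMLElement objects, not as plain strings or numbers. echo converts them automatically, but for strict comparisons it is safer to cast them first. A short sketch, with the XML inlined as a string to keep it self-contained:

```php
<?php
$xml = new SimpleXMLElement(
    '<root><worker name="Kolya" age="25" salary="1000">Number 1</worker></root>'
);

// Cast to plain PHP types before comparing or doing arithmetic.
$name = (string) $xml->worker['name'];
$age  = (int) $xml->worker['age'];
$text = (string) $xml->worker;

var_dump($xml->worker['age'] === '25'); // bool(false): an object, not a string
var_dump($age === 25);                  // bool(true)
var_dump($name === 'Kolya');            // bool(true)
```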

Tags with hyphens

XML allows tags (and attributes) with a hyphen. In this case, accessing such tags occurs like this:

<root>
    <worker>
        <first-name>Kolya</first-name>
        <last-name>Ivanov</last-name>
    </worker>
</root>

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
echo $xml->worker->{"first-name"}; // will display "Kolya"
echo $xml->worker->{"last-name"}; // will display "Ivanov"
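Hyphenated attributes need no special syntax, because attributes are already accessed with a quoted array key. A sketch, assuming a made-up data-id attribute:

```php
<?php
// The data-id attribute is hypothetical, added for illustration.
$xml = new SimpleXMLElement(
    '<root><worker data-id="7"><first-name>Kolya</first-name></worker></root>'
);

// Tags with a hyphen need the curly-brace syntax...
$firstName = (string) $xml->worker->{'first-name'};

// ...while attributes with a hyphen are read like any other attribute.
$id = (string) $xml->worker['data-id'];

echo $firstName, " ", $id, "\n";
```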

Looping

Let us now have not one employee, but several. In this case, we can iterate over our object using a foreach loop:

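<root>
    <worker>
        <name>Kolya</name>
        <age>25</age>
        <salary>1000</salary>
    </worker>
    <worker>
        <name>Vasya</name>
        <age>26</age>
        <salary>2000</salary>
    </worker>
    <worker>
        <name>Petya</name>
        <age>27</age>
        <salary>3000</salary>
    </worker>
</root>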

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
foreach ($xml as $worker) {
    echo $worker->name; // will display "Kolya", "Vasya", "Petya"
}
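Repeated tags can also be looped over by name, counted, and accessed by index. The sketch below collects the names into an ordinary array; the XML is inlined and shortened to the name tag only.

```php
<?php
$xml = new SimpleXMLElement(
    '<root>'
    . '<worker><name>Kolya</name></worker>'
    . '<worker><name>Vasya</name></worker>'
    . '<worker><name>Petya</name></worker>'
    . '</root>'
);

// Iterate over the <worker> tags only and collect plain strings.
$names = [];
foreach ($xml->worker as $worker) {
    $names[] = (string) $worker->name;
}

$total  = count($xml->worker);            // number of <worker> tags
$second = (string) $xml->worker[1]->name; // zero-based index

echo implode(', ', $names), "\n";
```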

From object to normal array

If you're not comfortable working with the object, you can convert it to a normal PHP array using the following trick:

$xml = simplexml_load_file($pathOrUrl); // path to the file or a URL
var_dump(json_decode(json_encode($xml), true));
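One thing to keep in mind with this trick is that the shape of the array depends on how many times a tag is repeated. The sketch below compares a document with one <worker> tag against one with two; the variable names are illustrative.

```php
<?php
$one = new SimpleXMLElement(
    '<root><worker><name>Kolya</name></worker></root>'
);
$many = new SimpleXMLElement(
    '<root><worker><name>Kolya</name></worker>'
    . '<worker><name>Vasya</name></worker></root>'
);

$oneArr  = json_decode(json_encode($one), true);
$manyArr = json_decode(json_encode($many), true);

// A single <worker> becomes an associative array...
echo $oneArr['worker']['name'], "\n";
// ...while repeated <worker> tags become a numerically indexed list.
echo $manyArr['worker'][1]['name'], "\n";
```

Code that handles both cases has to check which shape it received.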

More information

Parsing based on sitemap.xml

Often a site has a sitemap.xml file. This file stores links to all pages of the site for ease of indexing by search engines (indexing is essentially site parsing by Yandex and Google).

We need not worry much about why this file exists. The main thing is that if it does, you don't have to crawl the pages of the site using tricky methods; you can simply use this file.

How to check whether the file exists: suppose we are parsing the site site.ru. Open site.ru/sitemap.xml in the browser. If you see something, the file is there; if not, then alas, it is not.

If there is a sitemap, it contains links to all pages of the site in XML format. Simply take this XML, parse it, and pick out the links to the pages you need in any way convenient for you (for example, by analyzing the URL, as described in the spider method).

As a result, you get a list of links for parsing; all you have to do is go to them and parse the content you need.
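The steps above can be sketched as follows. The sitemap contents, the site.ru URLs and the /catalog/ filter are all assumptions made up for illustration; in practice the XML would be loaded from the site's own sitemap.xml with simplexml_load_file.

```php
<?php
// Hypothetical sitemap contents, inlined to keep the sketch self-contained.
$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://site.ru/</loc></url>
  <url><loc>http://site.ru/catalog/item-1</loc></url>
  <url><loc>http://site.ru/about</loc></url>
  <url><loc>http://site.ru/catalog/item-2</loc></url>
</urlset>
XML;

$xml = simplexml_load_string($sitemap);

// Keep only the links whose URL contains the section we want to parse.
$links = [];
foreach ($xml->url as $url) {
    $loc = (string) $url->loc;
    if (strpos($loc, '/catalog/') !== false) {
        $links[] = $loc;
    }
}

print_r($links);
```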

Read more about the structure of sitemap.xml on Wikipedia.

What to do next:

Start solving the problems at the following link: problems for the lesson.

When you have solved them all, move on to studying a new topic.